Stop Hammering Every Problem as a Nail: Why GPT Isn't a Magic Bullet (yet)

April 27, 2024

When Large Language Models (LLMs) first shot to fame with the launch of ChatGPT (powered by GPT-3.5) in November 2022, I was cautiously optimistic that they would make traditional software development redundant. Even though I make my living designing and writing software, I was hoping to see entire stacks of convoluted architectures be replaced by a single ML model running on simple prompts. At first, I honestly believed that these models could be provided with an Agile user story, and they would take care of the rest.

As a database administrator, I want to automatically merge datasets from different sources so that I can more easily create reports for my internal customers.

To learn if that was indeed the case, I decided to pursue a short online course on prompt engineering, which I followed up with a multi-week deep dive into LLMs. The courses had a lot to offer, but the biggest takeaway was how powerful these models were out of the box and how one could improve their accuracy by providing a single example as part of the prompt. This technique, known as one-shot prompting, seemed in line with my expectations of turning Agile user stories into real-world applications.
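
To make the idea concrete, here is a minimal sketch of how such a prompt can be assembled (the helper name and instruction wording are my own; the worked example is the same one I reuse in the counting prompt later in this post, and the new input is truncated for brevity):

def build_one_shot_prompt(instruction, example_input, example_output, new_input):
    # One-shot prompting: embed exactly one demonstration of the task in the
    # prompt before asking the model to complete the same task for a new input.
    return (
        f"{instruction}\n\n"
        f"Input: {example_input}\n"
        f"Output: {example_output}\n\n"
        f"Input: {new_input}\n"
        f"Output:"
    )

prompt = build_one_shot_prompt(
    instruction="Count the special characters in the text and return JSON only.",
    example_input="Hi, I am Deepak.",
    example_output='{",": 1, " ": 3, ".": 1}',
    new_input="But an unexpected circumstance occurred; ...",
)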

With these amazing tools and techniques under my belt, I started looking for opportunities to replace existing ways of doing things with LLMs. I found that (unsurprisingly) GPT did great with classification and sentiment analysis tasks like the prompt below.

Prompt:

Did the sentiment turn negative or positive in the provided text? Answer in one word.

But an unexpected circumstance occurred; the detonation produced by the Columbiad had the immediate effect of disturbing the terrestrial atmosphere, where an enormous quantity of vapor accumulated. This phenomenon excited general indignation, for the moon was hidden during several nights from the eyes of her contemplators.


GPT’s response:

Negative

Classification isn't something I use traditional software engineering for. At the same time, I was excited that this classification task, which would previously have required experts to set up, was now available to everyone in the world. However, as soon as I asked questions that involved reasoning, or had a larger output space, GPT (even with one-shot prompting) began to struggle with its responses, as you can see in the example below.

Prompt:

Analyze this text to get the count of special characters in it. A special character is any character not an alphabet or a number. Return the response in a JSON format only where the special character is the key and the count is the value. I have provided one example for your reference.

Input: Hi, I am Deepak.

Output: {",": 1, " ": 3, ".": 1}

Input: But an unexpected circumstance occurred; the detonation produced by the Columbiad had the immediate effect of disturbing the terrestrial atmosphere, where an enormous quantity of vapour accumulated. This phenomenon excited general indignation, for the moon was hidden during several nights from the eyes of her contemplators.

Output:


GPT’s response:

{ ";": 2, ",": 2, " ": 39, ".": 4, "-": 1 }

As you can see, there are (at least) two issues with the response. First, it got the count wrong for the space character. Second, it produced a count for a dash character, which is not present in the text at all. As an end-user of GPT, I cannot say why it produced this output. Was it because it misunderstood my prompt, or because its reasoning abilities are flawed? I had to read my prompt several times just to verify that there was indeed no dash character. This phenomenon, where an LLM produces a response that is factually inaccurate, is known as 'hallucination'.
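
For anyone who wants to double-check the response the way I did, a few lines of standard-library Python give the ground truth (a minimal sketch; the variable names are mine, and the text is the same passage from the prompt above):

from collections import Counter

text = ("But an unexpected circumstance occurred; the detonation produced by the "
        "Columbiad had the immediate effect of disturbing the terrestrial atmosphere, "
        "where an enormous quantity of vapour accumulated. This phenomenon excited "
        "general indignation, for the moon was hidden during several nights from the "
        "eyes of her contemplators.")

# Count every character that is not a letter or a digit.
counts = Counter(ch for ch in text if not ch.isalnum())
print(counts)          # the actual special-character counts
print("-" in counts)   # False: there is no dash in the passage for GPT to count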

This put me off, as being able to correctly understand, reason about, and solve problems such as these forms the basis of software engineering. I wondered: if it could not solve even this, what chance would it stand at translating a Product Manager's Agile stories? In addition, I was concerned by the data privacy policies of GPT models. OpenAI was briefly banned in Italy and is currently being sued by the New York Times. There is a similar ongoing lawsuit against Google as well. Lastly, OpenAI's uptime numbers do not paint a rosy picture. At the time of this blog's publication, their uptime for the last 90 days was 99.74%, which translates to roughly a day of downtime every year, a far cry from the four or five nines of uptime that leading cloud and financial providers offer.
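
For a rough back-of-the-envelope check of that comparison (a minimal sketch; the 99.99% and 99.999% figures are simply the standard "four nines" and "five nines" levels):

# Convert an availability percentage into expected downtime per year
# (assumes a 365-day year; leap days are ignored for simplicity).
for availability in (0.9974, 0.9999, 0.99999):
    downtime_hours = (1 - availability) * 365 * 24
    print(f"{availability:.3%} uptime -> ~{downtime_hours:.1f} hours of downtime per year")

At 99.74%, that works out to roughly 23 hours of downtime a year, which is where the "about a day" figure comes from, while four and five nines allow for under an hour and under six minutes respectively.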

Once I realized that GPT in its current form is limited to performing simple tasks, I thought maybe I could break down the problem. One idea was to have it write a program for me. This way, I could partially address the data security issue while completely eliminating the uptime challenges. So, I asked it to help me write a Python program for the same problem as before.

Prompt:

Write a python script to parse text to get the count of special characters in it. A special character is any character not an alphabet or a number. The script should output a dictionary with the key being the special character and the count being the value.


GPT’s response:

def count_special_characters(text):
    special_chars_count = {}

    # Iterate through each character in the text
    for char in text:
        # Check if the character is a special character (not alphanumeric)
        if not char.isalnum():
            # Increment the count of this special character in the dictionary
            if char in special_chars_count:
                special_chars_count[char] += 1
            else:
                special_chars_count[char] = 1

    return special_chars_count

Surprisingly, GPT did a great job with this! This put a whole different spin on using GPT. After this, I started viewing it as a developer productivity tool rather than an ML model running in a production environment. A new concern emerged about trusting the generated code, but I soon realized that existing techniques such as TDD could be an effective way to ensure its correctness.
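
As a concrete illustration of that idea, the worked example from my earlier prompt already makes a reasonable first test case (a minimal sketch; the module name and the second test are my own assumptions, not part of GPT's output):

# test_count_special_characters.py -- run with pytest; assumes the generated
# function was saved in a module named count_special_characters.py.
from count_special_characters import count_special_characters

def test_example_from_prompt():
    # Expected output taken verbatim from the one-shot example in the prompt.
    assert count_special_characters("Hi, I am Deepak.") == {",": 1, " ": 3, ".": 1}

def test_alphanumeric_only_text_has_no_special_characters():
    assert count_special_characters("Deepak123") == {}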

This approach works great for simple use cases, but oftentimes (including when writing this blog), it takes me longer to come up with prompts for GPT to do a task (programming or otherwise) than it would to just do the task myself. In addition to the problem of hallucination, the other critical problem I find is the limited bandwidth of the text-based interface between GPT and humans, which results in a slow feedback cycle. I personally thrive when I can quickly iterate through a number of ideas. Unfortunately, given how slow these models are, I find that my productivity takes a dip when using GPT. So, after months of constant trial and error, I have set some personal guidelines on using GPT:

Use it for:

  • Translations between popular languages (natural and/or programming languages)
  • Repetitive classification and entity extraction tasks
  • Writing simple computer programs and scripts
  • Generating templates for documents
  • Finding flaws in my ideas

Don’t use it for:

  • Creating drafts of documents, emails, etc., as writing the first draft myself is key to authenticity
  • Asking for advice in technical areas outside my expertise (I have no way of confirming whether it is hallucinating or not)
  • Writing complex computer programs
  • Any other one-time tasks which take only a few minutes manually

I hope the takeaway from this blog is not “human good, GPT bad” but that one should try to use the right tool for the right job. I understand that if the only tool you have is a hammer, you tend to see every problem as a nail. But if you fight that urge just a little bit, you might find other tools in your toolbox that can solve your problem, or other problems where your hammer truly fits. I am proud to announce that this forms the foundation of Hoistr's design as well. We are building a well-tested system that combines ML models with time-tested software engineering practices.

Stay tuned to know more!

Deepak

Co-founder, Hoistr
