Jason Stitt

The flaw holding LLMs back

“ChatGPT can make mistakes. Consider checking important information.”

This concise suggestion is what the disclaimer at the bottom of ChatGPT has been simplified to, but it feels like volumes are contained within that one simple statement. LLMs have this one stupid trick up their sleeves (users hate it). It’s called “hallucination.” And it’s tenacious.

In February, a court found that Air Canada had to honor incorrect information its chatbot had given a customer about the airline’s policies. The airline’s legal defense hinged on the supposed autonomy of the chatbot, which the court (sensibly) rejected, saying that the chatbot was part of the airline’s website. The chatbot’s response, however, directly contradicted what other parts of that same website said.

Although I haven’t yet found an article that reveals what specific model Air Canada’s chatbot used, or whether it was something custom and in-house, the original incident happened in November 2022, which was relatively early in the development of LLMs. For reference, that was before GPT-3.5 was publicly released. I asked ChatGPT, which searched the web and summarized an article that, appropriately, didn’t actually contain the answer.

Then, in March, a New York City government chatbot was found to be giving incorrect answers, including answers about evictions, firing employees, tips, and bribery of public officials that were… the opposite of true. (More in this thread.) This was on a site intended specifically to give the public authoritative information and increase accessibility of legal knowledge.

According to the MyCity FAQ, “The virtual assistant chatbot, powered by Microsoft’s Azure AI technology and OpenAI’s ChatGPT chatbot framework, is a beta program that will continue to improve to better serve the needs of visitors to the MyCity portal.” I can’t find a disclosure of whether GPT-3.5 or GPT-4 is used (GPT-3.5 is the more likely choice for a public bot, for cost reasons), or of how a content library is used.

… should we still do chatbots?

At this point, I’d be nervous about releasing a new public chatbot. It’s not that it’s a bad idea, exactly. But what initially seemed like cool future-tech that just needed an “experimental” disclaimer at the bottom now looks like something that will keep outputting at least some incorrect information for the foreseeable future, not just during a brief experimental phase while the kinks are worked out.

That doesn’t mean we should stop trying to offer chatbots entirely — just that certain issues are persistent, not temporary, and need more than a disclaimer. The real problem here is that it looks like any original content generated by an LLM needs to be double-checked, and that adds friction to solution design.

The reason LLMs are so great is that one system can do many things you used to need separate systems for. Many of those things can be done in other ways, or could already be done before LLMs came out. And yet LLMs have caused a small revolution. Why? Because they make almost everything they can do easier and more accessible. Accessibility is everything when it comes to adoption, and adoption is everything when it comes to innovation, because tech matters more when it’s actually used.

But if we can’t foist the output of LLMs directly on the unsuspecting public, that creates a big problem for solution design: whatever has to sit between the LLM and the public is a bit of a question mark, and it also makes these solutions less accessible to build.

Most and least affected uses

“Generative AI” is an interesting term because people have really focused on the use cases that are the most “generative,” like writing content from scratch or responding to user queries as a chatbot.

But LLMs do a lot of different things – such as extraction, summarization, translation, and so on – and some of them are more prone to hallucinations than others.

To over-generalize, the bigger the outputs are compared to the inputs, the more I would expect something to be “made up,” i.e. hallucinated.

Write a detailed response to a simple question? More likely to make something up.

Write a short summary of a long article? Less likely to make something up.

This input-vs-output volume rule is a super rough rule of thumb, and I’m not trying to substitute it for more rigorous analysis. But having done a lot of summarization with LLMs so far, I’ve consistently found the errors, when they do occur, to be misunderstandings of phrasing more than outright inventions. Asking simple questions, on the other hand, routinely generates information that needs to be fact-checked, code that almost but doesn’t quite run, and so on.
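To make that rule of thumb concrete, here is a minimal sketch of the two extremes. The call_llm helper and the placeholder article text are stand-ins rather than a real API; wire the helper to whatever chat-completion client you actually use.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; wire this to your client of choice."""
    raise NotImplementedError


article_text = "(full text of a long article goes here)"  # placeholder input

# Lots of input, small output: the model mostly compresses what it was given,
# so the errors that do occur tend to be misreadings of phrasing.
summary = call_llm(f"Summarize this article in three sentences:\n\n{article_text}")

# Tiny input, large output: the model has to fill the gap from its own weights,
# which is where made-up details are most likely to appear.
essay = call_llm("Write a detailed explanation of the airline's refund policy.")
```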

How this affects my coding

I now use some combination of GitHub Copilot, GPT-4, and other models such as code-bison for part of most new code I write. I’ve found them to be well worth the time and don’t mean to criticize them overall. However, there are some important caveats.

I’ve found that the ability to use these tools well is highly dependent on the ability to read and evaluate code quickly. They do make mistakes, and you do have to check on what they do. Sometimes I’ll have a model retry several times on the same function. Even doing this, it’s still often worth my time, although sometimes I’ll say “forget it” and just write the thing myself. This process relies on me (a) reading the code that the LLM generates, not just taking it as-is, and (b) being able to evaluate and accept or reject it in a short enough time that the method is still cost-effective overall.

Code reading is a learned skill, and I’ve noticed that it isn’t always something that developers get a lot of training on in traditional coding projects. In many cases, I’ve found that developers understand their own code a lot better than anyone else’s, and that furthermore, their understanding of their own code isn’t based on being able to read it better, but on remembering the process of writing it. In other words, once they forget writing the code, their understanding of their own code declines. Therefore, reading code that they didn’t write and understanding it quickly is difficult.

Another observation: “copilot” is a very appropriate name. Copilot produces much better results when the context is set up properly. For example, if you have an import statement for the module you’re about to use, an unused result just above your cursor, and you start an assignment to a well-named variable, Copilot correctly gets what needs to happen a startlingly large amount of the time. But if your context isn’t set up that well, it will often produce something random-looking.
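For instance, a setup like this (the data and variable names are made up) usually gives Copilot enough signal that its suggested completion for the last line is the statistics call you actually wanted; with less context, the suggestions get much more random-looking.

```python
import statistics

# Context Copilot can see: a relevant import, a value sitting just above the
# cursor, and a descriptively named variable being assigned.
response_times_ms = [120, 340, 95, 410, 230]  # hypothetical measurements

# Starting this assignment is usually enough for Copilot to suggest the call
# below; the completion is still something you read and accept or reject.
median_response_time_ms = statistics.median(response_times_ms)
```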

What do we put on top of LLMs?

If we don’t want to use LLMs only for internal use cases or put a huge disclaimer on them, there has to be something surrounding them to help deal with inaccuracy. But what is that thing, and how do we drop it into solutions without compromising cost and simplicity too much? Obviously, having every response human-checked works for only some solutions, and it adds cost.

A possible solution is found in self-reflective agents, which generate content with an LLM query, then use another LLM query to critique it, then use another one to apply the critique… and, possibly, so on. This approach shows early promise but comes with obvious drawbacks including speed and cost.
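As a rough sketch of the shape of that loop (call_llm is again a stand-in for your actual client, and the prompts are illustrative, not tuned):

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; wire this to your client of choice."""
    raise NotImplementedError


def self_reflective_answer(question: str, rounds: int = 1) -> str:
    # First query: generate a draft answer.
    draft = call_llm(f"Answer the question below.\n\nQuestion: {question}")
    for _ in range(rounds):
        # Second query: critique the draft for errors and unsupported claims.
        critique = call_llm(
            "Point out factual errors, unsupported claims, or contradictions "
            f"in this answer.\n\nQuestion: {question}\n\nAnswer: {draft}"
        )
        # Third query: apply the critique to produce a revised draft.
        draft = call_llm(
            "Rewrite the answer to fix the issues raised in the critique.\n\n"
            f"Question: {question}\n\nAnswer: {draft}\n\nCritique: {critique}"
        )
    return draft
```

Each round adds two more model calls on top of the original one, which is exactly where the speed and cost drawbacks come from.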

Another possibility is structured output: rather than directly exposing the unstructured text output of an LLM, coerce the results of natural language processing into structured data, then run that data through traditional quality checks before delivery. A related example is text-to-query systems, where you type a question and it gets translated into SQL or another query format. Although you still get a natural language interface, the LLM has no part in generating the results.
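A bare-bones sketch of the text-to-query variant might look like this. Here question_to_sql stands in for an LLM call that is prompted to emit a single SELECT statement against a known schema, and the guardrails are ordinary code rather than another model:

```python
import re
import sqlite3


def question_to_sql(question: str) -> str:
    """Stand-in for an LLM call prompted to return one SELECT statement."""
    raise NotImplementedError


def answer_from_database(question: str, db_path: str) -> list[tuple]:
    sql = question_to_sql(question).strip().rstrip(";")
    # The structured output gets traditional checks: read-only, single statement.
    if not re.fullmatch(r"select\b[^;]*", sql, flags=re.IGNORECASE):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```

The answer the user finally sees comes from the database, so the model can get the translation wrong, but it can’t invent rows.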

Whatever we do, I don’t think putting a disclaimer at the bottom of incorrect output will remain “cool” for very long.

© 2009-2024 Jason Stitt. These are my personal views and don't represent any past or present employer.