What LLMs cannot do

I’ve recently been reading papers on the limits of what LLMs can do (some of which were suggested to me in response to a request on Twitter; my thanks to everyone who responded!). There are some really interesting papers out there, all of which emphasise the core message that LLMs do not “think” like people do. We need to understand what LLMs can and cannot do on their own terms, and not treat them like pseudo-humans.

Anyways, since other people seem interested in the topic, I thought I’d mention a few of the more interesting papers I have looked at.

Stochastic Parrots

The classic paper on Stochastic Parrots by Bender et al is quite abstract, but makes some excellent general points about language models, such as the quote below. Well worth reading!

Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.

Bender et al 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proc of FAccT.

Children vs LLMs

Yiu et al 2023 is a really interesting paper on things that children can do but LLMs cannot. Again, it’s well worth reading. One thing that particularly struck me was a discussion of the “blicket detector task”, where young children are asked to figure out which combinations of objects will make a machine light up and play music. Yiu et al point out that children do this task much better than language models:

LLMs produce the correct text in cases such as causal vignettes, in which the patterns are available in the training data, but often fail when they are asked to make inferences that involve novel events or relations in human thought (e.g., Binz & Schulz, 2023; Mahowald et al., 2023), sometimes even when these involve superficially slight changes to the training data (e.g., Ullman, 2023).

Yiu et al (2023). Transmission Versus Truth, Imitation Versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet). Perspectives on Psychological Science.

What I found especially interesting was a comment later in the paper that sometimes LLMs could solve a blicket problem by finding the solution in a published psychology research paper. I.e., children solve blicket problems by reasoning about the task; LLMs solve blicket problems by searching for published solutions which are available on the internet.

Two very different approaches to problem-solving! And this makes sense if we think that humans are good at reasoning but poor at extracting solutions from terabytes of internet content, while LLMs struggle with many types of reasoning but are superb at extracting relevant content from the internet.
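To make the contrast concrete, here is a toy sketch of what “reasoning about the task” might look like: eliminate candidate rules that are inconsistent with the observed trials. The objects and trials below are my own invented example, not the actual stimuli from Yiu et al or the blicket literature.

from itertools import combinations

# Hypothetical blicket-detector trials (invented for illustration):
# each trial is (set of objects placed on the machine, did it activate?).
trials = [
    ({"A"}, False),
    ({"B"}, False),
    ({"A", "B"}, True),
    ({"A", "C"}, False),
]
objects = {"A", "B", "C"}

# Candidate rules of the form "the machine activates iff every object in
# some specific subset is present on it".
def consistent(rule):
    return all((rule <= placed) == activated for placed, activated in trials)

candidates = [set(c) for r in range(1, len(objects) + 1)
              for c in combinations(sorted(objects), r)]
surviving = [rule for rule in candidates if consistent(rule)]

print(surviving)  # e.g. [{'A', 'B'}] -- the only rule consistent with the trials

The point is not that children literally run this search, but that they infer the rule from the evidence in front of them, rather than retrieving a previously published answer.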

Embers of Autoregression

McCoy et al 2023 is a fascinating paper which points out that LLMs do better at high-probability tasks with high-probability inputs and outputs. I was especially struck by their example of asking GPT-4 to count the letters in a token; they showed that GPT-4 was much more likely to give the correct answer for “iiiiiiiiiiiiiiiiiiiiiiiiiiiiii” than for “iiiiiiiiiiiiiiiiiiiiiiiiiiiii”. This is because “iiiiiiiiiiiiiiiiiiiiiiiiiiiiii” has 30 letters and “iiiiiiiiiiiiiiiiiiiiiiiiiiiii” has 29 letters, and “30” is a much more common token than “29” on the internet.

This seems bizarre if we assume that the “count the letters” task is solved by the kind of counting algorithm that a person would use. But of course GPT does not think like a person! McCoy et al give many other thought-provoking examples, which make it clear that we cannot predict GPT’s behaviour by extrapolating from our experiences with human problem-solving.
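For comparison, here is a trivial sketch (my own illustration, not McCoy et al’s code) of the deterministic counting procedure a person or a conventional program would use; its answer is completely insensitive to how common the answer happens to be as text on the internet.

# A deterministic counting procedure gives the same (correct) answer no
# matter how frequent that answer is as a token in web text.
def count_letters(s: str) -> int:
    return len(s)

print(count_letters("i" * 30))  # 30
print(count_letters("i" * 29))  # 29

# An LLM, by contrast, is predicting probable answer *text*, so (per McCoy
# et al) it is more likely to output a frequent token like "30" than a rarer
# one like "29", even when "29" is the correct count.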

Our experiments highlight two scenarios where AI practitioners should be careful about using LLMs. First, we have shown that LLMs perform worse on rare tasks than on common ones, so we should be cautious about applying them to tasks that are rare in pretraining data. Second, we have shown that LLMs perform worse on examples with low-probability answers than ones with high-probability answers, so we should be careful about using LLMs in situations where they might need to produce low-probability text.

McCoy et al 2023. Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve. arXiv.

GAIA Benchmark

Mialon et al 2023 propose a new benchmark, GAIA, which is specifically designed around real-world test cases that are relatively easy for humans but hard for LLMs. For example:

Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina’s 2010 paper eventually deposited? Just give me the city name without abbreviations.

Mialon et al 2023. GAIA: A Benchmark for General AI Assistants. arXiv.

Humans can easily answer this because a few minutes of web search will reveal that the title of Nedoshivina 2010 is “A catalogue of the type specimens of the Tortricidae described by V.I. Kuznetsov from Vietnam and deposited in the Zoological Institute, St. Petersburg.” However, LLMs struggle to answer this question.

The goal of the paper is to propose a new way to evaluate LLMs, but GAIA also gives insights into which tasks are hard for LLMs.
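One attraction of questions with short, unambiguous answers is that they can be scored automatically. The sketch below shows the kind of simple string-match scoring this enables; the normalisation and the example answers are my own assumptions for illustration, not GAIA’s actual scoring code.

# Minimal sketch of scoring short free-text answers by normalised string match.
def normalise(answer: str) -> str:
    return " ".join(answer.strip().lower().split())

def score(model_answer: str, gold_answer: str) -> bool:
    return normalise(model_answer) == normalise(gold_answer)

print(score("  Saint Petersburg ", "Saint Petersburg"))  # True
print(score("St. Petersburg", "Saint Petersburg"))       # False -- abbreviation rejected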

Final Thoughts

The overall message of all of the above papers is that LLMs in no sense “think” like people do. Which is exciting, because it means that human+LLM teams should be able to do things which humans alone (or indeed LLMs alone) cannot do! And I’ve certainly seen many times (including in medical contexts) that a human+LLM combo can do a better job than a human (or an LLM) on its own.

I don’t think this point is controversial, but it often seems to be ignored in practice, not least when people start talking about LLMs being “human-like” or indeed “superhuman”. We need to stop comparing LLMs to humans, and instead understand and evaluate what they can and cannot do, without assuming that their abilities in any way correlate with human abilities.
