
Comparing performance of LLMs is not very interesting

I recently wrote some posts on LinkedIn (etc) saying that comparing the performance of different LLMs in a research paper was not very interesting or useful. I got many questions about this, including from my own students, so I thought I would try to explain what I mean in more depth in this blog.

NLP papers are full of tables that compare the performance of different LLMs on a task; this has become part of the “standard” way of writing NLP papers. But by the time a paper is published and read, the LLMs it gives data on will probably be obsolete, replaced by newer LLMs. So these tables are not very interesting. Why should anyone care whether an obsolete version of GPT is better or worse than an obsolete version of Gemini?

It is absolutely worth experimenting across multiple LLMs to look for shared behaviours! If all of the LLMs examined have similar behaviour or problems, then the behaviour may be generic rather than an idiosyncrasy of one LLM, which makes it much more interesting. Looking at the highest score on a task across a set of LLMs can also be useful, because it tells us how close the problem is to being solved. But comparing scores between individual obsolete LLMs is not interesting.

In other words

  • Not interesting: Obsolete LLM1 gets X% higher metric score than obsolete LLM2.
  • Interesting: A set of obsolete LLMs have a maximum performance of Y on a task. If Y is high, then the problem is solved. If it is low, then more work is needed.
  • Very interesting: A set of obsolete LLMs have the same problems or failure modes; this may indicate a fundamental limitation of current LLM technology.
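The middle point can be made concrete with a minimal sketch. The model names and scores below are entirely made up for illustration; the point is that the durable summary is the maximum, not the ranking:

```python
# Illustrative only: hypothetical model names and invented scores, not real results.
scores = {"llm_a": 0.41, "llm_b": 0.38, "llm_c": 0.45}

# The uninteresting comparison: ranking individual, soon-obsolete models.
ranking = sorted(scores, key=scores.get, reverse=True)

# The more durable summary: how close is the *best* current model to solving the task?
best_model = max(scores, key=scores.get)
best_score = scores[best_model]
print(f"Best score: {best_score:.2f} ({best_model})")
if best_score < 0.5:
    print("No current model does well, so the problem is far from solved.")
```

The ranking will be stale within months; the observation that no model clears a meaningful bar tends to stay informative for longer.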

Example

I realise the above is pretty abstract, so I will give a concrete example based on the paper “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study” (link), which is one of the best papers I have read recently. It looks at how well LLMs can answer health questions from the general public, based on scenarios (people were asked to role-play scenarios; they did not describe their actual health problems).

Anyway, from the perspective of this blog, we see all of the above in this paper. The paper looked at three LLMs: GPT-4o, Llama 3, and Command R+. At the time I am writing this blog, the latest models in these families are GPT 5.4, Llama 4, and Command A.

  • Not interesting: The paper does not emphasise this (one reason I like it), but, for example, it reports that users’ success at identifying at least one medical condition relevant to a scenario was 0.42–0.54 with GPT-4o, 0.39–0.50 with Llama 3, and 0.34–0.40 with Command R+. So GPT-4o did better than Command R+, but this does not mean that GPT 5.4 will be similarly better than Command A.
  • Interesting: What is more interesting is that none of the models did very well, and indeed the paper reports that users in a non-LLM control group (e.g., they used Google search to find health information) were more successful at this task, with a success rate of 0.55–0.67. This tells us that LLMs at the time the study was done were pretty bad at this task. Of course, it is possible that GPT 5.4 (etc) will do better at this task, but we cannot just assume this is the case! Claims that GPT 5.4 does much better at this task must be backed up by solid experimental data.
  • Very interesting: What is very interesting is that all of the LLMs did much better when they were given the scenario directly than when users tried to explain the scenario to the LLM. In short, the LLMs had enough medical knowledge to deal with the scenarios, but knowledge alone is not enough: communication skills are also important in patient-facing settings. This is well known to human doctors (GPs tell me that getting accurate information from patients is often the hardest part of diagnosis), but it is not something medical AI types talk about.
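A side note on reading the numbers quoted above: a quick sanity check on whether two reported score ranges even suggest a real difference is to see whether they overlap. This is only a rough reading aid, not a proper statistical test (the helper function below is mine, not from the paper):

```python
def ranges_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Success-rate ranges as quoted in the discussion above.
gpt4o = (0.42, 0.54)
llama3 = (0.39, 0.50)
command_rplus = (0.34, 0.40)
control = (0.55, 0.67)  # non-LLM control group (e.g. Google search)

print(ranges_overlap(gpt4o, llama3))   # → True: overlapping ranges, weak evidence of a difference
print(ranges_overlap(gpt4o, control))  # → False: the control group range is strictly higher
```

The control group sitting strictly above every LLM range is exactly the kind of finding that outlives any particular model version.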

Actually, the most interesting finding of this paper for me personally was about methodology. It showed that you need to do high-quality experiments with real people to understand how effective an LLM is; you cannot rely on artificial tasks or LLM simulations of users (which the authors also tried).

Quantitative vs qualitative insights

The NLP community has a culture of focusing on performance benchmarks, showing that a new system is quantitatively better at some benchmark than an existing state-of-the-art system. But when NLP systems are based on rapidly evolving LLMs, quantitative performance is constantly changing, and claims based on quantitative performance will only be valid for a few months.

This applies to prompts as well as LLMs. I see many papers that effectively claim that prompt A is better than prompt B, demonstrated by comparing the performance of LLMs with these two prompts. But if GPT-4o does slightly better on a task with prompt A than with prompt B, will the same be true of GPT 5.4? Perhaps, but we cannot assume so.

And of course many NLP benchmarks do not mean much (blog), and many NLP experiments are flawed and hence not reliable (blog).

Qualitative insights about what technology can and cannot do, problems and issues, user requirements, etc. hold their value much longer, and at least to me they are much more interesting. Indeed, I think we need more qualitative evaluation (blog) as well as quantitative insights.

Final thoughts

When my students and others ask me about this topic, I say that

  • Investigating multiple LLMs on a task is very valuable if this leads to insights about what LLMs as a group (at least at this point in time) can and cannot do, and what problems and “failure modes” they have.
  • However, quantitative comparisons of individual LLMs are not very useful, because they go out of date very quickly.
  • Qualitative insights are more useful and remain valid for longer than quantitative comparisons.
