I recently wrote a blog complaining that LLM benchmarks do a bad job of assessing NLG. I got a lot of feedback and comments on it, which highlighted that there are many problems with LLM benchmarks and benchmark suites.
Below I describe what I would like to see in good benchmarks and benchmark suites. Note that this blog is only about automatic benchmarks; I do not discuss human evaluation such as Chatbot Arena. The criteria discussed below also do not address worst-case performance or safety issues.
Properties of good benchmarks
I think a good LLM benchmark should have the following properties:
- Correct and trustworthy. We know that MMLU is quite buggy (Gema et al 2024), i.e. correct answers can be scored as incorrect and vice versa. Unfortunately, in my experience NLP researchers are often sloppy and do not care much about experimental rigour (blog), so I suspect other benchmarks are buggy too. A good benchmark should be rigorous and go through a careful quality-assurance process, so that it can be trusted.
- Test data not leaked (data contamination). For example, GSM8K has been available on the web for years (link), which means it is possible that current LLMs have been trained on it. (A minimal check for this kind of overlap is sketched after this list.)
- Challenging. If many LLMs score near-perfect on a benchmark, then it does not do a good job of distinguishing between them. Note that near-perfect does not necessarily mean 100%. For example, many models get close to 90% on MMLU, which may be the effective maximum score because of the above-mentioned bugs (i.e. some correct answers are scored as wrong). Benchmarks which are too easy include MMLU, GSM8K, and MATH (Glazer et al 2024).
- Modern evaluation techniques. A benchmark should not rely on outdated metrics such as BLEU, when there are better alternatives available.
- Replicable. Because of data contamination, test data should not be released, which means public replicability is not possible. But it should be possible for other sites (who agree to keep data confidential) to replicate a benchmark and get similar results.
- Clear scope. The benchmark should be accompanied by a clear and accurate statement of what capabilities it tests. For example, XSum should be described as testing the ability to generate the first sentence of a news article; it should not be described as a summarisation evaluation (blog).
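Data contamination can at least be screened for. Below is a minimal sketch of such a check, which flags benchmark items whose word n-grams overlap heavily with a sample of the training corpus. This is my own illustrative heuristic (the function names and the threshold are assumptions), not any benchmark's official decontamination procedure.

```python
# Minimal sketch of a data-contamination check: flag benchmark test items whose
# word n-grams overlap heavily with a sample of the training corpus.
# Illustrative heuristic only; names and threshold are my own assumptions.

from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(test_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training sample."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)


def flag_contaminated(test_items: List[str], training_docs: List[str],
                      n: int = 8, threshold: float = 0.5) -> List[int]:
    """Indices of test items whose n-gram overlap exceeds the threshold."""
    return [i for i, item in enumerate(test_items)
            if contamination_score(item, training_docs, n) >= threshold]


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    items = ["the quick brown fox jumps over the lazy dog near the river bank",
             "an entirely unrelated question about differential equations"]
    print(flag_contaminated(items, corpus, n=5, threshold=0.5))  # -> [0]
```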
I hope these criteria are not controversial, but unfortunately (as the examples show) some widely used benchmarks do not satisfy them. Incidentally, both the Amazon Nova evaluation (which I referenced in my earlier blog) and the Stanford Helm-Lite suite report scores for MMLU, GSM8K, and MATH; both also use BLEU scores for MT evaluation (Amazon also reports COMET scores, which are much better).
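To make the metric point concrete, here is a minimal sketch of scoring the same MT output with BLEU and with COMET. It assumes the sacrebleu and Unbabel comet packages and the publicly released WMT22 COMET checkpoint; treat the exact model name and calls as my reading of those libraries' documented interfaces, not as the setup Amazon or Helm-Lite actually used.

```python
# Minimal sketch: score the same MT outputs with BLEU (surface n-gram overlap)
# and COMET (a learned, reference-based quality metric).
# Assumes: pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Der Hund schläft auf dem Sofa."]
hypotheses = ["The dog is sleeping on the sofa."]
references = ["The dog sleeps on the couch."]

# BLEU: purely lexical, so a perfectly good paraphrase can score poorly.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", round(bleu.score, 1))

# COMET: a neural metric that also conditions on the source sentence.
# "Unbabel/wmt22-comet-da" is the publicly released WMT22 checkpoint.
ckpt = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(ckpt)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
result = model.predict(data, batch_size=8, gpus=0)
print("COMET:", round(result.system_score, 3))
```

The contrast matters because BLEU only counts n-gram overlap with the reference, whereas COMET also looks at the source sentence and correlates much better with human judgements of translation quality.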
Properties of good benchmark suites
A good suite (set) of benchmarks should assess many capabilities of an LLM by combining individual benchmarks, and using such suites to assess LLMs seems to be standard practice. Of course the individual benchmarks in a suite should be good benchmarks, as described above, but in addition the suite itself should have some properties:
- Clarity: It should be clear which benchmarks are in the suite! I mention this because when I looked at the evaluation suite for DeepSeek (Github), which has been in the news recently, I did not know what some of the benchmarks were, and there were no links to papers, Github repositories, etc. I could do a Google search, but some benchmarks have multiple versions, so explicit links are much better.
- Diversity: The suite should assess many different capabilities. For example, a suite might contain 1-2 benchmarks to assess knowledge, 1-2 to assess reasoning, 1-2 to assess language understanding, 1-2 to assess language generation, 1-2 to assess dialogue, etc. What I think is dubious is having many benchmarks which assess similar things. For example, the Helm-Lite suite contains 7 benchmarks which assess knowledge via question answering, 2 which assess mathematical problem solving, and one machine-translation benchmark. I think this unbalanced design detracts from the suite's utility (a toy manifest illustrating a more balanced design is sketched below).
- Real-world grounding: At least some (half?) of the benchmarks in the suite should assess utility in real-world applications, such as machine translation, generating financial reports, and summarising news articles on a topic. It's fine for some members of the suite to be generic, but if the purpose of the suite is to help users make informed choices, then it also needs to assess real-world utility in at least some use cases. I didn't see anything that looked real-world in the DeepSeek language eval suite, although perhaps I missed something because of the clarity problems mentioned above; there were some coding benchmarks which seemed closer to real-world.
Real-world grounding is mainly important if the purpose of the benchmark (suite) is to allow users to make informed choices about which LLM (if any) they should use. If the purpose is something else (more abstract exploration of what LLMs can do?), then it may not be important.
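To illustrate the clarity and diversity points above, here is a toy suite manifest: every entry carries an explicit reference, no single capability dominates, and at least half the entries are tied to real-world use cases. The benchmark names and URLs are placeholders of my own invention, not real benchmarks or a recommendation of a specific suite.

```python
# Toy suite manifest (illustrative only): explicit references, balanced
# capabilities, and a mix of generic and real-world-grounded benchmarks.
SUITE = {
    "knowledge": [
        {"name": "knowledge-qa-v1", "ref": "https://example.org/knowledge-qa", "real_world": False},
    ],
    "reasoning": [
        {"name": "math-word-problems-v2", "ref": "https://example.org/math-wp", "real_world": False},
    ],
    "generation": [
        {"name": "clinical-note-summarisation", "ref": "https://example.org/clin-sum", "real_world": True},
        {"name": "news-topic-summarisation", "ref": "https://example.org/news-sum", "real_world": True},
    ],
    "translation": [
        {"name": "wmt-style-mt", "ref": "https://example.org/mt", "real_world": True},
    ],
    "dialogue": [
        {"name": "task-oriented-dialogue", "ref": "https://example.org/dialogue", "real_world": True},
    ],
}

# Quick sanity checks on balance and real-world grounding:
n_total = sum(len(v) for v in SUITE.values())
n_real = sum(b["real_world"] for v in SUITE.values() for b in v)
assert max(len(v) for v in SUITE.values()) <= 2, "no capability should dominate"
assert n_real / n_total >= 0.5, "at least half should be real-world grounded"
```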
Using benchmarks to choose an LLM
Let's suppose Susan wants to use benchmarks to help her decide which LLM to use in a system which summarises doctor-patient consultations. She would probably first look for benchmarks specifically on this task; however, I am not aware of any such benchmarks.
Susan would probably then look for general benchmarks on summarisation. There are plenty of these. Unfortunately the most common ones are ROUGE scores on the CNN/DM and XSum datasets, which are very problematic and lack most of the properties of good benchmarks listed above. But of course Susan may not realise this, since there is no easily accessed, trusted source which could tell her.
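As a concrete illustration of why ROUGE-based benchmarks are problematic here, the sketch below (using Google's rouge-score package; the example sentences are my own) shows a summary that contradicts the reference outscoring a faithful paraphrase, simply because it shares more words with the reference.

```python
# Sketch of a known ROUGE failure mode: lexical overlap is rewarded even when
# the summary contradicts the reference. Assumes: pip install rouge-score
from rouge_score import rouge_scorer

reference = "the drug significantly reduced mortality in elderly patients"
faithful_paraphrase = "older people taking the medication were much less likely to die"
contradiction = "the drug significantly increased mortality in elderly patients"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

for label, summary in [("faithful paraphrase", faithful_paraphrase),
                       ("contradiction", contradiction)]:
    scores = scorer.score(reference, summary)  # score(target, prediction)
    print(f"{label:20s} ROUGE-1 F1 = {scores['rouge1'].fmeasure:.2f}  "
          f"ROUGE-L F1 = {scores['rougeL'].fmeasure:.2f}")

# The contradiction shares 7 of its 8 words with the reference, so it scores
# far higher than the faithful paraphrase despite reversing the finding.
```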
Alternatively, Susan could look at general benchmark suites such as the ones mentioned above, and assume that an LLM which did well at question-answering and math reasoning would also do well at summarising medical documents. This is a very dangerous assumption to make.
In short, current benchmarks are of little help to Susan, which is especially disappointing given the huge amounts of money being spent on developing and using LLMs. Tens of billions of dollars are spent annually on LLMs (and associated stocks are worth trillions), so it is astonishing to me that we don't have more useful benchmarks to guide people who want to use them.
Final thoughts
Until very recently I paid little attention to LLM benchmarks (I am more interested in human evaluation and worst-case/safety evaluation), but I did assume that the benchmark numbers thrown around when a new LLM is announced meant something. Unfortunately it looks like many of them do not mean much, and the set as a whole is not as useful as it could be. Benchmark suites from universities, such as Stanford's Helm-Lite, seem just as bad as the benchmarks used by vendors to promote their products.
One issue is that benchmarks age and become less useful but are still used. Of course this is an old story in NLP and has happened before (blog). On the positive side, new benchmarks which meet the above criteria and genuinely tell us something about models are constantly being announced; recent examples include FrontierMath (math problems which are very difficult for humans) and NovelChallenge (reading-comprehension tasks that are easy for humans but challenging for LLMs). However, there still seems to be little interest in benchmarks that assess real-world utility.
I am disappointed in the current state of affairs; I hope it improves…
Postscript 5-Jan-25: I just saw a reference on Twitter to Killed by LLM, which essentially lists popular LLM benchmarks that are too easy. This list includes quite a few of the benchmarks used by Amazon and DeepSeek.
Postscript 31-Jan-25: Griot et al 2024 point out that LLMs are very bad at saying "I don't know" in response to questions. This is essential in real-world contexts with incomplete/incorrect/inconsistent/etc data, but as far as I know it is not evaluated by any of the major "reasoning" benchmarks.