I don't like leaderboards. As I wrote in a previous blog, most of the ones I see suffer from problems such as poor design, data contamination, incorrect or unrepresentative test data, and measuring things that users don't care about. A fixation on benchmarks also means that important capabilities which are hard to measure are not valued (blog).
Having said this, there are a few leaderboards and benchmarks that I have more respect for, because they are well designed and sensible, and measure things that people care about. Perhaps not coincidentally, these leaderboards are often ignored in announcements about the fantastic achievements of the latest LLMs; it's much easier to make amazing claims about dubious leaderboards.
However, it now looks like even the "good" leaderboards may not be very meaningful, because big LLM developers such as OpenAI "game" the benchmarks. In other words, they tweak the benchmarks and leaderboard rules in order to improve the scores of their systems, which makes scores and comparisons much less meaningful.
SWE-Bench
One benchmark which I respect is SWE-bench, which assesses how well LLMs can address real GitHub issues (bug reports or feature requests) on repositories for complex projects such as Django. Success is measured by passing the associated unit and system tests. SWE-bench is by no means perfect and has many limitations (blog), but it is well designed and does measure something which real users care about.
Unfortunately, I suspect that OpenAI in particular is trying to modify SWE-bench to make its models look better. They introduced and pushed a modified version, SWE-bench Verified, which is now probably the most popular version of this benchmark. The modifications make sense and address real problems in the original benchmark; but there are many ways to address these problems, and I suspect OpenAI chose the approach which was most favourable to their models.
More recently, Alejandro Cuadron posted a thread on X showing that the SWE-bench Verified numbers reported by OpenAI for O1 were much higher than the numbers Cuadron got when he ran the benchmark himself on O1. Details are in the thread, but the short version is that OpenAI tweaked the benchmark to enhance O1's numbers in a way which could perhaps be justified, but which makes "leaderboard" comparison with other systems meaningless. In short, they gamed SWE-bench Verified in order to make O1 look better.
Chatbot Arena
Singh et al (2025) is a fascinating paper which looks at how big LLM vendors improve their scores on Chatbot Arena. Arena is a human evaluation: random people ask questions, are shown responses from two LLMs, and then say which response they preferred. As Singh et al state, Chatbot Arena is very influential, not least because it is based on real user questions and can naturally evolve as models get better. From my perspective, Arena (like SWE-bench) has many limitations, but it is well designed and measures things that people care about, and I consider it one of the best LLM leaderboards.
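To make the mechanics concrete: Arena-style leaderboards aggregate these pairwise human votes into a single rating per model. The sketch below is a minimal Elo-style update rule, a simplification of the Bradley–Terry fitting that Arena actually uses; the model names and win rate are invented for illustration.

```python
import random

def expected_score(r_a, r_b):
    # Probability that model A beats model B under a logistic (Elo) model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=4.0):
    # Nudge both ratings toward the observed vote outcome
    e = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e)
    ratings[loser] -= k * (1 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}

random.seed(0)
# Simulated votes: users prefer model_a in 70% of head-to-head comparisons
for _ in range(10000):
    if random.random() < 0.7:
        update(ratings, "model_a", "model_b")
    else:
        update(ratings, "model_b", "model_a")

# Ratings converge until the predicted win probability matches the
# observed 70% preference, leaving model_a clearly ahead
```

The key property is that ratings only reflect the votes the system sees, which is why the access and sampling asymmetries described below matter so much.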
However, Singh et al show that Arena is being gamed by the big LLM vendors. Details are in the paper, but the short version is that problems include the following:
- Big vendors submit many variants to Arena and publish results only from the best-performing one, which may not be the model behind the public API. Of course, any time multiple variants are evaluated, we expect some to do well purely by chance (a similar problem to running an experiment with 1000 different random seeds and reporting only the best result); so reporting the score of the luckiest variant is not meaningful.
- Big vendors get much more data from Arena, which allows them to fine-tune models to optimise performance (Singh et al report that this process can double Arena scores).
- Because it relies on human evaluation, Arena drops models which are not doing well. Singh et al report that open-source models are much more likely to be dropped than closed models from OpenAI and Google.
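The first problem above is pure selection bias, and it is easy to simulate. The sketch below uses invented numbers (not Arena data): it draws many variants of *identical* true skill, then repeatedly plays the "submit N variants, report only the best" game, showing that the reported score systematically overstates the truth.

```python
import random
import statistics

def observed_score(true_skill, n_votes=500):
    # Win rate estimated from a finite number of noisy pairwise votes
    wins = sum(random.random() < true_skill for _ in range(n_votes))
    return wins / n_votes

random.seed(0)
true_skill = 0.5   # every variant is genuinely identical
n_variants = 20    # hypothetical number of private variants submitted

best_scores = []
for _ in range(200):  # repeat the "report only the best variant" experiment
    scores = [observed_score(true_skill) for _ in range(n_variants)]
    best_scores.append(max(scores))

# The mean reported (best-of-20) score sits well above the true 0.5,
# even though no variant is actually better than any other
print(statistics.mean(best_scores))
```

With these numbers the inflation is a few percentage points of win rate, purely from luck; on a leaderboard where top models are separated by tiny margins, that is enough to change rankings.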
Singh et al point out that the above problems can be fixed (and make recommendations), but they also conclude in Section 9 that
This work demonstrates the difficulty in maintaining fair evaluations, despite best intentions. We show that coordination among a handful of providers and preferential policies from Chatbot Arena towards the same small group have jeopardized scientific integrity and reliable Arena rankings. The widespread and apparent willful participation in the gamification of arena scores from a handful of top-tier industry labs is undoubtedly a new low for the AI research field. As scientists, we must do better. As a community, we must demand better.
Discussion
SWE-bench and Chatbot Arena are the LLM leaderboards which I have the most respect for. So it is discouraging to discover that their validity is being compromised by big LLM vendors gaming them. These benchmarks are much more complex than multiple-choice benchmarks such as MMLU, which makes them more meaningful as predictors of real-world utility, but probably also easier to game.
Of course, vendors have strong incentives to maximise leaderboard positions, since in 2025 these are worth real money (especially from investors); they are no longer just about academic papers and pecking order. So perhaps the behaviour described above should not be a surprise.
Anyway, the key message is that even the best benchmarks and leaderboards may not mean much, which makes it all the more annoying that academics and the media are fixated on leaderboard performance instead of real-world impact. I realise that the media and investors want simple-to-interpret numbers (as a journalist recently emphasised to me), but it's a real shame that (as Singh et al say) the academic community goes along with this instead of insisting on more meaningful evaluation of LLMs.
Adapting the academic gold standard (often enough not followed by academic researchers, I know) of establishing a consensus on meaningful metrics before the trial would solve many of the problems mentioned. However, re "it's a real shame that the academic community goes along with this instead of insisting on more meaningful evaluation of LLMs": I think the enormous pace of developments keeps people busy just staying up to date with the latest state of affairs. New approaches (RAG, agents, MCP, …) keep coming, as do new releases and use cases, such that even thinking about proper ways to evaluate things often falls through the cracks. Which is, of course, a wrong and dangerous prioritisation.
If I am being cynical, I would say that the deeper problem is that the NLP community wants numbers and doesn't really care whether the numbers mean anything in the sense of predicting real-world utility.
This blog post is incredibly timely. I noticed that many people were also discussing “The Leaderboard Illusion” paper during NAACL 2025. Lately, I’ve been thinking about how to measure real-world utility. If we can trust market dynamics and user behavior to some extent, perhaps consumer choices and certain market indicators could serve as proxies for real-world utility. However, these signals are undoubtedly noisy, delayed, and susceptible to various external factors. For example, an LLM vendor could exploit loopholes in Chatbot Arena to achieve a higher ranking, thereby gaining more visibility and attracting more users.