evaluation

Do LLMs cheat on benchmarks

The LLM world is fixated on using benchmarks to measure performance. In previous blogs (eg blog) I have complained about benchmarks which are buggy, measure things users dont care about, are saturated, and/or suffer from data contamination. Of course many other people have made similar complaints, eg Bean et al 2025.

However, there is a deeper and more fundamental problem, which is that LLMs can and do solve benchmarks in ways which do not predict real-world utility. Data contamination (where the LLM has seen the benchmark answers in its training data and simply regurgitates them) is one example, but there are others.

Reward hacking

One generic problem is reward hacking. For example, Baker et al 2025 show that in some cases LLMs can do very well on coding benchmakes which are based on unit tests, by using system calls to bypass the tests (if the tests are not run, they cannot fail). In the words of Zhong et al 2025, “capable LLMs may find and exploit “shortcuts” to pass the tests instead of solving genuine issues, effectively cheating their way to success.”

I find Zhong et al to be a particularly interesting paper here. In the coding space, they set up an “ImpossibleBench” which consists of unit tests which are wrong. For example, they create a unit test for a function f which asserts that both f(2) == 4 and f(2) == 5. The only way pass such unit tests is to cheat, and they discover that models frequently do so; the worst is GPT5, which cheats in 76% of cases. Cheating techniques are described in the paper; for example the models redefine the equality operator used in the unit tests so that it always returns true. Another interesting paper in this space with real-world examples of cheating is Hamin and Edelman 2025.

Machine learning systems of course look for ways to solve problems and optimise reward functions, they do not distinguish between “cheating” (copying answers, bypassing unit tests) and solving the problem in a generalisable and useful way. If the most effective strategy to solving a problem involves what humans call cheating, then an ML system will cheat.

Perverse incentives

The situation is made worse by perverse incentives to the humans involved. Real software developers, testers, and companies value unit tests as way to deliver higher-quality products to their customers. No developer, tester, or company I have worked with would bypass unit tests as described above, because the tests are a valuable tool to achieving their goal of high-quality software.

However, in the bizarre world of AI, academics are highly motivated to show that their systems perform well on benchmarks, and realise that the research community in general places little value on supporting such claims with high-quality careful experiments (blog). In commercial AI, LLM vendors seem to regard benchmarks as marketing tools to increase their valuation, and have noticed that while valuations go up if they report good benchmark performance, they dont go down if obscure academics question the validity of their claims. In both cases, the incentives are to do whatever it takes to show good performance, even if this means ignoring “cheating”.

Measure real-world impact instead of benchmarks!

As models become larger and more sophisticated, it unfortunately seems that they cheat more. Zhong et al found that GPT5 had the highest cheating rate in the coding domain. Another worrying development is search-time data contamination (Han et al 2025). In the past, we could avoid data contamination (test data being present in a model’s training data) by using test data created after the LLM’s training cut-off date. But modern LLM’s use real-time internet search to find information, which means that we cannot use any test data on the internet. For example, Han et al report that some models performance on question-answering benchmarks fell by 15% when access to Huggingface (which contained some of the test data) was blocked.

Many suggestions have been made to improve benchmarks to mitigate the above problems; almost every paper cited above does this. The suggestions are good and it would be great if they were adopted. However, I am not optimistic, in part because of the perverse incentives mentioned above. Academic research culture places little value on careful evaluation (blog), and LLM vendors seem to regard benchmarks as marketing tools where (as in all marketing) they can bend the truth. This is not a good context for carefully constructing high-quality and meaningful benchmarks.

If we really want to measure how effective and useful LLMs are, then we need to move beyond benchmarks and directly measure the real-world impact of LLMs on actual users (blog). In the software development domain, for example, we can conduct experiments where we measure impact of LLM tools on developer productivity (Becker et al 2025). Such experiments are currently very rare (Reiter 2025), but I see no alternative if we actually care about accurate measures of effectiveness and utility.

One thought on “Do LLMs cheat on benchmarks

Leave a comment