
Do LLM coding benchmarks measure real-world utility?

I recently wrote a blog which (amongst other things) complained that LLM benchmarks did not measure real-world utility. A few people responded that they thought coding benchmarks might be an exception, since many software developers use LLMs to help them create software.

A key point is that LLM benchmarks measure very different things from studies that evaluate real-world utility. I give examples below of both a (good) LLM coding benchmark and a (good) real-world evaluation, and discuss the difference.

SWE-bench: a good coding benchmark

The best LLM coding benchmark I am aware of is SWE-bench (and variants such as SWE-bench Verified). Whereas most coding benchmarks (such as HumanEval) ask LLMs to solve “Leetcode” problems, ie one-off, smallish, stand-alone coding tasks, SWE-bench looks at how well LLMs can address real GitHub issues (bug reports or feature requests) on GitHub repositories for complex projects such as django. Success is measured by passing the associated unit and system tests.
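To make the evaluation protocol concrete, here is a rough sketch of what a SWE-bench-style harness does for each task. This is not the actual SWE-bench code (the real harness also builds a dedicated environment per repository and checks that previously passing tests still pass), and the function and parameter names are my own.

```python
# A minimal sketch of a SWE-bench-style evaluation step (illustrative only,
# not the real harness).
import pathlib
import subprocess
import tempfile

def run(cmd, cwd):
    """Run a shell command in cwd; return True if it exits with status 0."""
    return subprocess.run(cmd, cwd=cwd, shell=True).returncode == 0

def evaluate_instance(repo_url, base_commit, model_patch, tests_to_pass):
    """Apply the LLM's patch at the issue's base commit and re-run the tests."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    if not run(f"git clone {repo_url} repo", cwd=workdir):
        return False
    repo = workdir / "repo"
    run(f"git checkout {base_commit}", cwd=repo)

    # The LLM is given the issue text and produces a patch against the repo.
    (repo / "model.patch").write_text(model_patch)
    if not run("git apply model.patch", cwd=repo):
        return False  # the patch does not even apply cleanly

    # The issue counts as resolved only if its associated tests now pass.
    return run("python -m pytest " + " ".join(tests_to_pass), cwd=repo)
```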

SWE-bench is far closer to what real-world software developers do than benchmarks such as HumanEval! It’s also much more challenging, and one of the disappointments I had with the Amazon Nova evaluation (which I discussed in previous blogs) is that it uses HumanEval instead of SWE-bench to assess coding performance.

With that being said, it’s also important to remember that SWE-bench does not measure many quality criteria that are important in real-world software development, including:

  • Code clarity and maintainability. Wen et al 2024 claim that the widely used RLHF alignment technique trains LLMs to generate code which is difficult to understand and debug, so it is possible that LLM-generated solutions which pass the tests are nonetheless unclear and hard to maintain.
  • Debugging time. If code does not work, then in the real world developers will debug it until it does. Many developers have complained to me that debugging incorrect LLM-produced code is much harder than debugging code they wrote themselves.
  • Security. Code of course should be secure and not contain bugs which hackers can exploit.
  • Performance. Run-time efficiency is important; we want fast code.
  • Works well with human developers. In the real world, LLMs are used to assist humans, so we care how well they do this, not just how well they write code on their own.

Jatin Ganhotra also made the point to me that SWE-bench includes a wide range of different issues, and performance on the overall collection may not be a good predictor of performance on specific types of issues (blog).
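A toy illustration of this point: a single headline resolution rate can look respectable while hiding very different success rates per issue type. The numbers below are made up purely for illustration, not taken from SWE-bench results.

```python
# Made-up per-instance outcomes, grouped by issue type, to show how an
# aggregate benchmark score can hide large per-category differences.
from collections import defaultdict

results = [  # (issue_type, resolved) -- hypothetical outcomes, not real data
    ("bug-fix", True), ("bug-fix", True), ("bug-fix", True), ("bug-fix", False),
    ("new-feature", True), ("new-feature", False), ("new-feature", False),
    ("refactor", False), ("refactor", False), ("refactor", False),
]

by_type = defaultdict(list)
for issue_type, resolved in results:
    by_type[issue_type].append(resolved)

print(f"overall: {sum(r for _, r in results) / len(results):.0%}")  # one number...
for issue_type, outcomes in by_type.items():
    print(f"{issue_type}: {sum(outcomes) / len(outcomes):.0%}")     # ...hiding variation
```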

Real-world utility of GitHub Copilot

Pandey et al 2024 asked 26 engineers to use GitHub Copilot on a variety of software development tasks, and compared their productivity to their productivity on similar tasks done without Copilot. The engineers worked on real projects and kept detailed logs; this enabled analysis of Copilot’s effectiveness across different tasks and contexts. Pandey et al exclude requirements analysis, design, dependency alignment and coordination, and test execution, which they acknowledge account for a significant part of development effort.

The paper is fascinating, not least for qualitative insights and examples (including many of the issues mentioned above, eg code from Copilot which is too slow). I strongly recommend that anyone interested in this topic read the paper.

In terms of numbers, they report that on average Copilot led to efficiency gains (time savings) of around 1/3 on the tasks it was used for (although I suspect this may be inflated by the Hawthorne effect). Efficiency gains were highest for documentation, and lowest for debugging and refactoring. From a language perspective, efficiency gains were highest for JavaScript, and lowest for C/C++. Perhaps most importantly, efficiency gains were much higher for repetitive tasks than for complex tasks.

Pandey et al conclude by saying (amongst other things):

Copilot excels at reducing the time developers spend on repetitive and boilerplate tasks through its autocomplete functions and ability to generate relatively good boilerplate code… The tool also supports the generation of high-quality, consistent code comments and documentation that adhere to language conventions and style guides…

GitHub Copilot has limited utility for advanced proprietary code, such as code that implements unique business logic, or where the relevant code is distributed over many files… Copilot can sometimes hallucinate, but this is less of an issue [than] more subtle errors such as missing error checks, unoptimized code or insecure code…

In conclusion, while GitHub Copilot presents substantial benefits in enhancing developer productivity and code quality, awareness of its limitations and careful implementation are crucial to maximizing its effectiveness in software development environments.

SWE-bench vs Pandey et al

As can be seen from the above, the real-world study of utility is **very** different from even a good coding benchmark such as SWE-bench! The real-world study looks at many more tasks than just coding and many quality criteria beyond correct functionality; it measures impact on developer productivity and tries to assess where LLMs can “add value” in the software development process.

If an LLM performs well on SWE-bench, will it prove to be effective in a Pandey-style real-world evaluation? I’m not aware of any data on this, but the fact that the scope of SWE-bench is very limited compared to Pandey et al (as described above) is not encouraging. Since SWE-bench-type coding tasks are only a small part of what Pandey et al look at, doing well on SWE-bench-type tasks isn’t going to have much impact on overall productivity as measured by Pandey et al. Similarly, an LLM which generates functionally correct code will not be used if the code is hard to maintain, slow, hackable, etc.
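A back-of-the-envelope, Amdahl’s-law-style calculation makes this concrete. The 30% share of developer time and the 2x speed-up below are purely illustrative assumptions of mine, not figures from SWE-bench or from Pandey et al.

```python
# Amdahl-style estimate: overall time saved when only one slice of the work
# (SWE-bench-like coding) gets faster. All numbers are illustrative assumptions.

def overall_time_saving(fraction_on_slice, slice_speedup):
    """Fraction of total time saved.

    fraction_on_slice: share of total effort spent on the accelerated slice (0..1)
    slice_speedup: how many times faster that slice becomes (2.0 = twice as fast)
    """
    return fraction_on_slice * (1 - 1 / slice_speedup)

# If SWE-bench-like issue fixing were 30% of a developer's time and an LLM made
# that slice twice as fast, the overall saving would still be only 15%:
print(overall_time_saving(0.30, 2.0))  # -> 0.15
```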

Of course it is possible that an LLM that does well at SWE-bench coding tasks will also do well on other tasks such as documentation and debugging; and an LLM that produces functional code may also produce code that is efficient, secure, maintainable, etc. But we cannot assume this is the case.

Final thoughts

LLMs are widely used by software developers and are having a real impact on productivity; their impact on software development is much greater than, say, their impact on healthcare. Pandey et al’s estimate of a 33% increase in productivity may be too high, but even a 20% increase in productivity is worth hundreds of billions to the global economy! So measuring how effective LLMs are at helping developers is important, especially if this gives us insights as to where/when/how to use LLM technology to help developers.

I respect SWE-bench’s attempt to be more realistic than previous coding benchmarks. Despite this, though, it remains true that SWE-bench focuses on just a few tasks and on the single quality criterion of correct functionality; good performance here will not have much impact on productivity if there is no gain in other tasks and quality criteria. So on its own SWE-bench is not a reliable measure of how much LLMs help developers, and does not give good guidance on where/when/how to use the technology.

I guess my biases are clear. Pandey et al’s experiment has flaws, but they are trying to measure important things which matter in the real world. I’m not sure this is the case for SWE-bench and other common LLM coding benchmarks.

Postscript 22-Jan-25: I saw a really interesting note on X showing that performance on SWE-bench Verified (OpenAI’s version of SWE-bench) depends on how exactly it is measured. Sometimes people like benchmarks because they seem to give very precise numbers such as 48.9%, but measurement issues may mean that what “48.9%” really means is “somewhere between 20% and 60%”.