I recently talked to a journalist about LLM benchmarks, expressing my frustration with the current situation. During our chat, the journalist speculated, amongst other things, that:
- Capabilities that cannot be assessed by standard benchmarks are regarded as less interesting and important; this includes the increased emotional sensitivity of GPT 4.5.
- Standard benchmarks are an essential tool for guiding the development of models.
I suspect he may have been exaggerating in order to be provocative. But anyway, this discussion highlighted to me that there is a real danger that our fixation on benchmarks will make models less useful at real-world tasks.
Don’t ignore things that are hard to measure
I’m very interested in using AI to support patients, and one problem we have repeatedly seen is that LLMs generate texts that are emotionally upsetting or otherwise inappropriate. Improving the ability of LLMs to generate emotionally appropriate texts is therefore very important in this context, and for this reason I applaud GPT 4.5’s advancement; other people are also telling me that GPT 4.5 does a significantly better job at some patient-facing applications.
So to dismiss GPT 4.5’s improved emotional capabilities because they are hard to measure seems completely inappropriate, to put it mildly. Especially since I have been trying and failing to get support to start working on benchmarks in this area. The logic seems circular, to be honest: there is little interest in benchmarks for emotional capabilities, so these capabilities are regarded as less important, which further diminishes interest in developing benchmarks…
Kapoor et al’s excellent review of AI in Law (paper) makes a related point: the capabilities that would have the most impact in legal contexts are the ones that are hardest to measure.
Don’t fixate on things that are easy to measure
As I have said elsewhere, most of the standard LLM benchmarks do not measure capabilities that are widely used. For example, very few people use LLMs to solve math problems (MATH, GSM8K, etc) or to answer complex scientific questions (MMLU, GPQA, etc). So optimising LLMs to do well on math or scientific benchmarks means optimising them for things that not many people care about, which seems somewhat bizarre.
The exception is software development benchmarks, since LLMs are widely used for software development. But even here, the best benchmarks such as SWEBench only measure limited aspects of software development (blog); i.e., they have limited “construct validity”. I’ve also noticed that a lot of benchmarking exercises opt for simpler and less meaningful coding benchmarks such as HumanEval, because running SWEBench is complex and expensive, and the community (as often happens) seems to prefer cheap-and-easy evaluations even if they are not useful predictors of real-world utility.
Measure a wide range of capabilities
The journalist also asked how I thought LLMs should be assessed. I mentioned a few limited practical things. For example, he had used MMLU in some articles; I pointed out that MMLU was buggy, leaked/contaminated, and too easy, and that MMLU-Pro was better if he felt he had to report something in this space. I also made some suggestions which I suspect will not be happening anytime soon, such as more use of human evaluation and real-world KPI measurements (impact).
An intermediate option, which is more ambitious than replacing MMLU with MMLU-Pro but less ambitious than evaluating real-world impact, is to have much better benchmark suites which measure a wide range of capabilities, including capabilities which users actually care about. For example, there is not much point in including three math-reasoning benchmarks in a suite, especially since very few people use LLMs for mathematical reasoning.
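To make the “suite” idea concrete, here is a minimal sketch of how a suite could aggregate scores so that each capability counts once rather than letting duplicated benchmarks dominate the average. This is purely illustrative: the benchmark names are real, but the capability groupings, weights, and scores are invented, and no existing leaderboard works exactly this way.

```python
# Hypothetical sketch: score a benchmark suite per *capability*, not per benchmark,
# so that (say) three math benchmarks do not count three times.
from collections import defaultdict

# Rough mapping from benchmark to the capability it measures (illustrative only).
CAPABILITY_OF = {
    "MATH": "math_reasoning",
    "GSM8K": "math_reasoning",
    "MMLU-Pro": "knowledge_qa",
    "GPQA": "knowledge_qa",
    "SWEBench": "software_dev",
}

# Invented weights, e.g. reflecting how often real users rely on each capability.
CAPABILITY_WEIGHT = {
    "math_reasoning": 0.1,
    "knowledge_qa": 0.3,
    "software_dev": 0.6,
}

def suite_score(benchmark_scores: dict[str, float]) -> float:
    """Average scores within each capability, then take a weighted mean across
    capabilities, so duplicated benchmarks are not double-counted."""
    per_capability = defaultdict(list)
    for bench, score in benchmark_scores.items():
        per_capability[CAPABILITY_OF[bench]].append(score)

    total, weight_sum = 0.0, 0.0
    for cap, scores in per_capability.items():
        weight = CAPABILITY_WEIGHT[cap]
        total += weight * (sum(scores) / len(scores))
        weight_sum += weight
    return total / weight_sum

# Invented example scores for a single model.
print(suite_score({"MATH": 0.72, "GSM8K": 0.90, "MMLU-Pro": 0.65,
                   "GPQA": 0.48, "SWEBench": 0.33}))
```

The point of the sketch is simply that the weighting and grouping decisions, not the raw benchmark scores, are where real-world relevance gets encoded.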
From this perspective, I was really interested in the new HELM Capabilities suite from Stanford, because it explicitly mentioned alignment with a diverse set of capabilities as a goal. To be honest, I found the actual contents of the Capabilities suite disappointing; there is still considerable overlap (e.g., it includes both MMLU-Pro and GPQA, which I think measure similar capabilities) and no applied benchmarks such as software development. But still, the vision is good, and hopefully the execution will get better.
Final thoughts
The LLM community (commercial as well as research) seems fixated on benchmarks which are easy to run and measure things that are easy to measure. I think this is dangerous, because a lot of the things that real LLM users care about are not easy to measure. And we cannot ignore these if we want LLMs to genuinely help real-world users.
Of course, individual users can and will decide which LLM to use based on their own requirements, not based on benchmarks. But LLMs will be less useful to these people if the community ignores their requirements because they are hard to benchmark.
I’m always a fan of measuring benefits from actual use.
This study looked into the effect of an AI scribe (DAX Copilot) in a clinical setting:
Duggan, M. J., Gervase, J., Schoenbaum, A., Hanson, W., Howell, J. T., Sheinberg, M., & Johnson, K. B. (2025). Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Network Open, 8(2), e2460637. (link)
I would love to see evaluations like these. Of course, only if the setting is safe and only as an addition to other metrics.
Great paper, thanks for mentioning it!
I fully agree that LLM benchmarks are often not measuring what matters. And the issue is made worse by LLMs’ ability to address a vast number of tasks. One possible approach to solve this, which we pursued in this preprint, is to annotate items of existing benchmarks according to the demands they pose on a range of cognitive capabilities. Combining these annotations with the performance of an LLM yields strong predictive performance on new tasks, which may be a way to cover the space of tasks more effectively.
Thanks for the comment and paper. To be honest, it looks like the tasks you are looking at are still rather artificial. Have you considered adding real-world tasks such as coding assistance or medical report writing?
Yes, indeed you are right that the benchmarks we use in that paper are artificial. What we would need to do is check whether the annotations and performance on the “artificial” tasks are predictive of performance on real tasks, or indeed use realistic tasks directly in the first place and check whether there is generalisation across different tasks.