evaluation

Do LLMs cheat on benchmarks?

LLMs often “cheat” on benchmarks via data contamination and reward hacking. Unfortunately, this problem seems to be getting worse, perhaps because of perverse incentives. If we want to genuinely and meaningfully evaluate LLMs, we need to move beyond benchmarks and start measuring real-world impact.

AI in Healthcare

Most common uses of AI in healthcare

I review some data on usage of AI in healthcare, and conclude that the most common uses in 2025 are probably (A) giving personalised health information to patients and (B) helping clinicians write documents. We’ve worked on both of these topics at Aberdeen; most researchers, however, focus on AI for decision support, which is not widely used in practice.

evaluation

More on evaluating impact

I recently published a paper and gave a talk about evaluating real-world impact. I received some great feedback on both, and here I summarise some of the papers readers suggested (including more examples of impact eval) and the insightful comments I received (eg, about the eval “ecosystem”).