evaluation

Even good leaderboards may not be useful, because they are gamed

May 5, 2025 | ehudreiter | 4 Comments

Most LLM benchmarks and leaderboards are garbage. Unfortunately, it now seems that even the few “good” benchmarks (such as SWE-bench and Chatbot Arena) are compromised because they are being gamed by the big LLM vendors, who tweak the benchmarks and rules so that their systems do better.

evaluation

Examples of evaluating real-world impact

Apr 8, 2025 / Aug 3, 2025 | ehudreiter | 4 Comments

I describe several papers which measure real-world impact of NLP systems, using different methodologies (A/B test, before/after eval, clinical trial, observational study). I hope these examples inspire and encourage more people to consider evaluating real-world impact!

evaluation

Benchmarks distract us from what matters

Mar 26, 2025 | ehudreiter | 7 Comments

I suspect that our fixation with LLM benchmarks may be driving us to optimise LLMs for capabilities that are easier to benchmark (such as math problems) even if they are not of much interest to users; and also to ignore capabilities (such as emotional appropriateness) which are important to real users but hard to assess with benchmarks.

other

People do not understand how LLMs can/cannot help them

Mar 13, 2025 | ehudreiter | 1 Comment

People will make much better use of LLMs if they understand what the technology can and cannot do. Unfortunately, many people have little understanding of this; I make a few suggestions which perhaps could help a bit.

other

Improving Bayesian Networks

Mar 3, 2025 | ehudreiter

Nikolay Babakov has recently published several papers on Bayesian networks, including challenges in reusing BNs, ideas for explaining BNs (work with Jaime Sevilla), and using LLMs to help build BNs. I help to supervise Nikolay, and think BNs can potentially be a useful way to do reasoning with uncertainty which is configurable and explainable.

evaluation

I want a benchmark for emotional upset

Feb 17, 2025 | ehudreiter | 1 Comment

I would love to see benchmarks which assess whether generated texts are emotionally upsetting. This is a major problem which we frequently encounter in our work on using AI to support patients. It would be challenging to build such a benchmark (nothing like it exists today), but we need a broader range of benchmarks which assess complex real-world quality criteria such as emotional impact.

evaluation

NLG Evaluation 2025 vs 2015: much improved but needs to be better

Feb 4, 2025 | ehudreiter

How has NLG evaluation changed in the past ten years? The short answer is that the tech is much better (e.g., LLM-as-judge), but practice (e.g., experimental rigour) remains poor, and commercial interests are more prominent.

other

Vision: AI personal health assistants

Jan 23, 2025 / Jan 29, 2025 | ehudreiter | 5 Comments

I think there is enormous potential in using AI personal health assistants to improve health, including things like helping patients manage chronic illness, live more healthily, make informed decisions, and communicate with clinicians. There are huge challenges (technical and non-technical), but if this could be done well, it could radically improve health and enable healthcare systems to cope with increasingly elderly populations.

evaluation

Do LLM coding benchmarks measure real-world utility?

Jan 13, 2025 / Jan 22, 2025 | ehudreiter | 6 Comments

LLM benchmarks for coding are closer to real-world use than other LLM benchmarks, but they still do not measure real-world utility. I explain this by contrasting what is measured by SWE-bench with what is measured by a recent study of real-world utility in software development.

evaluation

We need better LLM benchmarks

Jan 3, 2025 / Jan 31, 2025 | ehudreiter | 10 Comments

Current benchmarks (and benchmark suites) for evaluating LLMs are disappointing. I describe the properties that I think good benchmarks and benchmark suites should have, but often do not, such as being correct, challenging, diverse, and real-world.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
