evaluation

Dont ignore omissions!

Feb 11, 2026 ehudreiter1 Comment

Most semantic evaluation of LLMs focuses on accuracy and hallucination. These are very important, but it is also important to look at completeness and omission; does the generated text include all of the key information which the user needs to know? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.

evaluation

Do a sanity check on your experiments

Dec 22, 2025Dec 22, 2025 ehudreiterLeave a comment

I strongly recommend that researchers do “sanity checks” on data, model outputs, and evaluation results, looking for anomalies. This can help detect data errors, model cheating, software bugs, and other flaws which distort experiments.

evaluation

Do LLMs cheat on benchmarks

Dec 8, 2025 ehudreiter1 Comment

LLMs often “cheat” on benchmarks via data contamination and reward hacking. Unfortunately, this problem seems to be getting worse, perhaps because of perverse incentives. If we want to genuinely and meaningfully evaluate LLMs, we need to move beyond benchmarks and start measuring real-world impact.

evaluation

Defining hallucination is not straightforward

Sep 10, 2025Sep 11, 2025 ehudreiter3 Comments

Most academic work assumes that hallucination is a binary feature: either something is a hallucination or it is not a hallucination. But this is too simplistic. In real-world contexts we see many subtleties, eg some hallucinations are much more damaging than others, statements which are literally true can still mislead readers because of context, and there are many borderline cases.

evaluation

More on evaluating impact

Aug 5, 2025 ehudreiter3 Comments

I recently published a paper and gave a talk about evaluating real-world impact. I got some great feedback from this, and summarise some of the suggested papers (including more examples of impact eval) and insightful comments (eg, about eval “ecosystem”) I received.

evaluation

Even good leaderboards may not be useful, because they are gamed

May 5, 2025May 5, 2025 ehudreiter3 Comments

Most LLM benchmarks and leaderboards are garbage. Unfortunately, it now seems that even the few “good” benchmarks (such as SWEBench and Chatbot Arena) are compromised because they are being gamed by the big LLM vendors, who tweak the benchmarks and rules so that their systems do better.

evaluation

Examples of evaluating real-world impact

Apr 8, 2025Aug 3, 2025 ehudreiter4 Comments

I describe several papers which measure real-world impact of NLP systems, using different methodologies (A/B test, before/after eval, clinical trial, observational study). I hope these examples inspire and encourage more people to consider evaluating real-world impact!

evaluation

Benchmarks distract us from what matters

Mar 26, 2025 ehudreiter6 Comments

I suspect that our fixation with LLM benchmarks may be driving us to optimise LLMs for capabilities that are easier to benchmark (such as math problems) even if they are not of much interest to users; and also to ignore capabilities (such as emotional appropriateness) which are important to real users but hard to assess with benchmarks.

evaluation

I want a benchmark for emotional upset

Feb 17, 2025Feb 17, 2025 ehudreiter1 Comment

I would love to see benchmarks which assess whether generated texts are emotionally upsetting. This is a major problem which we frequently encounter in our work on using AI to support patients. It would be challenging to build such a benchmark (nothing like it exists today), but we need a braoder range of benchamarks which assess complex real-world quality criteria such as emotional impact.

evaluation

NLG Evaluation 2025 vs 2015: much improved but needs to be better

Feb 4, 2025Feb 4, 2025 ehudreiterLeave a comment

How has NLG evaluation changed in past ten years? Short answer is that tech is much better (eg, LLM-as-judge), but practice (eg experimental rigour) remains poor, and commercial interests are more prominent.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Category: evaluation

Dont ignore omissions!

Do a sanity check on your experiments

Do LLMs cheat on benchmarks

Defining hallucination is not straightforward

More on evaluating impact

Even good leaderboards may not be useful, because they are gamed

Examples of evaluating real-world impact

Benchmarks distract us from what matters

I want a benchmark for emotional upset

NLG Evaluation 2025 vs 2015: much improved but needs to be better