Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Category: evaluation

evaluation

Do LLM benchmarks ignore NLG?

Dec 26, 2024 / Dec 27, 2024 · ehudreiter · 2 Comments

I was very disappointed to realise that the evaluation suite for Amazon Nova (and I assume for other LLMs) has poor coverage of NLG tasks. This is surprising, since LLMs are largely used to generate texts; shouldn't they be evaluated, at least in part, on their ability to do this well?

evaluation

MQM shows the power of a gold-standard evaluation

Dec 2, 2024 · ehudreiter · 1 Comment

I am very happy to see that the MT community is adopting the annotation-based MQM protocol as a gold-standard evaluation technique. Having such a gold standard both strengthens evaluation and supports exciting new research in evaluation.

evaluation

Qualitative evaluation

Oct 7, 2024 · ehudreiter · 1 Comment

In NLG we focus on quantitative evaluation, but qualitative techniques can also be used. Quantitative hypothesis testing is essential, but it's also really useful to ask people what they think of an NLG system in an open-ended way.

evaluation

One-day class on NLG evaluation

Sep 9, 2024 · ehudreiter · 3 Comments

In early Sept I ran a one-day class on evaluation. I summarise what I did in this class and give links to my presentations, in case this is useful to other people.

evaluation

Challenges in Evaluating LLMs

Jul 10, 2024 / Jul 19, 2024 · ehudreiter · 2 Comments

I list five challenges to evaluating LLMs, which unfortunately seem to be ignored by many researchers. This means that many published LLM evaluations cannot be trusted. This blog is based on a recent workshop talk.

evaluation

Can LLM-based eval replace human evaluation?

Jun 11, 2024 · ehudreiter · 3 Comments

I suspect we may be reaching the point where the most common type of human evaluation in NLG (ratings/rankings by crowdworkers or students) is less meaningful than evaluation using LLMs. But better forms of human evaluation, based on annotation or impact, are still very useful and give insights which we cannot get from LLMs.

evaluation

Human eval: Subjects must understand the task

May 28, 2024 · ehudreiter · 2 Comments

In human evaluation, it is absolutely essential that subjects understand what they are supposed to do; otherwise evaluations will not be meaningful or replicable. This may sound obvious, but it was repeatedly raised as a concern in the replication shared task in the 2024 Human Evaluation workshop.

evaluation

Ten tips on doing a good evaluation

Apr 8, 2024 · ehudreiter · 2 Comments

I present some suggestions for doing good evaluations, which are based on previous blogs I have written.

evaluation

I’m very worried about data contamination

Mar 12, 2024 / Mar 13, 2024 · ehudreiter · 9 Comments

Data contamination (testing and evaluating LLMs using test data which is known to the LLM) may be a huge problem in NLP, leading to a lot of invalid scientific claims. Unfortunately, many NLP researchers ignore the problem, which is really worrying.

evaluation

We should evaluate real-world impact!

Nov 13, 2023 / Aug 3, 2025 · ehudreiter · 14 Comments

It is very rare to see evaluations in the NLP research literature which are based on measuring the impact of systems on real-world users. I’d love to see more such evaluations, and I describe some ways of doing this, along with a few examples.

News: I am likely to retire in summer 2026. Looking for interesting things to do afterwards.
