evaluation

Do LLM coding benchmarks measure real-world utility?

Jan 13, 2025 | Jan 22, 2025 | ehudreiter | 6 Comments

LLM benchmarks for coding are closer to real-world use than other LLM benchmarks, but they still do not measure real-world utility. I explain this by contrasting what is measured by SWE-bench with what is measured by a recent study of real-world utility in software development.

evaluation

We need better LLM benchmarks

Jan 3, 2025 | Jan 31, 2025 | ehudreiter | 9 Comments

Current benchmarks and benchmark suites for evaluating LLMs are disappointing. I describe the properties that I think good benchmarks and benchmark suites should have but often lack, such as being correct, challenging, diverse, and real-world.

evaluation

Do LLM benchmarks ignore NLG?

Dec 26, 2024 | Dec 27, 2024 | ehudreiter | 2 Comments

I was very disappointed to realise that the evaluation suite for Amazon Nova (and, I assume, for other LLMs) has poor coverage of NLG tasks. This is surprising, since LLMs are largely used to generate texts; shouldn't they be evaluated, at least in part, on their ability to do this well?

evaluation

MQM shows the power of a gold-standard evaluation

Dec 2, 2024 | ehudreiter | 2 Comments

I am very happy to see that the MT community is adopting the annotation-based MQM protocol as a gold-standard evaluation technique. Having such a gold standard both strengthens evaluation and supports exciting new research in evaluation.

evaluation

Qualitative evaluation

Oct 7, 2024 | ehudreiter | 2 Comments

In NLG we focus on quantitative evaluation, but qualitative techniques can also be used. Quantitative hypothesis testing is essential, but it's also really useful to ask people what they think of an NLG system in an open-ended way.

evaluation

One-day class on NLG evaluation

Sep 9, 2024 | ehudreiter | 3 Comments

In early September I ran a one-day class on evaluation. I summarise what I did in this class and give links to my presentations, in case this is useful to other people.

evaluation

Challenges in Evaluating LLMs

Jul 10, 2024 | Jul 19, 2024 | ehudreiter | 2 Comments

I list five challenges to evaluating LLMs, which unfortunately seem to be ignored by many researchers; this means that many published LLM evaluations cannot be trusted. This blog is based on a recent workshop talk.

evaluation

Can LLM-based eval replace human evaluation?

Jun 11, 2024 | ehudreiter | 3 Comments

I suspect we may be reaching the point where the most common type of human evaluation in NLG (ratings/rankings by crowdworkers or students) is less meaningful than evaluations using LLMs. But better forms of human evaluation, based on annotation or impact, are still very useful and give insights which we cannot get from LLMs.

evaluation

Human eval: Subjects must understand the task

May 28, 2024 | ehudreiter | 2 Comments

In human evaluation, it is absolutely essential that subjects understand what they are supposed to do; otherwise evaluations will not be meaningful or replicable. This may sound obvious, but it was repeatedly raised as a concern in the replication shared task in the 2024 Human Evaluation workshop.

evaluation

Ten tips on doing a good evaluation

Apr 8, 2024 | ehudreiter | 2 Comments

I present some suggestions for doing good evaluations, which are based on previous blogs I have written.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Category: evaluation