evaluation

I’m very worried about data contamination

Mar 12, 2024 (updated Mar 13, 2024), ehudreiter, 9 Comments

Data contamination (testing and evaluating LLMs using test data which is known to the LLM) may be a huge problem in NLP, leading to a lot of invalid scientific claims. Unfortunately, many NLP researchers ignore the problem, which is really worrying.

evaluation

We should evaluate real-world impact!

Nov 13, 2023 (updated Aug 3, 2025), ehudreiter, 14 Comments

It is very rare to see evaluations in the NLP research literature which are based on measuring the impact of systems on real-world users. I’d love to see more such evaluations, and I describe some ways of doing them, along with a few examples.

evaluation

A bad way to measure hallucination

Oct 31, 2023 (updated Dec 4, 2023), ehudreiter, 5 Comments

The easiest way to measure hallucinations is to ask Turkers to count incorrect statements in a text. However, reproduction papers published at the Human Evaluation workshop suggest that this is not a reliable way to measure hallucination. Hopefully researchers will switch to better ways to measure hallucination!

evaluation

There are many types of human evaluation!

Sep 13, 2023, ehudreiter, 1 Comment

Many people assume that “human evaluation” means asking people to rate or rank outputs. However, there are many other types of human evaluation, most of which give more meaningful results than rating or ranking! I discuss some of these, including task-based evaluation, annotation-based evaluation, and real-world evaluation.

evaluation

My MSc students evaluate chatGPT

Aug 16, 2023 (updated Aug 18, 2023), ehudreiter

Four of my MSc students did projects evaluating chatGPT in different use cases: developer assistance, translation, exam taking, and emotional support. In every use case, chatGPT was often very impressive, but also had issues and limitations.

evaluation

Are Experts Needed in Human Evaluation?

Jul 10, 2023, ehudreiter, 3 Comments

An ACL paper from the PhilHumans project looks at using experts vs non-experts in human evaluation. It concludes that the agreement between experts and non-experts is worse for texts from GPT3 than for texts from GPT2; in other words, non-expert evaluation is less useful for high-quality texts produced by recent LLMs.

evaluation

Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful

Jun 21, 2023, ehudreiter, 1 Comment

Evaluation advice in one sentence: plan your experiments in advance, including details; keep your experiment as simple and standard as possible; do a pilot experiment first to make sure everything works; and be very careful when you run the main experiment.

evaluation

Future of NLG evaluation: LLMs and high quality human eval?

May 22, 2023, ehudreiter, 3 Comments

We may see a big change in NLG evaluation over the next few years, with LLM-based evaluation replacing metrics such as BLEU and BLEURT, and a renewed emphasis on high-quality human evaluation to assess semantic and pragmatic correctness. It would be a step forward if this happens!

evaluation

Evaluating chatGPT

Apr 4, 2023 (updated Apr 27, 2023), ehudreiter, 11 Comments

I love getting questions about how to evaluate chatGPT; they are much more constructive than speculations about whether it is a threat to humanity. We need to understand what LLM technology can and cannot do, and rigorous experiments are the best way to do this. I give some advice and caveats about evaluating chatGPT in this blog, and am happy to answer questions from people who want to do high-quality evaluations.

evaluation

Evaluating factual accuracy in complex data-to-text

Feb 7, 2023 (updated Feb 11, 2023), ehudreiter, 6 Comments

The CSL journal has just published a paper, “Evaluating factual accuracy in complex data-to-text”, which summarises our work in this area. I strongly recommend the paper to anyone who is interested in evaluating the accuracy of texts produced by neural NLG systems.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Category: evaluation
