Book chapter: Evaluation

A key issue in NLG is evaluation: assessing how well an NLG system works. Does it produce texts which are acceptable in its use case, based on the relevant quality criteria, workflow, and stakeholders (see the Requirements chapter)? Evaluation can also highlight where systems are weak and need to be improved.

This chapter is the longest in the book, both because it is important and because I am very interested in evaluation. It looks at different types of evaluation (human-based, metric-based, impact, and commercial). As with other parts of the book, it focuses on high-level concepts and issues, not the latest developments.

Sections are:

Example: Smoking Cessation (impact evaluation)
Fundamentals
Human evaluation
Automatic evaluation
Impact evaluation
Commercial evaluation
Ten Tips on Evaluating NLG
Further reading

Resources: Selected blogs

Challenges in Evaluating LLMs
Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful
Examples of evaluating real-world impact
Future of NLG evaluation: LLMs and high quality human eval?
Humans make mistakes too
One-day class on NLG evaluation
Qualitative evaluation (NOTE: This topic is not covered in the book, but it should have been)
Ten tips on doing a good evaluation
We need better LLM benchmarks (NOTE: This topic is not covered in the book)

Resources: Talks

Automatic evaluation (Ehud) (PDF)
Challenges in Evaluating LLMs (Ehud) (PDF)
Evaluation Concepts (Ehud) (PDF)
Human evaluation (Ehud) (PDF)

Resources: Best Practice, Guides, and Surveys

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice (paper)
Huggingface LLM evaluation guidebook (Github)
Human evaluation of automatically generated text: Current trends and best practice guidelines (paper)
Improving Your Statistical Inferences (paper)
Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text (paper)
The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing (paper)

Resources: Web sites

Hands-on Evaluation Exercise (Google Form) (part of my One-day class on NLG evaluation)
Huggingface evaluation library (automatic metrics) (link)

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
