A key issue in NLG is evaluation, in other words assessing how well an NLG system works. Does it produce texts which are acceptable in its use case, based on the relevant quality criteria, workflow, and stakeholders (Requirements chapter)? Evaluation can also highlight where systems are weak and need to be improved.
This chapter is the longest in the book, both because it is important and because I am very interested in evaluation. It looks at different types of evaluation (human-based, metric-based, impact, and commercial). As with other parts of the book, it focuses on high-level concepts and issues, not the latest developments.
Sections are:
- Example: Smoking Cessation (impact evaluation)
- Fundamentals
- Human evaluation
- Automatic evaluation
- Impact evaluation
- Commercial evaluation
- Ten Tips on Evaluating NLG
- Further reading
Resources: Selected blogs
- Challenges in Evaluating LLMs
- Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful
- Examples of evaluating real-world impact
- Future of NLG evaluation: LLMs and high quality human eval?
- Humans make mistakes too
- One-day class on NLG evaluation
- Qualitative evaluation (NOTE: This topic is not covered in the book, although it should have been)
- Ten tips on doing a good evaluation
- We need better LLM benchmarks (NOTE: This topic is not covered in the book)
Resources: Talks
- Automatic evaluation (Ehud) (PDF)
- Challenges in Evaluating LLMs (Ehud) (PDF)
- Evaluation Concepts (Ehud) (PDF)
- Human evaluation (Ehud) (PDF)
Resources: Best Practice, Guides, and Surveys
- A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice (paper)
- Hugging Face LLM evaluation guidebook (GitHub)
- Human evaluation of automatically generated text: Current trends and best practice guidelines (paper)
- Improving Your Statistical Inferences (paper)
- Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text (paper)
- The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing (paper)
Resources: Web sites
- Hands-on Evaluation Exercise (Google Form) (part of my One-day class on NLG evaluation)
- Hugging Face evaluation library (automatic metrics) (link)