Book chapter: Evaluation

A key issue in NLG is evaluation: in other words, assessing how well an NLG system works. Does it produce texts that are acceptable in its use case, given the relevant quality criteria, workflow, and stakeholders (see the Requirements chapter)? Evaluation can also highlight where systems are weak and need to be improved.

This chapter is the longest in the book, both because evaluation is important and because I am very interested in it. It looks at different types of evaluation (human-based, metric-based, impact, and commercial). As with other parts of the book, it focuses on high-level concepts and issues, not the latest developments.
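
To make the distinction concrete, below is a minimal sketch of what metric-based (automatic) evaluation means: scoring a generated text against a human-written reference. The texts and the simple unigram-overlap metric are my own illustrative inventions, not examples from the chapter; real evaluations use established metrics such as BLEU or BLEURT.

```python
import string

def tokens(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def unigram_overlap(generated: str, reference: str) -> float:
    """Fraction of reference words that also appear in the generated text.

    A toy stand-in for a real automatic metric, used only to illustrate
    the idea of comparing system output against a reference text.
    """
    gen_words = set(tokens(generated))
    ref_words = tokens(reference)
    if not ref_words:
        return 0.0
    return sum(w in gen_words for w in ref_words) / len(ref_words)

# Hypothetical system output and human reference (not from the book)
generated = "Rainfall was well above average in March."
reference = "March rainfall was well above the monthly average."
print(f"overlap = {unigram_overlap(generated, reference):.2f}")  # overlap = 0.75
```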

Sections are:

  • Example: Smoking Cessation (impact evaluation)
  • Fundamentals
  • Human evaluation
  • Automatic evaluation
  • Impact evaluation
  • Commercial evaluation
  • Ten Tips on Evaluating NLG
  • Further reading

Resources: Selected blogs

Resources: Talks

  • Automatic evaluation (Ehud) (PDF)
  • Challenges in Evaluating LLMs (Ehud) (PDF)
  • Evaluation Concepts (Ehud) (PDF)
  • Human evaluation (Ehud) (PDF)

Resources: Best Practice, Guides, and Surveys

  • A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice (paper)
  • Huggingface LLM evaluation guidebook (Github)
  • Human evaluation of automatically generated text: Current trends and best practice guidelines (paper)
  • Improving Your Statistical Inferences (paper)
  • Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text (paper)
  • The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing (paper)

Resources: Web sites