Let's use error annotations to evaluate systems!

Markus Freitag gave a great keynote at the Human Evaluation workshop last week, where he argued that (A) high-quality human evaluation is essential to progress, and (B) asking experts to find errors in texts is often a better human evaluation than soliciting Likert-like ratings. Markus gave some wonderful real-world examples of (A), where high-quality human evaluations gave a much better indication of which approaches are promising than metrics or lower-quality human evaluation; I loved this!

Here I want to focus on (B), because this is also something I agree with. In an old blog, I said that human evaluations were either

  • extrinsic: task-based evaluations where we measure the impact of an NLG system on real-world outcomes such as decision quality
  • intrinsic: evaluations where we ask subjects to rate or rank texts on the basis of characteristics such as readability, accuracy, and usefulness.

Other writers about human evaluation of NLG have similarly assumed that NLG evaluations are based on either task performance or ratings/rankings.

Recently, though, we’ve seen growing interest in a different type of evaluation, where subjects who are knowledgeable in the field (ie, not random Turkers) annotate errors in a generated text, typically assigning a classification and sometimes assigning a severity. I think this is a really interesting development.

In a sense, evaluation-by-errors sits between task-based and rating/ranking-based evaluation. Its results are more meaningful than rating/ranking evaluation, but it's also usually more expensive and time-consuming to carry out. On the other hand, evaluation-by-errors is considerably cheaper, faster, and easier to organise than task-based evaluation, but its results are less meaningful. As such, it's a very useful addition to our set of evaluation techniques!


Machine translation: The example Freitag talked about was using MQM to evaluate machine translation (Freitag et al 2021). In this exercise, professional translators annotate errors in MT texts, assigning each error a category such as Accuracy/Omission or Fluency/Punctuation, and also a severity such as Minor or Major. A formula can be used to calculate an overall score for the text based on its errors. Freitag argues that this process gives a much more meaningful evaluation than asking Turkers to rate texts: the overall score is a better predictor of utility than Turker ratings, the scoring formula can easily be adjusted for different use cases, and the error annotations give much better insight as to what is working and what isn't.
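
To make the idea of a scoring formula concrete, here is a minimal sketch of a severity-weighted score. The weights, the ErrorAnnotation class, and the text_score function are my own illustrative assumptions, not the published MQM scheme; the point is only that the weights are a parameter which can be changed for a different use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated error (hypothetical structure, for illustration)."""
    category: str  # e.g. "Accuracy/Omission"
    severity: str  # "Minor" or "Major"


# Illustrative weights only: an assumption, not the official MQM weighting.
DEFAULT_WEIGHTS = {"Minor": 1.0, "Major": 5.0}


def text_score(errors, weights=DEFAULT_WEIGHTS):
    """Sum the severity weight of every annotated error; lower is better."""
    return sum(weights[error.severity] for error in errors)


# Hypothetical annotations for a single translated text.
annotations = [
    ErrorAnnotation("Accuracy/Omission", "Major"),
    ErrorAnnotation("Fluency/Punctuation", "Minor"),
    ErrorAnnotation("Fluency/Punctuation", "Minor"),
]

# A use case where major errors matter more just swaps in other weights.
strict_weights = {"Minor": 1.0, "Major": 10.0}

print(text_score(annotations))
print(text_score(annotations, strict_weights))
```

Because the annotations are kept separate from the weights, the same annotated data can be rescored for a new use case without asking the translators to do anything again.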

Data-to-text generation: Craig Thomson and I developed an error annotation procedure for data-to-text (Thomson and Reiter 2020). We asked Turkers with domain knowledge (who passed a qualifying test) to annotate errors in summaries of basketball games, focusing on content (semantic or pragmatic) errors. Annotators assigned a category such as Name, Number, or Word to each error; we didn't ask for severity information. If desired, systems can be compared based on the average number of errors in their texts. We believe this analysis is very important for understanding the quality of texts produced by data-to-text systems. In particular, it reveals content/hallucination errors (some of which are quite subtle), which are very important in data-to-text but often missed by current evaluations; the analysis also gives good insight as to where these systems need to be improved.
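
Comparing systems on average error counts could be sketched as follows. The system names, text ids, and annotations are made up, and the two helper functions are hypothetical; the one point worth noting is that texts with no annotated errors still have to count in the denominator.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (system, text id, error category).
annotations = [
    ("system_a", "game_1", "Number"),
    ("system_a", "game_1", "Name"),
    ("system_a", "game_2", "Number"),
    ("system_b", "game_1", "Word"),
]

# Every text each system produced, including texts with no errors at all.
texts = {
    "system_a": ["game_1", "game_2"],
    "system_b": ["game_1", "game_2"],
}


def mean_errors_per_text(annotations, texts):
    """Average number of annotated errors per text, for each system."""
    counts = Counter(system for system, _, _ in annotations)
    return {system: counts[system] / len(ids) for system, ids in texts.items()}


def errors_by_category(annotations):
    """Per-system counts of each error category."""
    by_category = defaultdict(Counter)
    for system, _, category in annotations:
        by_category[system][category] += 1
    return {system: dict(counter) for system, counter in by_category.items()}


print(mean_errors_per_text(annotations, texts))
print(errors_by_category(annotations))
```

The per-category breakdown is what gives the insight into where a system needs work; the single average is only useful for ranking systems.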

Summarisation: Another one of my students, Francesco Moramarco, has been evaluating summarisation systems (for doctor-patient dialogues) on the basis of both amount/time of post-editing needed (an extrinsic/task measure) and types of errors made (Moramarco et al 2022). The error analysis is done by doctors, who classified each error into a category such as Hallucination or Misleading Statement; doctors also indicated whether each error was Critical (ie, a binary severity assessment). Moramarco et al showed that post-edit time (the extrinsic measure) had a high correlation with the number of errors found by the doctors. As with the above examples, the error analysis also gave good insight on problems in the texts.
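
A check of this kind can be sketched by correlating per-text error counts with post-edit times. The numbers below are invented for illustration, and using Pearson's r is my assumption here; the paper may have used a different correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented data: one entry per generated summary.
error_counts = [0, 1, 2, 4, 5]
post_edit_seconds = [20.0, 35.0, 50.0, 95.0, 110.0]

r = pearson(error_counts, post_edit_seconds)
print(f"correlation between error count and post-edit time: {r:.3f}")
```

A high value on real data would suggest that the cheaper error annotation is a reasonable proxy for the more expensive extrinsic measure.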

My thoughts

The goal of evaluation is to assess how effective a system, model, or technique is in generating texts that are useful in real-world contexts. As such, there is no substitute for extrinsic task-based evaluation which directly measures this. However, extrinsic evaluation is expensive and time-consuming, and in some contexts is difficult to carry out because of ethical issues. It also may give us only limited insight on what needs to be fixed and improved.

If extrinsic evaluation is not possible, I increasingly believe that evaluation by error annotation is the best alternative. Specifically, I believe that the results of error-based evaluation will usually be more meaningful (better predictors of real-world utility) than asking crowdworkers to rate or rank texts, especially in contexts where content quality is of paramount importance. Evaluation-by-errors will certainly be more meaningful than automatic metrics! Furthermore, error analysis will give us excellent insights about where generation is failing and needs to be improved; in this regard it may be superior in many contexts to extrinsic evaluation, as well as rating/ranking and metric evaluations.

The research community is in the early stages of exploring how best to carry out error-based evaluations. I encourage interested researchers to jump in and contribute to this effort!
