A huge problem in NLG in 2020, especially end-to-end neural NLG, is that our systems generate texts which are not accurate. I’ve been working in data-to-text NLG for a long time, both commercially and as a researcher, and almost every use case I’ve seen requires accurate texts. So we need to do better from an accuracy perspective!
And one thing that frustrates me about academic NLG research is that most of the academic papers I read do not tell me much about the accuracy of the texts produced by the authors’ systems or models. Usually I see metrics such as BLEU which are useless at evaluating accuracy (Belz and Reiter 2009). Sometimes I see human evaluations where subjects are asked to assess accuracy on a Likert scale or compare accuracy of different texts; this is better (and I’ve done this kind of evaluation myself), but it doesn’t give me detailed information about accuracy problems. Also, while I think Likert ratings of accuracy work for short simple texts, I suspect they are less meaningful for longer and more complex texts.
Anyways, instead of just complaining about the situation in my blog (although I have done plenty of this!), I decided to work with my student Craig Thomson to try to come up with a way of evaluating accuracy which we believe is robust, informative, and works for longer texts. I like to think this has been a true collaboration (where I did a fair bit of the work) instead of the usual student-supervisor paper.
Craig and I will present our work on A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems at INLG this year. We are working in the domain of generating summaries of sports matches, and essentially we ask human subjects to annotate mistakes in the generated texts (see example at end of this blog), using a protocol we have refined based on pilot experiments, and which can be used with Mechanical Turk workers. The protocol seems to work well; certainly there is good agreement between annotators, and also between annotations done by Turkers and annotations done by Craig and me. In addition to being a more reliable and “accurate” measure of accuracy than metrics or Likert ratings, our annotation protocol also identifies individual mistakes and classifies them into different types, including incorrect number, incorrect name, incorrect word (or phrase), and contextually misleading. NLG developers should be able to use this information to improve their systems.
We have only tested the protocol in the sports domain, but I think it can work in other data-to-text domains as well, and would love to get feedback from people who try this approach in other domains. Also the protocol of course can be improved!
While I believe our protocol is effective, it is not cheap; annotating a 300-word sports summary requires US$30 in Mechanical Turk fees, plus 30 minutes of experimenter time. Since this may be too expensive for many contexts, Craig and I will also launch at INLG a shared task for evaluating accuracy (previous blog). Participants in the shared task can propose cheaper ways of measuring accuracy (including both metrics and human protocols) which will be validated against the “gold-standard” protocol that Craig and I have developed. So if you think you have a clever way of evaluating accuracy based on a metric or a different kind of human evaluation, please let me know and join the shared task!
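To get a feel for what these numbers mean for a whole evaluation, here is a minimal Python sketch of the arithmetic. The per-summary figures come from the paragraph above; the class name `AnnotationBudget` and the assumption that costs scale linearly with the number of summaries are mine, for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AnnotationBudget:
    # Per-summary figures quoted above for a roughly 300-word sports summary.
    turk_fee_usd: float = 30.0
    experimenter_minutes: float = 30.0

    def estimate(self, n_texts: int) -> Tuple[float, float]:
        """Return (total fees in USD, total experimenter hours).

        Assumes cost grows linearly with the number of texts, which is
        an illustrative simplification rather than a measured result.
        """
        total_fees = n_texts * self.turk_fee_usd
        total_hours = n_texts * self.experimenter_minutes / 60.0
        return total_fees, total_hours


fees, hours = AnnotationBudget().estimate(100)
print(f"100 summaries: US${fees:.0f} in fees, {hours:.0f} experimenter hours")
```

Under that linear assumption, annotating 100 summaries would cost about US$3000 in fees plus 50 hours of experimenter time, which is why cheaper alternatives are worth exploring.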
Issue: Accuracy vs Selectivity
We need to define accuracy before we can ask people to annotate it, and this is not straightforward. This is discussed more in the paper, but I wanted to mention two high-level issues here: accuracy vs selectivity, and factual accuracy vs data accuracy.
Looking at the first of these, accuracy checks whether the information in a text is correct. This is not the same as evaluating whether the information is useful or important. In data-to-text contexts, there are usually zillions of accurate facts and insights which we can communicate in a text, and the NLG system should choose the most important facts and insights. This is called content selection.
For example, if an NLG system is summarising a patient’s demographics in a medical context, describing age and gender is usually much more useful than describing birthplace and month of birth.
In principle, I think we could measure the success of content selection by looking at the facts and insights communicated by the generated text, and comparing this to the facts and insights communicated by a gold standard text (which in the above example might mention age and gender). Recall (ensuring that we communicate important facts) will be more important than precision (including a few unimportant facts is a nuisance but probably acceptable). This perhaps has some similarity to the pyramid technique used in text summarisation.
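As a rough illustration of this idea, the sketch below computes precision and recall over sets of facts, using the patient demographics example. It assumes the facts have already been extracted from both texts and normalised to comparable labels, which is the hard part in practice; the function name and the fact labels are hypothetical.

```python
from typing import Dict, Set


def content_selection_scores(generated: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Compare facts in a generated text against facts in a gold standard text.

    precision: share of generated facts that also appear in the gold text
    recall:    share of gold facts that the generated text communicates
    """
    overlap = generated & gold
    precision = len(overlap) / len(generated) if generated else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical fact labels for the patient demographics example.
gold_facts = {"age", "gender"}
generated_facts = {"age", "birthplace", "month_of_birth"}

scores = content_selection_scores(generated_facts, gold_facts)
print(scores)
```

Here the generated text mentions age but omits gender, so recall is 0.5, and only one of its three facts is in the gold text, so precision is about 0.33. On the argument above, the missing gender (low recall) is the more serious problem.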
But anyways, the point I want to make here is that evaluating whether a text communicates the most important facts is different from evaluating whether a text is factually accurate. Both kinds of evaluation are hugely important, but I suspect factual accuracy comes first; if users do not trust an NLG text to be accurate, they will not use it regardless of the quality of its content selection.
Issue: Factual accuracy vs data accuracy
Another issue in evaluating accuracy in data-to-text systems is whether we (A) check if a text accurately communicates real-world information, or (B) check if it accurately communicates the system’s input data.
For example, assume the NLG system is generating a description of a sports game, let’s say a match where the Milwaukee Bucks defeated the Charlotte Hornets. Let’s further assume that the input data does not explicitly give the location of the match, but the generated text says:
The Milwaukee Bucks defeated the Charlotte Hornets in Charlotte.
Of course this is an accuracy error if the game was played somewhere else! But what if the game was in fact played in Charlotte (the NLG system got lucky in its guessing)? Is this an accuracy error?
Craig and I believe that the above statement should only be treated as an accuracy error if it is factually incorrect (option A above); it doesn’t matter whether this information is present in the input data. This is discussed in more detail in the paper, but one factor here is that we want to allow machine learning systems to make probable-but-not-perfect inferences. For example, if the input data does not give a location but does say that the game is a “home” game for Charlotte, it is very likely that the game was played in Charlotte, but there are some exceptions to this rule (e.g., a recent NBA Global Game where the Charlotte Hornets held a regular season “home” game in Paris).
In other words, a strict version of “accurately communicates data” would always consider “The Milwaukee Bucks played the Charlotte Hornets in Charlotte” to be false, regardless of the game’s actual location, even if we knew that there was a 99% probability that the game was played in Charlotte because it was a “home” game.
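The difference between the two options can be sketched in a few lines of Python. This is a toy illustration, not our annotation protocol: the function names are hypothetical, and a claim is reduced to a single string compared by exact match.

```python
from typing import Optional


def is_error_factual(claimed: str, real_world: str) -> bool:
    """Option A: an error only if the claim contradicts the real world."""
    return claimed != real_world


def is_error_strict_data(claimed: str, input_value: Optional[str]) -> bool:
    """Option B (strict): an error unless the input data explicitly supports the claim."""
    return input_value is None or claimed != input_value


claimed_location = "Charlotte"
input_location = None  # the input data does not give the location

# Case 1: the game really was played in Charlotte (a lucky guess).
lucky_a = is_error_factual(claimed_location, "Charlotte")
lucky_b = is_error_strict_data(claimed_location, input_location)

# Case 2: a "home" game actually held in Paris.
paris_a = is_error_factual(claimed_location, "Paris")
paris_b = is_error_strict_data(claimed_location, input_location)

print("Lucky guess: option A error?", lucky_a, "| option B error?", lucky_b)
print("Paris game:  option A error?", paris_a, "| option B error?", paris_b)
```

Option A flags the statement only in the Paris case, whereas the strict version of option B flags it in both cases because the location never appears in the input data.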
Other researchers may disagree and believe that accuracy should be based on faithfulness to input data instead of faithfulness to the real world. And perhaps different rules should apply in safety-critical domains such as medicine. We welcome further discussion in the community about this topic.
The NLG research community needs to take accuracy very seriously, and this includes developing robust ways of measuring accuracy as well as technologies which generate accurate texts. I hope that our work helps with the first goal, both by providing a robust technique for measuring accuracy in data-to-text systems, and also by providing a “gold standard” for researchers who are exploring cheaper and quicker ways to measure accuracy.
Appendix: Example Annotation
Below is an example of a text which has been annotated for factual errors (game data is available on basketball-reference.com). It is a shortened version of an example in our paper.
The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3 – 2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59–42. Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 points.
List of errors:
- 2: incorrect number, should be 0.
- Monday: incorrect named entity, should be Wednesday.
- Talking Stick Resort Arena: incorrect named entity, should be US Airways Center.
- strong: incorrect word, the Grizzlies did not do well in the first half.
- out-scored: incorrect word, the Suns had a higher score in first half.
- 59: incorrect number, should be 46.
- 42: incorrect number, should be 52.
- leading: incorrect word, Marc Gasol did not lead the Grizzlies, Mike Conley did with 24 points.
- Isaiah Thomas added: context error, Thomas played for the Suns, but context here implies he played for the Grizzlies and added to their score.
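For developers who want to work with annotations like these programmatically, the list above could be encoded and tallied by error type as in the sketch below. The `ErrorAnnotation` class and the short category labels are hypothetical and are not the data format used in the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    span: str      # the annotated words in the generated text
    category: str  # shorthand label for the error type
    note: str      # the annotator's correction or explanation


# The nine errors from the list above.
annotations = [
    ErrorAnnotation("2", "number", "should be 0"),
    ErrorAnnotation("Monday", "named entity", "should be Wednesday"),
    ErrorAnnotation("Talking Stick Resort Arena", "named entity", "should be US Airways Center"),
    ErrorAnnotation("strong", "word", "the Grizzlies did not do well in the first half"),
    ErrorAnnotation("out-scored", "word", "the Suns had a higher score in first half"),
    ErrorAnnotation("59", "number", "should be 46"),
    ErrorAnnotation("42", "number", "should be 52"),
    ErrorAnnotation("leading", "word", "Mike Conley led the Grizzlies with 24 points"),
    ErrorAnnotation("Isaiah Thomas added", "context", "Thomas played for the Suns"),
]

# Count how many errors fall into each category.
counts = Counter(a.category for a in annotations)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

A tally like this is one simple way to see which kinds of mistakes a system makes most often; in this short example, number and word errors are the most frequent, with three each.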
11 thoughts on “Evaluating Accuracy”
thank you. very interesting topics/issues!
As for the machine learning “auto complete” – letting the NLG system complete sentences by itself, based on machine learning, is interesting and might be acceptable for use cases such as weather and sports.
Looking at finance – completing a sentence about an investment portfolio by having the machine learning add something like “is of higher risk” due to allocation of certain assets, is unacceptable (while it could be factually correct based on every measure).
Ofer – Hi, you’re absolutely right that sometimes words/phrases are not just factual but also convey judgements and inferences. We see a bit of this in sports-reporting, but there would be a lot more in finance and medicine! Something we should think about.
More generally, the above is aimed at academics. From a commercial perspective, I guess you could describe what we’re doing as developing a quality-assurance protocol for finding factual mistakes in narratives generated from data. The protocol is time-consuming but well-enough defined that we can trust contractors to do it, provided contractors have some domain knowledge and pass a screening test.