I’ve written a number of blogs about the importance of accuracy; texts produced by NLG systems must be accurate! But of course in the real world 100% accuracy is usually not possible, and indeed human writers make mistakes; they are not 100% accurate either. So perhaps a better goal for NLG systems is to be at least as accurate as human writers.
Three of my PhD students (Francesco Moramarco, Barkavi Sundararajan, and Craig Thomson) have recently explored how many mistakes human authors (professionals, not Turkers) make when they do NLG-like tasks. Results are very interesting, not least because the number is considerably higher than we expected. I summarise some of their findings below.
Human errors in summarising consultations
Francesco has been looking at summarising doctor-patient consultations; this is an important real-world task since a summary of such consultations must be entered into the patient’s medical record. He and his collaborators have a paper at ACL22 (arxiv link) which describes an experiment where doctors were asked to post-edit and correct summaries of consultations. While most of these summaries were produced by NLP systems, some of them were written by people (other doctors), so Francesco’s experiment sheds light on how many mistakes doctors make when they summarise a consultation. Note that the doctors were summarising a consultation done with an actor, not with a real patient, so they knew that their actions would have no impact on actual patient care.
Anyway, on average the post-editing process identified 3.9 omissions and 1.3 incorrect statements in each human summary. The omissions largely reflect differences in clinical opinion as to which information is sufficiently important to be included in the summary; some doctors write comprehensive summaries and others focus on a smaller number of key facts. The incorrect statements are more surprising, and perhaps reflect the time constraints and need for multi-tasking in a medical consultation setting. These are described in detail in the paper, but at a high level:
- Some of these “incorrect statements” were mistakes by the post-editors, not the original authors.
- Some of these incorrect statements were based on clinical inferences which were likely-but-not-guaranteed to be true; Thomson and Reiter (2020) made a similar point about inferring stadium from home team when describing a basketball game. This raises the question of how we define “accuracy”; is an inference with a 99% probability of being correct an accuracy error?
- Some of these incorrect statements were clear errors; many of these occurred when reporting relatively unimportant details.
Unfortunately, deciding which category each reported error falls into requires considerable domain expertise, so Francesco was not able to present a statistical breakdown of how many errors fell into each category.
Human errors in writing sports stories
Craig and Barkavi have been looking at accuracy errors in sports stories written by sportswriters. They essentially used a modified version of the protocol of Thomson and Reiter (2020) (which was developed to find mistakes in texts from neural NLG systems) to find mistakes in some of the human-authored reference summaries in the SportSett corpus.
This work has not yet been formally published, so I don’t want to go into a lot of detail here. But overall they found that a 300-word human-authored sports story contained 1.5 errors on average. This is much lower than the number of errors found in the neural NLG texts (15-20 errors on average, depending on the system)! But it was much higher than we expected, and shows that Thomson and Reiter (2020) is incorrect when it states (page 165) that the acceptable error rate for such stories “almost certainly would need to be less than one error per story” (mea culpa: I wrote this sentence).
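To put these figures side by side, here is a small Python sketch of the arithmetic. The numbers are the ones quoted above; the per-100-word normalisation and the function name are my own illustration, not part of Craig and Barkavi's protocol. The ratio between neural and human error counts is only meaningful if the neural stories are of comparable length to the human ones, which is an assumption here.

```python
# Figures quoted in the post. Everything else in this sketch
# (names, the per-100-word normalisation) is illustrative only.
STORY_LENGTH_WORDS = 300
HUMAN_ERRORS_PER_STORY = 1.5
NEURAL_ERRORS_PER_STORY = (15.0, 20.0)  # low and high end, depending on the system


def errors_per_100_words(errors_per_story: float,
                         story_length: int = STORY_LENGTH_WORDS) -> float:
    """Normalise a per-story error count to errors per 100 words."""
    return errors_per_story * 100 / story_length


human_rate = errors_per_100_words(HUMAN_ERRORS_PER_STORY)

# How many times more errors the neural texts contain per story.
# ASSUMPTION: neural and human stories are of comparable length.
neural_to_human = tuple(n / HUMAN_ERRORS_PER_STORY for n in NEURAL_ERRORS_PER_STORY)

print(f"Human: {human_rate:.2f} errors per 100 words")
print(f"Neural texts: {neural_to_human[0]:.1f}x to {neural_to_human[1]:.1f}x the human error count")
```

So human sportswriters make roughly one error every 200 words, and the neural systems make something like ten or more times as many.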
Again I don’t want to go into detail here, but most of the errors that Craig and Barkavi found were incorrect numbers, often because the sportswriter seemed to have copied the wrong number into the story (which is also a common error made by neural NLG systems).
Note that it is possible that users will have higher expectations and standards for NLG systems than for people. We’ve seen in other areas of AI (including medical diagnosis and autonomous vehicles) that an error rate which is tolerated for humans is not acceptable for AI systems. If the same applies to NLG, then better-than-human accuracy will be essential for real-world usage.
Implications for NLG
I have said elsewhere that NLG texts must be accurate, as have others. But the work of Francesco, Barkavi, and Craig shows that we need to be more nuanced in our expectations, including the following points:
- We need to define what we mean by “accuracy”. Should an inference which is likely-but-not-certain be treated as an accuracy error?
- We need to understand what level of accuracy is expected by users in different contexts. I suspect I was wrong when I wrote that one accuracy error (on average) in a sports story is unacceptable.
- We probably should differentiate between important and unimportant accuracy errors, especially in task-oriented contexts such as medicine.
Finally, I think the lower-than-expected accuracy of human texts could be an opportunity for NLG. It suggests that rule-based NLG systems in particular can potentially be more accurate than human writers in some use cases (such as sports reporting), which could encourage more people to use NLG!