Perhaps the biggest challenge in neural NLG (including ChatGPT) is accuracy: neural systems generate incorrect texts, and this makes them unsuitable for many use cases. In order to address this problem, we need to be able to properly evaluate accuracy. So I am very happy that the Computer Speech and Language journal has just published a paper by Craig Thomson, Barkavi Sundararajan, and me on “Evaluating factual accuracy in complex data-to-text” (https://doi.org/10.1016/j.csl.2023.101482), which summarises the work we have done on this topic. I give highlights below, and strongly encourage anyone who is interested in accuracy issues to read the paper!
The paper is not open-access, unfortunately. However, it is available free of charge to all readers via this link until the middle of March. If you want to read it after mid-March and have problems accessing the paper, just let me know.
Background
Neural NLG systems “hallucinate” and generate texts which are incorrect. This was true in 2018 when we used LSTMs to generate 10-word restaurant descriptions (blog), and it is still true in 2023 when we use ChatGPT to generate complex multi-paragraph texts (blog). Indeed, to me this is the most disappointing aspect of neural NLG. We’ve seen huge progress in technology, and clear progress in the complexity and fluency of generated texts, but much less progress with accuracy. This is frustrating because accuracy is the biggest barrier to using the technology! For example, if we could modify ChatGPT so that it always generated accurate texts, it would become a **lot** more useful in all kinds of use cases.
If we want to improve accuracy, we need to be able to properly measure it. Unfortunately, existing techniques for measuring accuracy leave much to be desired. There are a number of metrics, many of which are based on identifying and counting facts which are supported or contradicted by the input data. However, none of these have been shown to reliably detect accuracy problems, especially if we include more complex accuracy errors such as misleading words and incorrect pragmatic inferences. For human evaluation, the dominant approach is to ask crowdworkers to rate accuracy on a Likert-like scale, or to rank a set of texts on the basis of accuracy; I don’t think this works well, and it also doesn’t give insights into the type and severity of the errors.
Because of this, back in 2020 Craig and I (later joined by Barkavi) decided to try to develop a protocol (technique) for detecting and measuring accuracy problems that relies on asking people with domain knowledge to annotate individual errors in a text. In other words, this is a human evaluation, but it’s based on error annotation, not on rating or ranking texts. I increasingly think that this is the best way to evaluate the output of NLG systems, and it’s interesting to see similar approaches being taken elsewhere, such as MQM evaluation in machine translation.
Our protocol
The heart of our protocol is very simple. We ask annotators to read the texts and find accuracy errors, and then annotate the type of each error. Annotators are asked to look for statements which are incorrect in the real world, not merely statements which disagree with the system’s input data. Since individual annotators can occasionally miss things or otherwise make mistakes, we recommend that each text be annotated by three people if possible, with the majority opinion being recorded.
Our core set of error types (more are needed in some domains) is:
- Incorrect number (including spelling out numbers as well as digits)
- Incorrect named entity (people, places, organisations, etc)
- Incorrect word (excluding above)
- Context error (statement is literally true (semantics) but misleading in context (pragmatics))
- Not checkable (annotator is not able to check accuracy of a statement)
- Other type of error
You can see examples in the paper, or indeed in one of my previous blogs.
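To make the protocol concrete, here is a minimal sketch in Python of how one might represent each annotator’s error markings and combine three annotators’ markings by majority vote. The data structures, field names, and exact-span matching are my own illustration, not the tooling used in the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Core error types from the protocol (some domains need additional ones).
ERROR_TYPES = {"number", "named_entity", "word", "context", "not_checkable", "other"}

@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotator's judgement: an incorrect span and its error type."""
    start: int        # character offset of the span in the generated text
    end: int
    error_type: str   # one of ERROR_TYPES

def majority_errors(annotations_per_annotator, min_votes=2):
    """Keep an error if at least `min_votes` annotators marked the same span,
    and record the majority error type for that span."""
    span_votes = Counter()
    span_types = {}
    for annotations in annotations_per_annotator:
        for ann in annotations:
            span = (ann.start, ann.end)
            span_votes[span] += 1
            span_types.setdefault(span, Counter())[ann.error_type] += 1
    agreed = []
    for span, votes in span_votes.items():
        if votes >= min_votes:
            error_type, _ = span_types[span].most_common(1)[0]
            agreed.append(ErrorAnnotation(span[0], span[1], error_type))
    return agreed
```

In practice annotators rarely mark exactly the same character span, so real aggregation needs some tolerance when matching spans; exact matching is used here only to keep the idea clear.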
Of course there are many details and special cases! Please read our paper to learn about these. In the paper, we also present data on annotator agreement, and on the results of an explicit verification exercise. The “bottom line” is that our protocol is not perfect and can miss things, but overall it does very well. Certainly I think it is far better than any other technique I have seen for detecting and measuring accuracy errors!
As described in the paper, our protocol can also be used to detect and measure accuracy errors in human-written texts, which can be a useful baseline/target for NLG systems. We did this in a sports reporting domain and were surprised by the number of errors we found (although this was still far fewer than the number of errors in the NLG texts we looked at).
Metrics
Our protocol is not cheap: it can cost US$30 to get three annotators to annotate a 300-word text. For this reason, we encourage researchers to use our protocol and results as a “gold standard” for evaluating cheaper techniques for assessing accuracy. In other words, if someone thinks they have developed a great metric (or a cheaper human evaluation) for evaluating accuracy, they can validate its effectiveness by measuring how well the results of their technique correlate with our gold-standard protocol.
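To illustrate what such a validation might look like, here is a small sketch (assuming Python with scipy available) that correlates per-text error counts from a candidate metric with the error counts from gold-standard human annotation. The numbers are invented for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-text error counts: gold-standard human annotation
# (our protocol) versus a candidate automatic metric, one value per text.
gold_error_counts   = [7, 2, 0, 5, 11, 3, 4, 1]
metric_error_counts = [6, 3, 1, 4, 9, 2, 5, 1]

# A metric is useful to the extent that it scores and ranks texts
# the same way the gold-standard annotation does.
pearson, _ = pearsonr(gold_error_counts, metric_error_counts)
spearman, _ = spearmanr(gold_error_counts, metric_error_counts)
print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

The higher the correlation, the more closely the cheaper technique agrees with the expensive gold-standard annotation.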
We ran a shared task on exactly this topic in 2021, which is described in the paper. Oversimplifying to some degree, the results suggested that metrics could detect most of the simpler errors (Name and Number), but struggled to detect the more complex errors (especially Context).
We encourage researchers who are interested in metrics for accuracy detection to use our data!
Final Thoughts
Accuracy is the biggest problem with neural NLG systems; this was true in 2018 and it remains true in 2023. Improving accuracy requires being able to measure it, and the protocol we developed works far better than anything else I have seen, at least for complex data-to-text tasks. I encourage anyone who cares about accuracy to read our paper. Needless to say, if you are interested in our protocol or ideas, feel free to contact me; I am happy to support and encourage more work in this area.
Citation
C. Thomson, E. Reiter, and B. Sundararajan (2023). Evaluating factual accuracy in complex data-to-text. Computer Speech and Language, vol. 80. DOI: https://doi.org/10.1016/j.csl.2023.101482