In our reading group at Aberdeen, we recently read about some systems which generate summaries of a sports (basketball) game from data about the game, using deep learning techniques. One thing we noticed was that the summaries were inaccurate. They had factual mistakes, where the summary contradicted the data (eg, incorrect numbers, or saying player X played for Team A when in fact he played for Team B). They also had hallucinations, where the summary included facts which were not in the data (eg, claiming that Team A’s next match was with Team X, when this was not in the data, and in reality Team A’s next match was with Team Y). And what really surprised me was that the authors did not seem to regard this as a major problem. I’ve seen this in other deep learning papers as well; authors are fixated on BLEU scores and perhaps human assessments of fluency, but regard factual accuracy as unimportant.
This is a bizarre perspective, because one thing that 30 years of NLG research has taught me is that readers of NLG texts care hugely about accuracy, and indeed prefer accurate-but-poorly-written texts over inaccurate-but-fluent ones. After all, if you need to make a decision from a text, you can probably still extract the information you need from an accurate but poorly written text, although it will be a hassle; whereas an inaccurate text may mislead you and cause you to make a poor decision.
We’ve also shown experimentally that users care more about accuracy than about readability; see page 552 of Belz and Reiter 2009.
So accuracy matters, and certainly is taken **very** seriously by Arria and other commercial NLG vendors (who worry about lawsuits as well as misleading clients). Hence I find it disappointing that many researchers place so little importance on it. As long as they keep on ignoring accuracy, their research will have little relevance to the real world, and indeed they will find that users prefer boring-but-accurate templated texts over fluent-but-inaccurate texts produced by their whizzy neural systems.
If we care about accuracy in generated texts, we will need to evaluate it. How do we do this? I think there are two aspects to think about:
1. Is everything in the text factually correct?
2. Is everything in the text derivable from the data?
(2) is essentially a check for hallucination. For example, assume that a certain team usually-but-not-always sings a victory song after winning a game. If a generated text says that the team sang their victory song but this is not in the source data, then I regard this as inaccurate even if the team did in fact sing their victory song on this occasion.
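To make this concrete, here is a minimal Python sketch of the derivability check in point (2). The tuple-based message representation and the example facts are my own illustrative assumptions, not taken from any real system.

```python
# Minimal sketch of the "derivable from the data" check (point (2) above).
# The message format (simple tuples) and the example facts are hypothetical,
# purely for illustration; real messages would be richer structures.

# Facts actually present in the source data for this game
source_data = {
    ("Team A", "final_score", 98),
    ("Team A", "result", "win"),
}

# Messages extracted from the generated summary
extracted_messages = [
    ("Team A", "final_score", 98),           # supported by the data
    ("Team A", "sang_victory_song", True),   # perhaps true in reality, but not in the data
]

def find_hallucinations(messages, data):
    """Return every extracted message that cannot be traced back to the source data.

    A message that happens to be true in the real world still counts as a
    hallucination if the system could not have derived it from its input."""
    return [msg for msg in messages if msg not in data]

for msg in find_hallucinations(extracted_messages, source_data):
    print("Not derivable from the source data:", msg)
```

The key point is that the check is against the system’s input data, not against reality: a true-but-underivable statement still gets flagged.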
Of course, content evaluations of NLG systems need to look at coverage as well as accuracy; did the generated texts communicate the key messages and insights which the user needs or wants to know? But I’ll ignore this aspect of content quality here, although it is of course hugely important!
So how do we evaluate a generated text for accuracy? I suggest the following (a rough code sketch of the overall process follows the list):
- Analyse the text and extract the messages it communicates. With a rule-based NLG system, we can perhaps directly get semantic content from the output of the document planner. However, when evaluating neural NLG, I think we will need to parse and analyse the generated text.
- If we have trusted reference texts, we can check the extracted messages to see if they are in any of the reference texts. If so, they are probably correct, although we should also check that they are derivable from the data ((2) above).
- Fact-check any messages which are not in the reference texts against the source data. The fact that a message is not in the reference texts does not make it wrong; there are usually plenty of innocuous and valid messages which can be added to a summary.
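The skeleton below puts these steps together. It is only a sketch under obvious assumptions: extract_messages is a stub (in practice this step needs parsing and analysis, and probably a human annotator), messages are assumed to be simple comparable items, and all the names are mine rather than those of an existing tool.

```python
# Rough skeleton of the accuracy-evaluation process described above.
# Everything here is illustrative; it is not an implemented system.

def extract_messages(text):
    """Extract the messages a text communicates.

    For a rule-based system these might come straight from the document
    planner; for neural output this needs parsing and analysis (or a human
    annotator), so it is left as a stub here."""
    raise NotImplementedError("message extraction is domain-specific")

def evaluate_accuracy(generated_text, reference_texts, source_data):
    """Return a worklist of messages that a (human) fact-checker must examine."""
    generated_msgs = extract_messages(generated_text)

    reference_msgs = set()
    for ref in reference_texts:
        reference_msgs.update(extract_messages(ref))

    needs_fact_check = []
    for msg in generated_msgs:
        if msg in reference_msgs:
            # Probably correct, but must still be derivable from the data ((2) above)
            if msg not in source_data:
                needs_fact_check.append((msg, "in references but not derivable from data"))
        else:
            # Not in any reference text: not necessarily wrong, so hand it
            # to a fact-checker with access to the source data
            needs_fact_check.append((msg, "not in reference texts"))

    return needs_fact_check
```

Note that the reference texts only let us short-circuit the easy cases; everything else ends up on the worklist for a human fact-checker.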
In an ideal world, much or indeed all of the above could be automated. However, in 2019, we will need to involve people in the above process, especially the last fact-checking step. Note that because fact-checking is time-consuming, requires domain knowledge, and must be done carefully and consistently, we probably cannot use Mechanical Turk and similar crowdsourcing platforms.
One important point is that we cannot use metrics such as BLEU to evaluate the accuracy of generated texts! Indeed, in general BLEU is useless at evaluating the content quality of generated texts (Belz and Reiter 2009).
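A toy example (mine, not from any paper) shows why. The sentences below are invented, and the scores are computed with NLTK’s sentence-level BLEU, but the pattern is the point: a fluent sentence with the two scores swapped barely loses any BLEU, while an accurate but clunky rendering is heavily penalised.

```python
# Toy illustration: BLEU rewards n-gram overlap, not factual accuracy.
# Sentences are invented; nltk is assumed to be installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the Raptors beat the Spurs 114 to 105 on Friday".split()
fluent_but_wrong = "the Raptors beat the Spurs 105 to 114 on Friday".split()   # scores swapped
accurate_but_clunky = "Raptors 114 , Spurs 105 , Friday".split()               # correct facts

smooth = SmoothingFunction().method1
print("fluent but wrong:   ", sentence_bleu([reference], fluent_but_wrong, smoothing_function=smooth))
print("accurate but clunky:", sentence_bleu([reference], accurate_but_clunky, smoothing_function=smooth))
# The factually wrong sentence gets by far the higher BLEU score, because BLEU
# only measures word overlap with the reference text.
```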
In short, evaluating the accuracy of generated texts is a hassle (and I speak from experience, not just theory), because we need a lot of human input and it is difficult to use crowdsourcing. Which is probably why many researchers avoid doing it. But it is important and needs to be done!