Current neural NLG systems struggle to generate accurate texts. Many researchers now acknowledge this and are trying to address the problem, which is a very encouraging development. However, most of the research I see focuses on simple factual errors (hallucinations). These are certainly very important, but there are other ways in which NLG texts can be inaccurate; I describe some of these below.
I realise that it makes sense for researchers to focus on addressing simple accuracy problems first; I am absolutely fine with this! But I want to point out that resolving simple factual mistakes is not sufficient: we will have to address more complex mistakes as well if we want neural NLG systems to be used in contexts where accuracy matters.
Craig Thomson and I are looking at accuracy mistakes in summaries of basketball games produced by systems based on Rotowire or similar corpora. Below I show one such summary, produced by a neural NLG system and annotated for mistakes by someone who is knowledgeable about basketball. If anyone wants to check, the underlying data for this game is available at https://www.basketball-reference.com/boxscores/201501040NYK.html
The Milwaukee Bucks ( 18 – 17 ) defeated the New York Knicks ( 5 – 31 ) 95 – 82 on Wednesday at Madison Square Garden in New York . The Bucks were the superior shooters in this game , going 46 percent from the field and 25 percent from the three – point line , while the Knicks went just 41 percent from the floor and a meager 36 percent from beyond the arc . The Bucks were the superior shooters in this game , going 46 percent from the field and 25 percent from the three – point line , while the Knicks went just 41 percent from the floor and a meager 36 percent from deep . The Bucks also forced the Knicks into 18 turnovers , while committing just 11 of their own , which may have been the difference in this game , as the Bucks forced the Knicks into 21 turnovers , while committing just 11 of their own . The Bucks ‘ frontcourt did most of the damage in this game . Giannis Antetokounmpo led the team with 16 points ( 6 – 9 FG , 1 – 1 3Pt , 3 – 6 FT ) , 12 rebounds and two blocked shots , while Middleton had eight points ( 2 – 6 FG , 1 – 2 3Pt , 3 – 6 FT ) , five rebounds and two steals in 22 minutes . It was the first time he ‘s surpassed the 20 – point mark this season , so it was good to see him get things turned back around . Coming off the bench , Khris Middleton had eight points ( 2 – 6 FG , 1 – 2 3Pt , 3 – 3 FT ) , five rebounds and two steals in 22 minutes . The Bucks ‘ next game will be on the road against the Boston Celtics on Friday , while the Knicks will travel to Brooklyn to play the Nets on Friday
Everything that is red underlined is an error. Many of the errors are simple factual errors, such as the day the game was played (Sunday, not Wednesday) and the number of turnovers (the Knicks committed 16 turnovers, not 18 or 21; note that the NLG text is internally inconsistent as well as not matching real-world data). There are also aspects of this text which are not accuracy errors but would probably make it unacceptable to real users, such as the repetition of the second sentence. In any case, I want to focus on accuracy errors in the above which go beyond getting a number, day of the week, or name wrong.
Inappropriate use of words
This occurs when a text uses a word (including a vague term) in a way which is inappropriate. For example:
the Knicks went just 41 percent from the floor and a meager 36 percent from deep
Calling 36 percent meager is inappropriate since (A) it is above the league average (35%) and (B) it is above the other team’s performance (25%). The word “meager” does not have a crisp meaning, but this usage is way beyond what is acceptable.
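The reasoning above can be turned into a rough check. The sketch below is only illustrative: the rule (a value is defensibly "meager" only if it falls below at least one of the two reference points) and the function name are assumptions made for this example, not an established definition of the word.

```python
def is_meager_defensible(value, league_average, opponent_value):
    """Assumed rule: 'meager' is defensible only if the value is below
    the league average or below the opposing team's figure."""
    return value < league_average or value < opponent_value


# Figures from the example: Knicks 36% from three-point range,
# league average 35%, Bucks 25%.
print(is_meager_defensible(36, league_average=35, opponent_value=25))  # False
```

A real system would need thresholds grounded in how writers and readers actually use the word, since vague terms do not have crisp boundaries.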
A more complex example of this is
The Bucks ‘ frontcourt did most of the damage
Our annotator disputed this, since he didn't think the frontcourt players did better than the others, and indeed the top-scoring player was not in the frontcourt.
Incidentally, we asked for a second opinion on this example, and the second expert had a different view. A basketball team has 5 players: 2 forwards, 1 center, and 2 guards. “Frontcourt” is usually used to refer to the center and forwards, i.e., 3 players in all (although some people use the word to refer to the center and just one of the forwards). At any rate, if we interpret “frontcourt” to mean 3 players, then even if frontcourt players have the same performance as non-frontcourt players, they will still “do most of the damage” because they constitute most of the team (3 of the 5 players). Such a statement does not provide useful insights on the game, but it is still accurate. The second expert also pointed out that while the frontcourt players had similar scoring to non-frontcourt players, they did better at rebounds.
Hence, we can see that judging accuracy requires good knowledge of the domain and genre, and even then people may disagree.
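The second expert's counting argument is simple arithmetic, and can be sketched as follows. The assumption that every player contributes equally is the hypothetical from that argument, not a fact about this game, and the function names are made up for illustration.

```python
from fractions import Fraction

TEAM_SIZE = 5  # 2 forwards, 1 center, 2 guards


def share_if_equal(group_size, team_size=TEAM_SIZE):
    """Share of team output from a group, assuming every player performs equally."""
    return Fraction(group_size, team_size)


def does_most(share):
    """'Most of the damage' is read here as strictly more than half."""
    return share > Fraction(1, 2)


broad = share_if_equal(3)   # frontcourt = center + both forwards
narrow = share_if_equal(2)  # frontcourt = center + one forward
print(broad, does_most(broad))    # 3/5 True
print(narrow, does_most(narrow))  # 2/5 False
```

So whether the sentence is "accurate" under equal performance depends on which reading of "frontcourt" is used, which is one reason experts can disagree.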
Implying incorrect attributes
A good example of this is in another game summary we looked at (not the one above)
The Suns had six players reach double figures in points . Mike Conley led the way with 24 points …
In this case Mike Conley did indeed score 24 points, however he scored 24 points for the other team (not the Suns). The above narrative strongly suggests to readers that Conley played for the Suns, which is incorrect.
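A check for this kind of implied attribute might look roughly like the sketch below. It assumes (hypothetically) that some earlier step has already extracted, for each player mention, which team the narrative is discussing at that point; the data structures and function name are illustrative only. The source does not name Conley's team here, so it is recorded simply as the Suns' opponent.

```python
def implied_team_mismatches(mentions, player_team):
    """Return (team under discussion, player) pairs where the player
    does not actually belong to the team the narrative implies."""
    return [
        (team, player)
        for team, player in mentions
        if player_team.get(player) != team
    ]


# Roster lookup built from the game data (placeholder label for the other team).
player_team = {"Mike Conley": "Suns' opponent"}

# "The Suns had six players reach double figures ... Mike Conley led the way"
mentions = [("Suns", "Mike Conley")]

print(implied_team_mismatches(mentions, player_team))  # [('Suns', 'Mike Conley')]
```

Note that the individual fact (24 points) would pass a simple number check; it is only the implied team membership that is wrong.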
Another problem, which is not strictly accuracy but nonetheless confuses people, is when the position of a fact in a narrative suggests that it is important, when it was not important. For example,
The Milwaukee Bucks ( 18 – 17 ) defeated the New York Knicks ( 5 – 31 ) 95 – 82 on Wednesday at Madison Square Garden in New York . The Bucks were the superior shooters in this game…
Looking at the second error, superior shooters, our annotator complained that this suggested that superior shooting was a major contributor to the Bucks' victory, when he thought it was unimportant compared to differences in rebounds and free throws.
Another example is
Giannis Antetokounmpo led the team with 16 points ( 6 – 9 FG , 1 – 1 3Pt , 3 – 6 FT ) , 12 rebounds and two blocked shots , while Middleton had eight points ( 2 – 6 FG , 1 – 2 3Pt , 3 – 6 FT ) , five rebounds and two steals in 22 minutes
Again looking at the second error, Middleton, the annotator felt the mention of Middleton was misleading because it suggested he was a key player when he was not; 5 players on his team scored more points than Middleton did.
It is not clear whether incorrectly implying importance constitutes an “accuracy” error or hallucination; I’d be interested in hearing other people’s opinions about whether this is in fact an accuracy error. However, it certainly is something which could confuse or mislead readers, and hence would be unacceptable in a high-quality data-to-text system.
Accuracy is more than getting simple facts right
If we want NLG systems to be used in contexts where accuracy matters (which is true of every NLG use case I have ever worked on), then we need to ensure that none of the above types of mistakes occur! Eliminating simple factual errors (like the number of turnovers in the above example) is a good start, but it is only a start.
11 thoughts on “Accuracy Errors Go Beyond Getting Facts Wrong”
Can you advise one or two best practices when designing NLG, to try to avoid or overcome these inaccuracies?
In some cases, the “inaccuracies” could be attributed to a different point of view, opinion, or interpretation (as with the “superior shooters” example). If we want the NLG system to be absolutely accurate (and to be perceived as such), then I would recommend trying to avoid phrases that are borderline or could be interpreted differently by different people…
Hi, different people use words differently, which means that in “borderline” cases a usage may be acceptable to some people but not others. In our SumTime project many years ago, we did a fair amount of empirical work to understand how writers used words and how readers interpreted words, identified the words which were the least ambiguous, and had the NLG system use these. Results were good; indeed, some users preferred SumTime texts to human texts because our word usage was less confusing and ambiguous.
E Reiter, S Sripada, J Hunter, J Yu, and I Davy (2005).
Choosing Words in Computer-Generated Weather Forecasts.
Artificial Intelligence 167:137-169. (https://doi.org/10.1016/j.artint.2005.06.006)
E Reiter and S Sripada (2002).
Human Variation and Lexical Choice.
Computational Linguistics 28:545-553. (https://www.aclweb.org/anthology/J02-4007.pdf)