Craig Thomson and I have just finished running a shared task on evaluating accuracy (i.e., finding factual mistakes) in texts produced by data-to-text neural NLG systems. The shared task will be presented at INLG 2021, and our summary paper is on arXiv, with datasets on GitHub. I think finding accuracy errors is a very important and interesting task, and I encourage other people to “have a go” using the data on the GitHub site!
Anyways, the shared task gave us insights about both the mistakes made by neural NLG systems, and also mistakes which were hard to detect by neural evaluation techniques. Among other things, we saw that neural NLG systems struggled with some words which have fairly clear rule-based definitions. I give some examples below.
First of all, I should explain the shared task. The goal was to find factual errors in summaries of basketball games which were produced from basketball box score data by neural NLG systems. Below is an extract from such a text, which has been manually annotated for factual errors (full details are given in a previous blog). Errors are underlined. The data for this game is available on basketball-reference.com.
The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3 – 2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59–42. Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 points.
This example shows different types of errors:
- Incorrect numbers: For example, 59–42 should be 46–52.
- Incorrect names: For example, Talking Stick Resort Arena should be US Airways Center.
- Incorrect word: the Grizzlies did not out-score the Suns
- Context error: Isaiah Thomas played for the Suns, but the above contextually implies he played for the Grizzlies
Participants in the shared task were given 60 manually annotated texts for training and development; we held back a test set of an additional 30 texts. Texts were around 300 words long on average, and contained 20 errors on average. That error rate (if I put on my “commercial” hat) is far too high for a real-world sports journalism application!
In the rest of this blog, I will focus on incorrect word errors. The others are also interesting; you can learn more about them in our paper.
Example: led
The most common incorrect word error in the training set was “led”. “Led” is interesting because it can be used in many different ways (“the team led at the half”, “player X led his team”, etc) and also its meaning can sometimes be fuzzy or vague. For example, when comparing two players A and B, if A scored slightly more points than B but B had many more rebounds and assists, we might say that player B “led” the team.
Because of this fuzziness, we hoped that neural NLG systems could learn how to use the word appropriately. But this was not the case: our systems made many mistakes, many of which were blatant (e.g., saying that a team “led at the half” when it was behind).
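To make the blatant case concrete, here is a minimal sketch of the kind of rule that would catch a wrong “led at the half”. It is only an illustration, not code from the shared task: the function name `led_at_half` and the per-quarter score lists are hypothetical, and the quarter splits are invented so that they add up to the 46–52 half-time score from the annotated example above.

```python
def led_at_half(team_quarters, opponent_quarters):
    """Return True if the team had strictly more points than its
    opponent after the first two quarters."""
    return sum(team_quarters[:2]) > sum(opponent_quarters[:2])


# Hypothetical quarter splits; only the half-time totals (46 and 52)
# come from the annotated example.
grizzlies_first_half = [24, 22]  # 46 points
suns_first_half = [27, 25]       # 52 points

if not led_at_half(grizzlies_first_half, suns_first_half):
    print("'The Grizzlies led at the half' would be an incorrect word error")
```

The fuzzy “player B led the team” sense is deliberately left out; it does not reduce to a single comparison like this.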
Example: double-double
The second most common incorrect word error in the training set was “double-double”. A double-double occurs when a basketball player has ten or more (double-digits) in exactly two of the following categories: points, rebounds, assists, steals, and blocks. Note that if a player has ten or more in three of the categories, this is called a triple-double (3 statistics in double-digits) rather than a double-double.
In any case, while double-double is easy to define via rules, it seemed to be a difficult concept for our neural NLG systems to learn.
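The rule-based definition above fits in a few lines. This is a rough sketch under my own assumptions: the dictionary keys and function names are hypothetical, and the stat lines are made up for illustration rather than taken from the shared task data.

```python
CATEGORIES = ("points", "rebounds", "assists", "steals", "blocks")


def double_digit_count(stats):
    """Count the categories in which the player has ten or more."""
    return sum(1 for category in CATEGORIES if stats.get(category, 0) >= 10)


def is_double_double(stats):
    # Exactly two categories in double digits.
    return double_digit_count(stats) == 2


def is_triple_double(stats):
    # Exactly three categories in double digits.
    return double_digit_count(stats) == 3


# Made-up stat lines for illustration.
player_a = {"points": 18, "rebounds": 12, "assists": 4, "steals": 1, "blocks": 2}
player_b = {"points": 22, "rebounds": 10, "assists": 11, "steals": 0, "blocks": 0}

print(is_double_double(player_a))  # two categories at ten or more
print(is_double_double(player_b))  # three categories, so a triple-double instead
```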
Example: only other
The above examples refer to corpus texts. If we look at the submissions to the shared task, they struggled to detect certain kinds of errors, including the use of “only other” in statements such as “The only other Net to reach double figures in points was Ben McLemore.” Note that this usage of “only other” suggests that (A) McLemore scored at least 10 points, (B) other Net players scored at least 10 points, and (C) all of these other Net players were previously mentioned in the text.
In other words, “only other” has a clear rule-based definition, but it is complex, and depends on what was previously mentioned in the text and the performance of other players as well as the performance of the player in question. This seems to be difficult for neural systems to learn.
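Conditions (A) to (C) can be written down as a small check. The sketch below is one possible reading of the rule, not how any shared task system worked: the function name, the `already_mentioned` set (players the text has already credited with double figures), and all point totals are hypothetical.

```python
def only_other_is_accurate(player, team_points, already_mentioned, threshold=10):
    """Check a claim of the form 'The only other <team> player to reach
    double figures in points was <player>'.

    team_points: dict mapping each player on the team to points scored.
    already_mentioned: players the text has already said reached double figures.
    """
    scorers = {name for name, pts in team_points.items() if pts >= threshold}
    others = scorers - {player}
    return (
        player in scorers                      # (A) the player reached double figures
        and len(others) > 0                    # (B) so did at least one teammate
        and others <= set(already_mentioned)   # (C) all of them were mentioned earlier
    )


# Invented point totals for illustration.
team_points = {"Ben McLemore": 12, "Player A": 21, "Player B": 15, "Player C": 6}

print(only_other_is_accurate("Ben McLemore", team_points, {"Player A", "Player B"}))
print(only_other_is_accurate("Ben McLemore", team_points, {"Player A"}))
```

The second call fails condition (C), because Player B also reached double figures but was never mentioned. The check needs both the whole team's statistics and the discourse history, which is what makes this word harder than “double-double”.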
I once wrote a blog entitled Lexical Choice Needs Machine Learning, where I argued that word (lexical) choice in NLG should in part be learnt from data. I still believe this, but the above examples suggest that current neural NLG approaches are not sufficient for lexical choice. We need better ML approaches and/or to allow some words to be defined by rules. Perhaps there is a lesson here for other NLG tasks as well.