After INLG 2019, a group of us started talking about setting up shared tasks on evaluating NLG systems. That is, shared tasks where the goal was not to show that your system produces the best texts, but rather that your evaluation technique gives the best assessment of the quality of texts generated by NLG systems. We are especially keen to explore different ways of doing human evaluations (van der Lee et al 2019). If anyone is interested, we have a mailing list and are planning to meet up in Edinburgh on 13 March.
Anyways, at Aberdeen we have come up with an idea for a shared task on evaluating the accuracy of texts produced by NLG systems. I describe what we are thinking of below. I’d love to get feedback about what we have in mind. And if anyone thinks they might want to participate in the challenge, please let me know! If there isn’t much interest in a shared task, we’ll just run experiments ourselves, for example to measure the effectiveness of Turkers compared to domain experts in detecting accuracy problems.
We will test the proposed evaluation techniques on sports summaries produced from the Rotowire dataset (Wiseman et al 2017). Rotowire texts are much longer than texts produced in the E2E and WebNLG domains, and have many different kinds of accuracy problems, so we think this is a good domain for evaluating accuracy.
Below is an example Rotowire output text from Table 2 of Wiseman et al, which includes accuracy errors in red (most of these errors are mentioned in the paper, but I have added a few additional ones which were not in the paper):
The Utah Jazz ( 38 – 26 ) defeated the Houston Rockets ( 38 – 26 ) 117 – 91 on Wednesday at Energy Solutions Arena in Salt Lake City . The Jazz got out to a quick start in this one, out – scoring the Rockets 31 – 15 in the first quarter alone . Along with the quick start , the Rockets were the superior shooters in this game , going 54 percent from the field and 43 percent from the three – point line , while the Jazz went 38 percent from the floor and a meager 19 percent from deep . The Rockets were able to out – rebound the Rockets 49 – 49 , giving them just enough of an advantage to secure the victory in front of their home crowd . The Jazz were led by the duo of Derrick Favors and James Harden . Favors went 2 – for – 6 from the field and 0 – for – 1 from the three – point line to score a game – high of 15 points , while also adding four rebounds and four assists ….
We can see that there are many different kinds of accuracy mistakes:
- Incorrect value which can easily be checked against the data (eg, nba.com). For example, the Jazz lost the match; they did not defeat Houston.
- Incorrect value which requires a bit of effort to check. For example, James Harden played for the Rockets, not the Jazz. Checking this requires searching through the names of all players on both teams.
- Incorrect value which requires looking elsewhere. We have seen examples in other stories where incorrect statements are made about previous games (see Appendix). Checking this requires looking up data on the previous games.
- Inappropriate word usage. It is not appropriate to say that The Jazz got out to a quick start in this one if we look at their actual performance (see game flow on ESPN). It is also not appropriate to say that Derrick Favors led the Jazz, since other players scored more points, had more rebounds, etc; and it is questionable to say that the Rockets had just enough of an advantage to win (the game was not close). Note that detecting inappropriate word usage requires some knowledge of the domain and of how people write about basketball games.
- Inappropriate word usage requiring extra knowledge. We have seen other examples where misleading words are used about previous games (see Appendix).
The above is not meant as a criticism of Wiseman’s work!! I’ve seen similar accuracy errors in texts produced by other Rotowire systems; I give an example in the Appendix.
So anyways, from the perspective of evaluation, the challenge is devising techniques (protocols for human evaluation, or indeed automatic metrics) which can detect accuracy errors of the above types. We could of course extend the above list, for example by including cases where the system “hallucinates” content which is not in its input data but (purely by luck) is accurate.
The process we used to identify the above errors in the stories required domain knowledge of basketball and took about 30 minutes per text. It would be great if participants in the shared task could figure out a reliable way of identifying accuracy mistakes which was quicker and/or did not rely on specialist knowledge!
Our idea is something like the following:
Task: Develop a protocol for human evaluations, automatic metrics, or some combination of these which identifies inaccurate words, phrases, or sentences in a basketball summary. Mistakes should be reported by annotating a single key number, word or name in the story, such as 38, James Harden, or led. The annotation does not need to indicate the type of accuracy mistake.
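To make the reporting format a bit more concrete, here is a minimal sketch of how an annotation could be represented. Everything in it is an assumption made for illustration: the post does not fix a file format, so the `Annotation` class, the whitespace tokenisation, and the `find_span` and `annotate` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One reported accuracy mistake, as a token span in a story.

    Hypothetical format: token indices refer to the story split on whitespace.
    """
    story_id: str
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span


def find_span(tokens, phrase):
    """Return (start, end) for the first occurrence of phrase, or None.

    Only the first occurrence is found, so a real protocol would need a
    way to point at later occurrences of a repeated word or number.
    """
    target = phrase.split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return i, i + len(target)
    return None


def annotate(story_id, text, phrases):
    """Build annotations for each phrase that occurs in the story text."""
    tokens = text.split()
    annotations = []
    for phrase in phrases:
        span = find_span(tokens, phrase)
        if span is not None:
            annotations.append(Annotation(story_id, span[0], span[1]))
    return annotations


# A sentence from the example story above, with two mistakes marked.
story = "The Jazz were led by the duo of Derrick Favors and James Harden ."
marks = annotate("example-1", story, ["led", "James Harden"])
for mark in marks:
    print(mark)
```

Note that an annotation here carries no mistake type, in line with the task description; only the gold standard would need to record types.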
Scoring: We will separately compute
- recall on each of the five types of accuracy mistakes mentioned above, eg Incorrect value which can easily be checked against the data
- overall precision (ie, how many of the accuracy annotations represented genuine mistakes)
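As a rough sketch of how these two scores might be computed, the snippet below assumes that a reported annotation counts as correct only if it points at exactly the same token as a gold-standard annotation. That matching rule is my assumption for illustration, not something we have decided, and the `score` function and the mistake-type labels are hypothetical names.

```python
from collections import defaultdict


def score(gold, predicted):
    """Compute per-type recall and overall precision.

    gold:      dict mapping (story_id, token_index) -> mistake type
    predicted: set of (story_id, token_index) reported by a submission
    Matching is exact on the token position (an assumed rule).
    """
    total = defaultdict(int)  # gold mistakes per type
    found = defaultdict(int)  # gold mistakes per type that were reported
    for key, kind in gold.items():
        total[kind] += 1
        if key in predicted:
            found[kind] += 1
    recall = {kind: found[kind] / total[kind] for kind in total}

    # Precision: fraction of reported annotations that are genuine mistakes.
    correct = sum(1 for key in predicted if key in gold)
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


# Toy data: four gold mistakes of three types in one story.
gold = {
    ("s1", 3): "word_usage",
    ("s1", 12): "value_some_effort",
    ("s1", 20): "value_easy",
    ("s1", 25): "value_easy",
}
# The submission reports three tokens, one of which is not a mistake.
predicted = {("s1", 3), ("s1", 20), ("s1", 40)}

recall, precision = score(gold, predicted)
print(recall)
print(precision)
```

In this toy case the submission finds the word-usage mistake and one of the two easily checked values, misses the harder value, and has one false alarm, so precision is two out of three.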
Training data: Organisers will create a set of 200 stories about games, including
- stories produced by systems based on Rotowire (there are quite a few)
- human-written texts
- human-written texts with mistakes of the above types added
These stories will be annotated for accuracy mistakes based on careful fact-checking. This will constitute a “gold standard” for accuracy evaluation. This annotation will be done by participants as well as the organisers. Ie, organisers will release an initial set of 20 accuracy-annotated stories, and participants will each be asked to annotate an additional 20 stories.
If participants want more data on additional games, they are free to obtain and annotate additional stories!
The organisers will also create a test set of 20 annotated stories.
Timeline (maybe too ambitious??)
March: Task announced with a few example accuracy-annotated stories. Participants enrol and are asked to annotate stories.
April: Full training data is available.
July: Participants submit papers about their approaches. Organisers release the stories in the test set (not the gold-standard accuracy markup). Participants have two weeks to run the test set through their approach, and report their results.
August: Organisers compute precision and recall of each submission, and write a summary paper.
September: Shared task presented at INLG (either main conference or a workshop).
Appendix: Another example of accuracy errors in a Rotowire story
As mentioned above, accuracy errors are common in stories produced by Rotowire systems. Craig Thomson has analysed a few texts produced by the system of Puduppully et al 2019; below I give an extract from one of Craig’s analyses. Note that Craig ran the code himself, so this example is not from the paper.
The Atlanta Hawks ( 46 – 12 ) defeated the Orlando Magic ( 19 – 41 ) 95 – 88 on Wednesday at Philips Arena in Atlanta … The Hawks also dominated the rebounding , winning that battle , 28 – 16 . The Hawks were led by Jeff Teague , who finished with 17 points ( 6 – 15 FG , 1 – 4 3Pt , 4 – 4 FT ) , seven assists and two steals . He ‘s been extremely effective over his last two games , combining for 39 points , 14 rebounds and five assists . … Both Jodie Meeks and Kyle O’Quinn reached double figures off the bench .
This fragment again illustrates many different kinds of accuracy mistakes:
- Incorrect value which can easily be checked against the data (eg, nba.com). For example, the Hawks had 42 rebounds, not 28.
- Incorrect value which requires a bit of effort to check. For example, Jodie Meeks did not play for either of these teams (he played for the Detroit Pistons at the time).
- Incorrect value which requires looking elsewhere. For example, the game was played on a Friday, not on a Wednesday. Also, Teague got 3 rebounds over the last two games, not 14 rebounds. Checking this requires looking up the previous game.
- Inappropriate word usage. For example, it is not accurate to say The Hawks were led by Jeff Teague, since another player (Millsap) scored more points, and a third player (Horford) had the best overall stats. It is also questionable to say The Hawks dominated the rebounding, since they had 42 rebounds compared to 40 from the other team. Can we use the word “dominating” when the difference is so small?
- Inappropriate word usage requiring extra knowledge. For example, it is incorrect to say that Teague has been extremely effective over his last two games. Again, checking this requires looking up the previous game and assessing whether his performance constitutes being extremely effective.