Generated Texts Must Be Accurate!

In our reading group at Aberdeen, we recently read about some systems which generate summaries of a sports (basketball) game, from data about the game, using deep learning techniques.  One thing we noticed was that the summaries were inaccurate.  They had factual mistakes, where the summary contradicted the data (eg, incorrect numbers, or saying player X played for Team A when in fact he played for Team B).  They also had hallucinations, where the summary included facts which were not in the data (eg, claiming that Team A’s next match was with Team X, when this was not in the data, and in reality Team A’s next match was with Team Y).  And what really surprised me was that the authors did not seem to regard this as a major problem.  I’ve seen this in other deep learning papers as well; authors are fixated on BLEU scores and perhaps human assessments of fluency, but regard factual accuracy as unimportant.

This is a bizarre perspective because one thing that 30 years of NLG research has taught me is that readers of NLG texts care hugely about accuracy, and indeed prefer accurate-but-poorly-written texts over inaccurate-but-fluent texts.  After all, if you need to make a decision from a text, you can still probably extract the information you need from an accurate but poorly written text, although it will be a hassle.  Whereas an inaccurate text may mislead you and cause you to make a poor decision.

We’ve also shown experimentally that users care more about accuracy than about readability; see page 552 of Belz and Reiter 2009.

So accuracy matters, and certainly is taken **very** seriously by Arria and other commercial NLG vendors (who worry about lawsuits as well as misleading clients).  Hence I find it disappointing that many researchers place so little importance on it.  As long as they keep on ignoring accuracy, their research will have little relevance to the real world, and indeed they will find that users prefer boring-but-accurate templated texts over fluent-but-inaccurate texts produced by their whizzy neural systems.

Evaluating Accuracy

If we care about accuracy in generated texts, we will need to evaluate it.  How do we do this?  I think there are two aspects to think about:

  1. Is everything in the text factually correct?
  2. Is everything in the text derivable from the data?

(2) is essentially a check for hallucination.   For example, assume that a certain team usually-but-not-always sings a victory song after winning a game.  If a generated text says that the team sang their victory song but this is not in the source data, then I regard this as inaccurate even if the team did in fact sing their victory song on this occasion.
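To make the two checks concrete, here is a minimal sketch in Python.  It is purely illustrative and not from any real system: it pretends that extracted messages, the source data, and real-world facts can all be represented as simple comparable items in sets, which of course hides most of the real difficulty.

```python
from enum import Enum

class Verdict(Enum):
    OK = "ok"
    FACTUAL_MISTAKE = "factual mistake"   # fails check (1): contradicts reality
    HALLUCINATION = "hallucination"       # fails check (2): not derivable from the data

def check_message(message, source_data, world_facts):
    """Classify one extracted message under the two accuracy checks.

    Hypothetical simplification: world_facts is treated as the set of all
    true statements about the game, and source_data as the set of statements
    derivable from the input data.
    """
    if message not in world_facts:        # check (1): is it factually correct?
        return Verdict.FACTUAL_MISTAKE
    if message not in source_data:        # check (2): is it derivable from the data?
        # True in the real world but unsupported by the data still counts
        # as a hallucination (the victory-song example above).
        return Verdict.HALLUCINATION
    return Verdict.OK

# Example: the victory-song message is true but not in the data.
source_data = {"TeamA beat TeamB 102-98"}
world_facts = {"TeamA beat TeamB 102-98", "TeamA sang their victory song"}
print(check_message("TeamA sang their victory song", source_data, world_facts))
# -> Verdict.HALLUCINATION
```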

Of course, content evaluations of NLG systems need to look at coverage as well as accuracy: did the generated texts communicate the key messages and insights which the user needs or wants to know?  But I’ll ignore this aspect of content quality here, although it is of course hugely important!

So how do we evaluate a generated text for accuracy?  I suggest the following (a rough code sketch of this procedure appears after the list):

  • Analyse the text and extract the messages it communicates.  With a rule-based NLG system, we can perhaps directly get semantic content from the output of the document planner.  However, when evaluating neural NLG, I think we will need to parse and analyse the generated text.
  • If we have trusted reference texts, we can check the extracted messages to see if they are in any of the reference texts.  If so, they are probably correct, although we should also check that they are derivable from the data ((2) above).
  • Fact-check any messages which are not in the reference texts against the source data.  The fact that a message is not in the reference texts does not make it wrong; there are usually plenty of innocuous and valid messages which can be added to a summary.
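As a rough illustration of how this procedure might be organised (not a description of any real system), here is a sketch in Python.  The extract_messages and fact_check_by_human callables are placeholders for the hard parts: the former stands in for whatever parsing/analysis is used to pull messages out of a text, and the latter for a careful human fact-checker.

```python
def evaluate_accuracy(generated_text, reference_texts, source_data,
                      extract_messages, fact_check_by_human):
    """Hypothetical sketch of the three-step accuracy evaluation above.

    extract_messages(text) -> collection of messages (placeholder)
    fact_check_by_human(message, source_data) -> bool (placeholder)
    source_data is treated as a set of messages derivable from the input data.
    """
    # Step 1: extract the messages the generated text communicates.
    messages = extract_messages(generated_text)

    # Step 2: messages also found in a trusted reference text are probably
    # correct, but we still check that they are derivable from the data.
    reference_messages = set()
    for ref in reference_texts:
        reference_messages.update(extract_messages(ref))

    supported, flagged = [], []
    for msg in messages:
        if msg in reference_messages and msg in source_data:
            supported.append(msg)
        else:
            # Not appearing in a reference text does not make a message wrong;
            # it just means a human needs to fact-check it against the data.
            flagged.append(msg)

    # Step 3: human fact-checking of the remaining messages.
    errors = [msg for msg in flagged if not fact_check_by_human(msg, source_data)]

    return {
        "supported_by_references": supported,
        "fact_checked_by_human": flagged,
        "errors": errors,
    }
```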

In an ideal world, much or indeed all of the above could be automated.  However, in 2019 we will need to use people in this process, especially for the final fact-checking step.  Note that because fact-checking is time-consuming, requires domain knowledge, and must be done carefully and consistently, we probably cannot use Mechanical Turk or similar crowdsourcing platforms.

One important point is that we cannot use metrics such as BLEU to evaluate the accuracy of generated texts!  Indeed, in general BLEU is useless at evaluating the content quality of generated texts (Belz and Reiter 2009).

In short, evaluating the accuracy of generated texts is a hassle (and I speak from experience, not just in theory), because it requires a lot of human input and it is difficult to use crowdsourcing.  This is probably why many researchers avoid doing it.  But it is important and needs to be done!

7 thoughts on “Generated Texts Must Be Accurate!”

  1. Thanks for this great and timely post.
    IMHO there is another related issue that needs to be addressed as well: the quality of training data & reference texts.
    Many recent neural NLG papers describe experiments conducted on datasets (e.g., WikiBio for biography and RotoWire for basketball games) that were crawled from different sources, without manual verification or filtering. Unlike descriptions written manually (guided by clear instructions) for specified input structures, such automatically derived parallel data can be noisy in many examples, with texts containing information not mentioned in the input. Models trained on semantically mismatched input & output pairs may hallucinate by nature, and it makes little sense to compare different systems on such data in terms of accuracy evaluation.

  2. Respectfully, this strikes me as a somewhat strange take on what I presume to be the “Challenges in Data-to-Document Generation” and “Data-to-text Generation with Content Selection and Planning” papers. The whole point of the Challenges paper was that neural generation systems are not factually accurate, and it proposed some automatic methods for checking factualness (cf. your first bullet point). These factualness metrics have been adopted and improved upon by subsequent authors (especially the content selection and planning paper), and indeed many (deep learning) papers which use these datasets in fact report these factualness metrics alongside BLEU, precisely because BLEU has the drawbacks you mention.

    1. Sam – Hi, for clarity I am not referring to papers which you have authored. And I certainly appreciate that some neural NLG researchers do care about accuracy, although unfortunately many do not.

      About evaluation, though, while I appreciate that you conducted a human evaluation in your Challenges paper, I don’t think that asking Turkers to spot-check random sentences is a good way of evaluating accuracy. In order to do this properly, you need to get someone who understands the domain to read and “fact check” the summary as a whole. I appreciate that this is a lot of work (I’ve done this sort of evaluation many times)! If you think that your “spot check by Turkers” technique is a good way of estimating accuracy, then you should do a correlation study where you compare how well your technique predicts “gold standard” accuracy evaluations based on careful analysis by a domain expert.

  3. Hi Ehud – sorry for misinterpreting you, and thanks for your clarification and your point about the necessity of having experts doing the fact checking. I guess it’s less clear to me this is necessary for something as mundane as basketball game summaries (written for popular consumption), but you’re right that I haven’t shown that Turker evaluations correlate with expert evaluation.

    That said, there has been additional follow-up work (e.g., by Dhingra et al. at ACL 2019) showing that some of these automatic factualness metrics (including newer, better ones) do correlate with human (non-expert) judgments of factualness, and at least I believe that it’s going to be difficult to make progress in NLG without the benefit of such automatic metrics. In any case, while I can understand disagreeing with the claim that automatic factualness metrics can be useful, or even with the claim that non-expert factualness evaluations can be useful, I think my main point is that many in the neural NLG community do indeed care about generations being factual (and thus this line of research into automatic factualness metrics), even if we disagree about how best to establish that a generation is factual.

    1. Sam – Hi, as above I do appreciate that some neural NLG researchers (including you!) do care about accuracy; I just wish this was “all” rather than “some”.

      But anyways, about evaluation: we did some fact-checking in Aberdeen of the summaries in the papers we read, and discovered that we needed help from someone who knew more about basketball than we did; the task does require domain knowledge. Fact-checking is also time-consuming and must be done carefully, which means Mechanical Turk is not appropriate. So I am concerned about the fact that every paper I have read in this area (including yours and the Dhingra paper you cite) uses crowdsourcing for human evaluations. I think crowdsourcing can work for readability/fluency evaluations, but I am much more skeptical of using crowdsourcing for accuracy evaluations. Unless of course someone presents evidence that crowdsourced accuracy evals correlate well with accuracy evals by domain experts who take the time to do careful fact checking.
