Real-world utility is based on many things

A few weeks ago I gave a talk on High Quality Human Evaluations of NLG at a workshop (see also related blog). One of the points I tried to make is that what we want to know is how useful NLG systems are in real-world settings, and that real-world utility depends on a range of different factors. In other words, even if we have good techniques for measuring the fluency and accuracy of a generated text, this is not sufficient to measure real-world utility.

Example: Summarising a Medical Consultation

Let me give a concrete example, which is based on the work of Francesco Moramarco, one of my PhD students. Francesco is trying to evaluate a summarisation system which generates a written summary of a consultation between a doctor (GP) and a patient; this summary could then be added to the patient record and perhaps shown to the patient. Below is an example from one of Francesco’s papers:

Consultation (input)

Doctor: Hello? Good morning, Tim. Um, how can I help you this morning?

Patient: Um, so I’m having some, some pain, uh, in my tummy, like the lower part of my tummy. Um and I’ve just been feeling, quite, hot and sweaty.

Doctor: OK. Right, I’m sorry to hear that. When, when did your symptoms all start?

Patient: About two days ago.

Summary (output)

Two days of lower abdominal pain.

Because accuracy is of paramount importance in medicine, the summaries must be checked and post-edited by the doctor before they are saved into the medical record. The usefulness of the system is thus largely based on the post-editing process, including:

  • How long does it take a doctor to post-edit the summary? Doctors’ time is expensive, so we want post-editing to be quick.
  • Does post-editing distract the doctor or otherwise interfere with his/her workflow? If it requires a lot of cognitive effort to post-edit the summary, this could disrupt workflow.
  • Is the post-edited text complete and accurate?
  • Do doctors like using the system? If they don’t, then success in real-world deployment is unlikely.

Also, in this context we need to understand the distribution (especially worst-case behaviour) as well as averages. We know people differ widely in the time taken to post-edit (paper), because some people just fix major problems while others rewrite texts more comprehensively. So if most doctors post-edit quickly but it takes a few doctors a long time to post-edit, this is important. We also need to be confident that serious hallucinations or omissions are very rare.
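To make the point about averages versus worst case concrete, here is a minimal Python sketch. The function name `summarise_post_edit_times` and the timing data are hypothetical, invented purely for illustration; they are not taken from Francesco's studies. The sketch reports tail statistics (90th percentile and maximum) alongside the mean and median.

```python
from statistics import mean, median


def summarise_post_edit_times(seconds):
    """Return average-case and tail statistics for post-edit times (in seconds)."""
    ordered = sorted(seconds)
    n = len(ordered)
    # Nearest-rank 90th percentile: the smallest observed value such that
    # at least 90% of the observations are at or below it.
    rank = (9 * n + 9) // 10  # integer form of ceil(0.9 * n)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical post-edit times for ten doctors (made-up numbers).
times = [30, 35, 40, 42, 45, 48, 50, 55, 60, 300]
summary = summarise_post_edit_times(times)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these made-up numbers, nine doctors finish in a minute or less, but one takes five minutes. The mean and median do not reveal that outlier, which is why the maximum (or a high percentile computed over a larger sample) should be reported as well.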

Now, most existing evaluations of summarisation systems attempt to judge the quality of the generated summary. This is true for human evaluations where Turkers give Likert ratings to texts as well as for evaluations based on ROUGE and other metrics. The quality of the generated summary is likely to influence the things I mentioned above (such as the effort required to post-edit), but so do other things, including the user interface used for post-editing. And of course existing evaluation techniques focus on average-case performance, not on the worst case.

In other words, many of the things we want to measure when evaluating the consultation summariser in real-world usage are different from what is measured in most academic evaluations. There may be a link and correlation between the two, but if we want to use ROUGE-like metrics (or Likert ratings from Turkers) to predict real-world utility in this use case, we should get concrete data about real-world utility (including things like post-edit time) and measure how well this correlates with ROUGE (and Turker ratings). We cannot just assume this correlation exists; we need to demonstrate it empirically.
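As a rough illustration of what "measure how well this correlates" could look like, the sketch below computes a Spearman rank correlation between a metric score and post-edit time for the same set of summaries. Spearman is my choice for the sketch, not something prescribed by the studies mentioned here, and the scores and times are made-up placeholders. The helper names (`average_ranks`, `pearson`, `spearman`) are likewise hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-summary data (made up): metric score and post-edit time.
rouge = [0.62, 0.48, 0.71, 0.35, 0.55]
post_edit_seconds = [40, 75, 30, 120, 60]
rho = spearman(rouge, post_edit_seconds)
print(f"Spearman rho: {rho:.2f}")
```

In this toy data a higher score always goes with a shorter post-edit time, so the rank correlation is perfectly negative. Real data will be far noisier, and a small sample like this would not justify any conclusion. The point is only that the correlation is something we compute from observed post-edit behaviour, not something we assume.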

Other things which influence real-world success and utility

The above example is representative (at least in my experience) in the sense that when we deploy an NLG system in real production usage, success is usually influenced by a number of factors, many of which are specific to the use case. Some of the other factors which I have seen are:

  • Response time: is a text generated quickly?
  • Brand fidelity: does a generated text conform to and support a corporate brand?
  • Control: does the NLG system reduce the user’s sense of control over what he is doing?
  • Risk: is there a risk (perceived or real) that the NLG system will produce a text that does real damage (injures people, leads to bad publicity, opens the door to lawsuits)?

I could easily expand this list, not least by adding issues related to change management; interested readers can also look at my summary of the INLG2021 industrial panel.

What should academics measure?

I am not saying that academics should routinely try to measure the above factors when evaluating systems! But they need to be aware of them, and the field would greatly benefit from some “high quality human evaluations” which investigate the above and assess how well simple evaluations (metrics and Turkers) correlate with and predict these factors. Francesco (my student) certainly hopes to do such studies, and I encourage other researchers to do likewise!
