There are many ways of evaluating NLG systems. In this post I present the options at a high level, without going into experiment design details (I discuss such details in other posts). In particular, I distinguish between
- task-based (extrinsic) evaluation, where we measure the real-world impact of an NLG system
- human ratings, where we ask people to rate the usefulness, readability, etc of NLG texts
- metrics, where we compare NLG texts against reference texts
For task-based and human ratings evaluations, I also distinguish between
- real-world evaluations, where the NLG system is deployed and used in a real-world context
- controlled evaluations, where the NLG system is used in an artificial experimental context
This material is adapted from a talk I gave at the 2015 NLG Summer School (slides), and from a talk I presented at NAACL 2016. I focus here on evaluations for academic research; commercial evaluations are a separate topic which I hope to discuss in a future blog entry.
Real-World Task-Based (Extrinsic) Evaluation (detailed advice)
The “gold standard” of evaluation is to try a system out for real and see if it has the desired real-world effect. For example, we evaluated STOP, an NLG system which produced smoking-cessation letters, by recruiting 2553 smokers, sending 1/3 of them letters produced by STOP and the other 2/3 control letters, and then measuring how many people in each group managed to stop smoking. The result was disappointing, because the numbers showed that STOP letters were not more effective than control letters. But that’s science; you don’t always get the result you hoped for. We did report this honestly as a negative result, which I strongly encourage other people in this position to do.
This kind of study takes a lot of time and effort; the STOP evaluation took a year and a half to organise and execute, and cost around 75000 pounds. Partly this was because of the software engineering and ethical aspects of running software in the real world. Software run in a real-world context needs to be robust and to be integrated with data sources; if it is real-time, it needs to generate texts quickly. A system which people use in a real-world context also needs to be cleared from an ethical perspective. For example, STOP letters sometimes included advice about smoking-cessation techniques; is it possible that acting on this advice could hurt some people?
Real-world task/extrinsic studies also usually need lots of subjects because they are “noisy” in a statistical sense. For example, in a laboratory study we can insist that the subjects read texts in a quiet room where they can focus on what they are reading; whereas in a real-world study, some of the subjects may be quickly glancing at the letter while trying to deal with a screaming baby. Also, in most cases it is not possible to evaluate the impact of NLG and control texts produced from the same data. For example, if a smoker gave us data about his smoking, we could produce a STOP letter from this data, send it to him, and see if he managed to stop smoking; or we could produce a control letter, send it to him, and see if he stopped smoking. But we couldn’t do both, so we couldn’t directly compare STOP and control letters on the same subject and data.
In other words, when someone failed to stop smoking after getting a STOP letter, this could have been because the letter was rubbish, but it could also have been because there was no way this individual was going to stop smoking no matter what we told him, or because he didn’t even read the letter because his son doodled on it. Because of this “statistical noise”, we needed a lot of subjects in order to have a chance at seeing a statistically significant effect. This of course inflated costs and time scale.
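To get a rough feel for how this noise drives up subject numbers, here is a minimal sketch of a standard sample-size approximation for comparing two proportions (for example, quit rates in an NLG group and a control group). The function name, the equal-group-size assumption, and the example rates are all illustrative assumptions; they are not figures or methods from the STOP study.

```python
import math
from statistics import NormalDist


def subjects_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate subjects needed in EACH group for a two-sided
    two-proportion z-test, assuming equal group sizes (an assumption)."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_control + p_treatment) / 2          # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        )
    ) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical success rates, chosen only to illustrate the trend.
for control, treatment in [(0.03, 0.07), (0.03, 0.05), (0.03, 0.04)]:
    n = subjects_per_group(control, treatment)
    print(f"{control:.2f} vs {treatment:.2f}: about {n} subjects per group")
```

The required number grows roughly with the inverse square of the difference we hope to detect, so small real-world effects quickly demand very large studies.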
Laboratory Task-Based Evaluation
We can do a task-based evaluation in a controlled laboratory environment instead of in real-world usage. In other words, we can give a subject a text, ask him or her to perform a task, and measure how well the subject performs the task. For example, we evaluated the Babytalk BT-45 system (which generated textual summaries of clinical data about babies in neonatal intensive care) by asking doctors and nurses to look at a text (BT-45 or human-written) or visualisation of the data, and decide what action, if any, they would take. We then scored their responses against a “gold standard” (what top clinicians thought should be done in this context), and computed the overall likelihood of clinicians making the correct decision if they saw a BT-45 or control text/visualisation.
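The scoring step can be sketched very simply, assuming each scenario has a single correct action. This is a deliberate simplification for illustration; the scenario identifiers, condition labels, and actions below are made up, and the actual BT-45 scoring scheme is not described here.

```python
from collections import defaultdict


def accuracy_by_condition(responses, gold):
    """Fraction of correct decisions for each presentation condition.

    responses: iterable of (scenario_id, condition, chosen_action)
    gold: dict mapping scenario_id to the gold-standard action
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for scenario_id, condition, action in responses:
        total[condition] += 1
        if action == gold[scenario_id]:
            correct[condition] += 1
    return {condition: correct[condition] / total[condition] for condition in total}


# Hypothetical data, for illustration only.
gold = {"s1": "adjust ventilator", "s2": "no action"}
responses = [
    ("s1", "nlg_text", "adjust ventilator"),
    ("s2", "nlg_text", "no action"),
    ("s1", "visualisation", "no action"),
    ("s2", "visualisation", "no action"),
]
print(accuracy_by_condition(responses, gold))
```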
Because such studies are done in a laboratory setting, they are much quicker and easier than a real-world task evaluation. Software engineering and ethics are much simpler, and we can control the environment and get rid of distractions. And we can prepare both NLG and control presentations of each data set, and compare their effectiveness, which also reduces noise and hence the number of subjects needed.
The downside of the laboratory setting is that we lose ecological validity. In the BT-45 context, for example, asking a doctor to look at data about a patient in a quiet lab room, where the doctor just sees data summaries and can’t see or hear the patient, is very different from asking a doctor to look at data in a noisy hospital ward where he or she can see and hear the patient as well as look at the data.
Some people think ecological validity is very important, others do not. In the Babytalk context, for example, we worked with both psychologists and doctors; the doctors were much more concerned about ecological validity than the psychologists.
Real-World Human Ratings Evaluation (detailed advice)
In some cases it is difficult to directly measure whether an NLG system “works” and has the desired real-world effect, either because this is intrinsically difficult or because the above-mentioned statistical “noise” means that we need an impossibly large number of subjects in order to measure real-world effectiveness. In such cases, we can instead ask subjects to use the NLG system, and then fill out a questionnaire about the system. This is of course similar to the standard practice of asking users or customers to fill out “user experience” or “customer satisfaction” questionnaires after they use a product or service.
In an NLG context, we typically ask subjects to rate the usefulness, accuracy, and readability of generated texts (usually on a Likert scale), and also ask them for free-text comments on the system. The ratings are generally highlighted in evaluations, but often the free-text comments are the most informative result of this exercise.
The Babytalk BT-Nurse system, which generates nursing shift handover reports, was evaluated in this way, by both nurses who were finishing a shift and nurses who were starting a shift. The evaluation was fairly expensive and time-consuming because we had to deploy BT-Nurse in the hospital ward, which meant addressing many software engineering and ethical challenges (similar to those mentioned above for real-world task evaluations). The nurses gave the system reasonable but not stellar ratings, and their free-text comments highlighted many aspects of the system which needed to be improved, and also (more encouragingly) described some situations where BT-Nurse texts were very helpful to the nurses.
A human-ratings evaluation is certainly less rigorous and meaningful than a task-based evaluation. In particular, we need to keep in mind that human ratings do not necessarily correlate with effectiveness; for example, a predecessor study to Babytalk showed that doctors preferred visualisations even though they made better decisions from textual summaries. But on the other hand, human ratings studies can give us broader insights about a system (especially if we look at free-text comments as well as ratings) which would not be revealed by a task-based evaluation which was focused on testing a small number of specific hypotheses. And in some contexts a ratings study is the only type of human evaluation which we can carry out.
Laboratory Human Ratings Evaluation (detailed advice)
The most common way of evaluating NLG systems is to ask people to try them out in an artificial laboratory context, and then ask them to rate the system. This is the quickest and cheapest way of evaluating an NLG system with human subjects. Unfortunately it is also the least meaningful. There is (as always) a tradeoff between cost/time and rigour!
Although a laboratory human ratings study is not ideal, there are many cases where it is the only option. In particular, this kind of evaluation makes sense if a system cannot be deployed in a real-world context because of ethical concerns or because the necessary software engineering is impractical; and if we cannot conduct a task-based evaluation because of statistical noise or the difficulty of measuring outcome. On the other hand, though, I certainly have seen cases where with a bit of effort a task-based evaluation or a real-world ratings evaluation could have been carried out instead of a lab-based ratings evaluation. I encourage anyone who is considering doing a laboratory human ratings evaluation to try to design alternative evaluations as a thought exercise, and assess whether such evaluations are in fact infeasible.
Metric Evaluations (detailed advice)
In a metric evaluation, we don’t ask humans to read the generated texts; instead we compare the generated texts against a collection of “gold-standard” reference texts. The comparison can be done using a variety of different metrics, including BLEU, METEOR, and ROUGE. Evaluations of this kind are very common in other areas of NLP, including machine translation and document summarisation. Clearly decent results require high-quality reference texts, and best practice is to provide several reference texts for each input data set (scenario), since there are usually many acceptable ways of translating, summarising, or generating texts.
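To make the comparison concrete, here is a minimal sketch of clipped n-gram precision against several reference texts, which is the core ingredient of BLEU-style metrics. It is not full BLEU (there is no brevity penalty and no combination across n-gram orders), and the whitespace tokenisation, function names, and example sentences are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in some reference,
    counting each n-gram at most as often as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it occurs in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(
        min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())


# Hypothetical weather-forecast texts, for illustration only.
references = ["rain later today", "showers expected this afternoon"]
print(clipped_precision("rain expected later today", references, n=1))
print(clipped_precision("rain expected later today", references, n=2))
```

Having several references matters here: a candidate n-gram is credited if it matches any of them, so acceptable variation in wording is penalised less.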
I have serious doubts about the validity of metric-based evaluations in NLG. Such evaluations are attractive because they are very cheap to carry out once the necessary material (eg, reference texts) has been assembled. In particular, metrics are by far the cheapest, quickest, and easiest way to evaluate systems entered into shared-task evaluation challenges, which are very common in NLP. But it is unclear how well the results of such evaluations predict or correlate with human evaluations.
In other words, ultimately no one cares whether a weather-forecast generator (for example) generates texts that get a high BLEU score; what we care about is whether the generator produces useful, accurate, and readable texts that genuinely help forecast readers. So we are only interested in BLEU scores if we believe that BLEU score predicts usefulness; ie, if we believe that a system that gets a good BLEU score will generate better texts than a system that gets poor BLEU scores. If this correlation does not exist, then BLEU scores are meaningless (I also discuss this issue in the “Rigour” section of an earlier blog post).
So do BLEU, ROUGE, METEOR (etc) correlate with the kind of human evaluations I discuss above? There is some evidence that there may be a weak correlation, which might justify the use of these metrics to give quick initial feedback to developers, in order to help them improve their system (which is what BLEU was originally intended for). But I am not aware of any evidence that there is a strong and robust correlation which is immune to people “gaming” the metrics.
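One simple way to check such a correlation is to score a set of systems (or texts) with both the metric and a human evaluation, and compute a rank correlation between the two. Below is a minimal sketch of Spearman's rank correlation using only the standard library; the function names and the example scores are hypothetical, and a real validation study would also need confidence intervals and far more data points.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (sx * sy)


# Hypothetical scores for five systems, for illustration only.
metric_scores = [0.21, 0.35, 0.30, 0.42, 0.18]
human_ratings = [3.1, 3.0, 3.8, 3.9, 2.5]
print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.2f}")
```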
Hence I currently regard evaluations of NLG systems based on BLEU, ROUGE, and METEOR to be meaningless, and when reading or reviewing papers I ignore such evaluations. I will happily reconsider this if someone provides me with good “validation” evidence that a metric does indeed correlate with actual real-world utility.
How Should I Evaluate My NLG System?
So, what is the best way to evaluate an NLG system? There is a clear preference order:
- (Best) Real-World Task-Based (Extrinsic)
- (Good) Laboratory Task-Based or Real-World Human Ratings
- (OK) Laboratory Human Ratings
- (Worst) Metrics
So you should perform the best evaluation which you can!
- If you can feasibly carry out a real-world task-based evaluation, do so.
- Otherwise, if you can feasibly carry out a real-world human ratings evaluation, do so.
- Otherwise, if you can feasibly carry out a laboratory task-based evaluation, do so.
- Otherwise carry out a laboratory human ratings evaluation.
- Don’t evaluate purely based on metrics unless you absolutely have no alternative.
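The checklist above can be written down as a small decision function. This is only a sketch of the ordering given in the list; the function and parameter names are hypothetical, and deciding what counts as "feasible" is of course the hard part.

```python
def choose_evaluation(
    real_world_task_feasible,
    real_world_ratings_feasible,
    lab_task_feasible,
    lab_ratings_feasible,
):
    """Return the first feasible evaluation type, in the order of the checklist."""
    if real_world_task_feasible:
        return "real-world task-based"
    if real_world_ratings_feasible:
        return "real-world human ratings"
    if lab_task_feasible:
        return "laboratory task-based"
    if lab_ratings_feasible:
        return "laboratory human ratings"
    # Only reached when no human evaluation is possible at all.
    return "metrics only (last resort)"


# Example: deployment is impractical, but a lab task study is possible.
print(choose_evaluation(False, False, True, True))
```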
Incidentally, while I personally believe that a real-world human-ratings evaluation is a bit better than a laboratory task-based evaluation, there are people I respect who take the contrary view, that a laboratory task-based evaluation is better than a real-world human-ratings evaluation. Good arguments can be made for both viewpoints.