A metric-based evaluation gives an NLG system a score by computing how similar its output text is to “gold-standard” reference texts. There are a number of different metrics (including BLEU, METEOR, and ROUGE), which are based on different scoring functions.
I am not a great fan of metric-based evaluation, for reasons I explain below, and would be very dubious if, for example, I were asked to review a paper on NLG which only presented a metric-based evaluation. Nevertheless, I also give some advice below on best practice for such evaluations.
Why I am Dubious About Metric-Based Evaluation
I have written about this in other blog entries, including Evaluation in Medicine and NLG/NLP and Types of NLG Evaluation: Which is Right for Me?. But needless to say, I won’t pass up an opportunity to express my views once more…
Evaluation is a form of hypothesis testing. In NLG, we are usually interested in hypotheses about utility (eg, the NLG system helps people do something), likability (eg, people love the texts generated by the NLG system), or readability (eg, texts produced by this algorithm are read quickly with good reading comprehension). I’ve also occasionally seen papers which test purely computational hypotheses (eg, this algorithm is fast) or cognitive modelling hypotheses (eg, this algorithm mimics what human speakers do).
Metrics are of interest if they can approximate or predict the above kinds of hypotheses. In other words, we ultimately care about whether an NLG system produces high-quality texts, not what its BLEU score is. BLEU (etc) scores are only valuable if they can reliably predict the result of testing the hypotheses we care about (utility, etc).
Metrics such as BLEU are analogous to surrogate endpoints in medicine, such as testing whether an AIDS medication reduces viral load instead of testing whether it leads to longer life or higher quality of life. What we really care about is whether the medication helps people live longer and better, but this is time-consuming to measure (since we have to wait until people die), so some studies simply check whether the medication reduces HIV viral load in the patient. Thus, viral load is a surrogate endpoint which is easy to measure and we believe usually predicts the difficult-to-measure “primary” endpoint which we are really interested in, such as longevity. Surrogate endpoints should predict and not just correlate with the primary endpoint, and there should be a medically plausible explanation for why they work (as is the case with viral load). Perhaps most importantly, medical researchers acknowledge that surrogates are not always accurate and it is essential to also have studies which directly measure the real endpoint (eg, mortality after 5 years) and verify the accuracy of the surrogate (eg, viral load).
From this perspective, one would expect that NLP metrics such as BLEU should be supported by strong validation studies which demonstrate that BLEU (etc) predicts the things we actually care about (such as utility and readability), and also clearly state the circumstances where BLEU is not a good predictor. Furthermore, the highest-quality papers should directly measure utility, etc, instead of relying on BLEU, as is the case in medicine.
So what is the reality? In NLG, the most thorough validation study I am aware of is Reiter and Belz 2009, which tried to correlate a number of metrics (including BLEU and ROUGE, but not METEOR) with the result of a laboratory human ratings evaluation (which is the weakest form of human evaluation). The correlations were not especially good, and at best suggest using BLEU for quick feedback to developers (which in fact is what it was originally proposed for), not for proper hypothesis-testing. As far as I know, other validation studies of metrics in NLG (such as Belz and Gatt 2008) have also failed to demonstrate the kind of correlation and indeed predictive accuracy we would expect to see in a surrogate endpoint.
Hence we do not have good evidence that metrics predict the outcomes we care about. Indeed the evidence so far, at least in NLG, suggests that at best metrics weakly predict or correlate with utility, readability, etc; and at worst they are completely uncorrelated with these things. Since we don’t have such evidence, we should not use metrics to evaluate NLG systems. I am not an expert on machine translation or summarisation, but I have looked for evidence of metric validity in these fields, and have not been impressed by what I have found.
OK, But How Should I Perform a Metric Evaluation?
But what if someone really does want to perform a metric-based evaluation? What advice can I give on doing this in the “least bad” manner?
Do a human-based study as well
If you want to use metrics, I would strongly advise doing at least a small human-based evaluation as well. If the human-based evaluation agrees with the metric-based evaluation, then this gives readers more confidence that the metric-based evaluation is meaningful. And if the human-based evaluation does not agree with the metric-based evaluation, then you should understand why this is the case before you publish your results.
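One rough way to check whether the two evaluations agree is to see whether they rank the systems in the same order. The sketch below computes a Spearman rank correlation in plain Python. The system names, metric scores, and human ratings are invented for illustration, and with only a handful of systems a correlation like this is suggestive at best.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical results for four systems (not real data).
metric_scores = {"A": 0.42, "B": 0.35, "C": 0.31, "D": 0.28}
human_ratings = {"A": 3.1, "B": 4.0, "C": 3.6, "D": 2.5}

systems = sorted(metric_scores)
rho = spearman([metric_scores[s] for s in systems],
               [human_ratings[s] for s in systems])
print(f"Spearman correlation: {rho:.2f}")  # prints 0.40 for this toy data
```

In this invented example the metric puts system A first while the human raters prefer B and C, which is exactly the kind of disagreement you would want to understand before publishing.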
Compare systems built with similar technologies
Experience in machine translation shows that BLEU is biased towards statistical phrase-based systems and against rule-based systems. In other words, if we take a statistical MT system and a rule-based MT system which produce texts of similar utility, readability, etc when assessed in human evaluations, then the statistical MT system will almost certainly get a significantly higher BLEU score than the rule-based system.
BLEU’s bias against rule-based MT systems at least is well known and acknowledged. What concerns me is that BLEU probably has other biases as well, which we don’t understand and acknowledge; the situation is similar with other metrics. Until we do have a good understanding of these biases, we should minimise their impact by only using metrics to compare systems built with similar technologies.
In particular, we should never use metrics to evaluate completely new approaches to MT or NLG, because biases are likely, and we have no idea a priori what these biases are.
Only use metrics to measure average-case performance
Metrics are intended to measure how a system works on average, which is fine if this is what we care about. However, in many cases we are also concerned about worst-case performance; ie, we want to guarantee to our users that our system will perform at a certain level in all cases. In Babytalk, for example, we generated texts summarising clinical data about premature babies in intensive care. In this project, we wanted the texts to be useful in the average case (ie, improve care), but it was also essential that the texts never be harmful (ie, damage care).
All metrics I have seen are useless at evaluating worst-case behaviour, which means they are useless at guaranteeing that the system always performs at a certain quality level. So if you care about worst-case behaviour, you will need to evaluate this using a different technique.
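A toy illustration of why an average hides the worst case: the per-text quality judgements below are invented (imagine they came from expert assessment rather than from a metric), and the threshold of 0.5 is an arbitrary stand-in for “potentially harmful”.

```python
from statistics import mean

# Hypothetical per-text quality judgements on a 0-1 scale (not real data).
scores_system_a = [0.9, 0.9, 0.9, 0.9, 0.1]  # usually excellent, one very bad text
scores_system_b = [0.7, 0.7, 0.7, 0.7, 0.7]  # consistently acceptable

def summarise(scores, threshold):
    """Report average-case and worst-case views of the same scores."""
    return {
        "mean": mean(scores),
        "worst": min(scores),
        "below_threshold": sum(1 for s in scores if s < threshold),
    }

for name, scores in [("A", scores_system_a), ("B", scores_system_b)]:
    summary = summarise(scores, threshold=0.5)
    print(f"System {name}: mean={summary['mean']:.2f}, "
          f"worst={summary['worst']:.2f}, "
          f"texts below threshold={summary['below_threshold']}")
```

System A wins on the average, but it is the only one that produces a text below the threshold. A single averaged score would not reveal this.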
Multiple high-quality reference texts
All metrics work by comparing an NLG text to one or more reference texts for the same data (scenario). The closer the NLG text is to the reference texts, the higher it will score. What this means is that we need high-quality reference texts, and also we should have several alternative reference texts, since there are usually many ways of expressing information in a text.
To take a really simple example, assume we are producing weather forecasts, and the data shows that over the course of a day the temperature rises from 10C at 0000 to 20C at 1200 and then falls back to 10C by 0000 the next day. There are many ways in which this information could be communicated, including
- “temperature rising from 10 to 20 at noon, and then falling back to 10”
- “maximum temperature of 20 at noon”
- “nice day with a noontime high of 20, but chilly in the evenings”
If all of these are acceptable, then we need to include them all as reference texts. If we just include the first, and the system generates the third, then the metric will give a poor score because text (3) is very different from text (1). If text (3) is acceptable (which is the case in many contexts), then the metric is giving the wrong answer. Hence our reference texts need to cover the spectrum of acceptable texts.
The reference texts also need to be accurate and well-written. For example, we don’t want “temperature is pretty boring today” as a reference text for this data set, because it is (probably) inappropriate. But if we get reference texts from Mechanical Turk (for example), this kind of thing is a real possibility.
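To make this concrete, here is a minimal sketch that scores the third forecast against different sets of reference texts. It uses clipped unigram precision, which is only one ingredient of BLEU and not the full metric, and the tokeniser is a deliberately crude assumption (lowercase, strip commas, split on spaces).

```python
from collections import Counter

def tokens(text):
    """Crude tokeniser for illustration: lowercase, drop commas, split on spaces."""
    return text.lower().replace(",", "").split()

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words that also appear in some reference.

    Each word is credited at most as many times as it occurs in the
    single reference where it is most frequent.
    """
    candidate_counts = Counter(tokens(candidate))
    max_reference_counts = Counter()
    for reference in references:
        for word, count in Counter(tokens(reference)).items():
            max_reference_counts[word] = max(max_reference_counts[word], count)
    matched = sum(min(count, max_reference_counts[word])
                  for word, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return matched / total if total else 0.0

text1 = "temperature rising from 10 to 20 at noon, and then falling back to 10"
text2 = "maximum temperature of 20 at noon"
text3 = "nice day with a noontime high of 20, but chilly in the evenings"

# Suppose the system generates text (3).
print(f"{clipped_unigram_precision(text3, [text1]):.2f}")                # 0.08
print(f"{clipped_unigram_precision(text3, [text1, text2]):.2f}")         # 0.15
print(f"{clipped_unigram_precision(text3, [text1, text2, text3]):.2f}")  # 1.00
```

With only text (1) as a reference, the single shared word is “20”, so an acceptable forecast scores close to zero. The score depends heavily on which acceptable texts happen to be in the reference set.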
Sometimes we get reference texts from existing corpora, for example we might use historical human-written weather forecasts as reference texts. The problem with this is that some human-written weather forecasts are pretty bad. So if historical texts are used as reference texts, I strongly recommend vetting them first to ensure they are of high quality.
In short, if we want to use metrics, we should create a high-quality set of reference texts, with multiple texts for every scenario so we cover different acceptable ways of communicating the information. Of course, creating such reference texts is an expensive process, especially if we use domain subject matter experts (eg, meteorologists) instead of Turkers, students, or colleagues.