One of my projects this summer is to do a systematic review of the validity of BLEU. In other words, I am reviewing the research literature in a way which is supposed to be objective and repeatable, to identify and summarise studies which assess whether BLEU scores correlate with human evaluations.
I have now identified the studies and extracted key information from each study (using the protocol described in my earlier blog entry). The most obvious observation is that correlations between BLEU and human evaluations are all over the place. Some studies report a near-perfect correlation, other studies report a negative correlation (ie, higher BLEU scores are associated with worse human evaluations), and most are somewhere in between.
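For readers unfamiliar with how these numbers are produced: each study essentially ends up with paired BLEU and human scores (usually one pair per system) and computes a correlation coefficient over the pairs. A minimal sketch in Python, with invented scores for five hypothetical systems (not taken from any real study):

```python
# Sketch: system-level BLEU-human correlation, with invented numbers.
# In a real validation study, each data point is one system: its corpus
# BLEU score and its mean human rating over the same test set.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five MT systems.
bleu_scores   = [0.12, 0.18, 0.25, 0.31, 0.40]
human_ratings = [2.1, 2.4, 3.0, 2.9, 3.8]  # eg mean adequacy, 1-5 scale

print(f"BLEU-human Pearson r = {pearson(bleu_scores, human_ratings):.2f}")
```

Studies differ in whether they report Pearson or Spearman correlation, and over how many systems, which is itself one source of the variation.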
What Factors Influence BLEU-Human Correlation?
An obvious question is whether we can predict whether BLEU-human correlation will be good in a specific context. In general, if we look at studies in terms of systems and type of BLEU scoring, BLEU-human correlation seems to be higher when
The systems being evaluated
- are built with similar technology
- are quite different in quality (as assessed by humans)
- produce texts in English (instead of Chinese, Arabic, etc)
- operate in “everyday” domains such as news, not specialist technical domains such as biomedicine
- are MT systems (instead of NLG, dialogue, etc)
BLEU scoring is done
- based on complete systems, not individual texts
- using multiple reference texts
- with good tokenisation and other low-level processing
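To make the “complete systems” and “multiple reference texts” points concrete, below is a simplified sketch of corpus-level BLEU. It is unsmoothed and not a substitute for a standard tool such as sacrebleu, but it shows how n-gram counts are pooled over the whole test set (rather than scoring each text separately) and clipped against multiple references:

```python
# Simplified, unsmoothed corpus-level BLEU sketch; real work should use
# a standard implementation (eg sacrebleu), which also handles tokenisation.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses: list of token lists; references: list of lists of token lists
    (several references per hypothesis)."""
    clipped = [0] * max_n   # matched n-gram counts, pooled over the corpus
    total = [0] * max_n     # hypothesis n-gram counts, pooled over the corpus
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # closest reference length, for the brevity penalty
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # clip each n-gram count by its maximum count in any reference
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if any(c == 0 for c in clipped):
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

# Toy example: one hypothesis, two references.
hyps = ["the cat sat on the mat".split()]
refs = [["the cat sat on the mat".split(), "a cat was sitting on the mat".split()]]
print(round(corpus_bleu(hyps, refs), 2))  # exact match against one reference -> 1.0
```

Because n-gram counts are pooled before the geometric mean is taken, corpus BLEU is much less noisy than averaging per-sentence scores, which is one reason system-level correlations tend to be higher.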
The above is probably neither surprising nor novel, but it seems to be ignored in practice. Ie, I’ve seen plenty of papers that present BLEU scores in contexts which don’t match the above criteria.
Anyways, the above criteria on their own are nowhere near sufficient to explain the variation in BLEU-human correlations reported in the literature. I think the biggest influence on correlation is the design of the human study. In other words, some human studies correlate much better with BLEU than other human studies. Which again is not a new observation. To quote Coughlin (2003):
“One of the most interesting conclusions of our study, though, is that the superior linguistic skills of human raters are not exploited by MT evaluation tasks that involve quickly comparing a machine-translated sentence to a human-translated reference. Instead, human raters faced with the task of making quality judgments on technical and non-technical, out-of-context prose appear to rely on superficial, string-based criteria. In other words, they behave like expensive, slow versions of BLEU.”
In other words, if we structure the human evaluation so that the human evaluators are forced to evaluate texts in a BLEU-like fashion (“superficial string-based criteria” in Coughlin’s terminology), then we should see a good correlation with BLEU. But is such a human evaluation meaningful? Ie, ultimately we want to predict whether an NLP system will be useful in real-world contexts. Does a human evaluation carried out along the above lines (where evaluators are forced to use “superficial string-based criteria”) actually predict real-world utility? If not, then showing BLEU correlates with such studies doesn’t tell us anything. To quote a psychology textbook (Kaplan and Saccuzzo 2001), “a meaningless [test] which is well correlated with another meaningless [test] remains meaningless.”
No studies have looked at real-world utility or user-satisfaction!
Following this up, perhaps the most disappointing finding for me in my review is that **none** of the studies correlated BLEU score with real-world utility or user satisfaction. Most of the studies correlate BLEU scores with human ratings in artificial contexts, with a few exceptions looking at task performance in artificial contexts (usually the amount of effort required to post-edit computer output texts into texts of acceptable quality). But we know that ratings and even task performance in artificial contexts are not necessarily good predictors of real-world utility or user-satisfaction. And none of the studies in my review address this point, indeed few even acknowledge it.
To put this another way, if we want to test the hypothesis that BLEU predicts real-world utility or user satisfaction, then we need to correlate BLEU scores with measurements of real-world utility or satisfaction. We cannot test this hypothesis by correlating BLEU scores with ratings from Turkers who know nothing about the domain, context, etc; nor by correlating BLEU scores with the effort required to post-edit texts in artificial contexts, unless we have a priori evidence that Turker ratings or post-edit effort in artificial contexts correlate with real-world utility or user satisfaction. However, none of the studies in my review present or cite such evidence.
It is not easy, cheap, or quick to measure real-world utility or user satisfaction, but it can be done. For example, we could use A/B testing on websites which offer MT or NLG services (weather forecasts?), where different users are given access to different systems, and we measure how likely users are to use the service again, or indeed explicitly ask users to rate their satisfaction. If we want to assess utility/satisfaction in repeated professional users (instead of casual users), we could recruit a number of such users as subjects, ask each user to use the systems being evaluated for a non-trivial amount of time (perhaps every user tries each system for a month?), and measure both productivity and user satisfaction.
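A minimal sketch of the A/B setup described above, with an invented log format (user ids and visit counts); a real deployment would of course need proper experiment infrastructure and significance testing:

```python
# Sketch: deterministic A/B assignment plus a "did the user come back"
# metric. The log format (user_id, visit_count) is invented for illustration.
import hashlib

SYSTEMS = ["system_A", "system_B"]

def assign_arm(user_id: str) -> str:
    """Deterministically assign each user to one system, so a returning
    user always sees the same system."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return SYSTEMS[h % len(SYSTEMS)]

def return_rate(visits):
    """visits: list of (user_id, visit_count) pairs. Returns, per system,
    the fraction of users who came back after their first visit."""
    stats = {s: [0, 0] for s in SYSTEMS}  # [returned, total] per arm
    for user_id, count in visits:
        arm = assign_arm(user_id)
        stats[arm][1] += 1
        if count > 1:
            stats[arm][0] += 1
    return {s: (ret / tot if tot else 0.0) for s, (ret, tot) in stats.items()}
```

The point of the sketch is simply that the outcome measure here is real behaviour (does the user come back?), not a rating of text quality in an artificial setting.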
I suspect that only reasonable-quality systems can be tested this way, since a provider of NLP services will not want to put its reputation at risk by providing its customers with rubbish services as part of an A/B testing exercise, and professionals who rely on an NLP service may refuse to use a poor system. So we will need to filter out low-quality systems before we start the experiment, and also be ready to pull systems which we thought were OK, but proved problematic in practice. This is perhaps analogous to what happens in medicine, where a new drug has to clear many hurdles before it can undergo a full Phase 3 clinical trial, and also the trial will be stopped if serious patient safety concerns emerge.
The above means that a real-world utility/satisfaction evaluation probably can **not** be carried out as part of a shared task evaluation (such as WMT), since shared task evaluations need to be cheap, quick, and applicable to all submissions.
So, if we genuinely want to assess whether BLEU predicts real-world utility/satisfaction, we need to gather good data on real-world utility/satisfaction. And for a good validation study, we need data on a number of systems (5 at absolute minimum, 10 would be much better) which perform the same task (eg, French->English MT, or weather forecast generation). As above, gathering this data is going to require serious effort; but it is the only way to really answer the question of whether BLEU scores predict real-world utility and/or user satisfaction.