I recently wrote a paper on a structured review of the validity of BLEU, where I brought together evidence from previously published studies on how well BLEU correlates with human evaluations. One of my main conclusions was that BLEU was much better at evaluating MT systems than NLG systems. A few people have since asked me why I thought this was the case. Below are some thoughts; these are speculations rather than proven facts!
MT systems are getting better, but the output of a good MT system is still inferior to a human translation. NLG systems, in contrast, typically aim to produce texts of near-human, or even better-than-human, quality (e.g., Reiter et al. 2005). This is partially because there is little interest in using NLG to produce moderate-quality texts, since these can be generated using templates.
BLEU is based on comparing computer-generated texts to human-written “reference” texts, and assumes that the closer the computer text is to the reference text, the better. This assumption is clearly incorrect if the computer-generated texts are *better* than the human-written reference texts! More generally, I suspect that any metric which is based on comparing computer-generated texts to human-written texts will be dubious if the computer texts are of near-human or better-than-human quality.
Information can be expressed in many different ways by an NLG system. To take a very simple example, the sentences below are all acceptable ways of describing a “purchase” event:
Yesterday John bought a book at the bookstore.
John purchased a book at the bookstore yesterday.
The bookstore sold John a book on 1 July.
Even with this very simple message, we can vary the wording by changing modifier placement (“yesterday”), replacing words with synonyms (“bought” and “purchased”), changing temporal reference strategy (“yesterday” vs “1 July”), and paraphrasing (“John bought” vs “The bookstore sold”). Combining these choices, even this simple message can be expressed in dozens of ways, and a narrative which communicates ten messages can probably be expressed in thousands (millions?) of different ways.
This is a problem for BLEU, since it is effectively looking for matching n-grams in the generated and reference texts. Even if multiple reference texts are provided, they are unlikely to cover all or even most of the above variations.
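To make the n-gram matching concrete, here is a minimal sketch in Python that scores two of the paraphrases above against the first one, treated as the only reference. It computes clipped n-gram precision, which is just one ingredient of BLEU; the full metric also combines several n-gram orders and applies a brevity penalty. The function names and the crude whitespace tokenisation are my own assumptions for illustration, not taken from any particular BLEU implementation.

```python
from collections import Counter


def tokenize(text):
    # Crude tokenisation (an assumption): lowercase, drop full stops, split on spaces.
    return text.lower().replace(".", "").split()


def ngrams(tokens, n):
    # All contiguous n-grams in a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also occur in the reference,
    # with each n-gram's count clipped to its count in the reference.
    cand = Counter(ngrams(tokenize(candidate), n))
    ref = Counter(ngrams(tokenize(reference), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "Yesterday John bought a book at the bookstore."
candidates = [
    "John purchased a book at the bookstore yesterday.",
    "The bookstore sold John a book on 1 July.",
]

for cand in candidates:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(f"{cand!r}: unigram={p1:.2f}, bigram={p2:.2f}")
```

With this toy setup the second sentence matches 4 of its 7 bigrams and the third only 2 of its 8, even though all three sentences are equally acceptable descriptions of the same event.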
An obvious question is why this isn’t also an issue for MT; after all, there are many acceptable ways of translating a sentence. I don’t have a good answer to this, although I wonder if BLEU’s bias against rule-based systems is partially because their output is more variable than that of statistical/neural systems?
Variation to keep text interesting
In many contexts, human readers want texts to be varied; they do not want to see the same words and syntactic constructs repeated again and again. Hence varying the way information is communicated is appreciated by human readers, and increases their satisfaction; this is also standard advice to human writers. However, such variation *decreases* ratings from BLEU and other metrics, which tend to reward systems which are repetitive and use “preferred” wording and syntax 100% of the time.
I suspect this is a relatively minor issue compared to the previous ones, but I think it is interesting because it is a very clear example of a case where human preference is pretty much the opposite of BLEU’s preferences; systems that vary texts get higher human evaluation scores but lower BLEU scores.
Being very speculative, I suspect that MT systems have evolved to have good BLEU scores, since a good BLEU score is very important for research success in MT; I mean this in the Darwinian sense that approaches that provide good BLEU scores get more publications and funding than approaches with poor BLEU scores, regardless of their respective human evaluations. This may be one of the reasons why BLEU-human correlations for MT systems have increased over time. A good BLEU score has been much less important in NLG, so there has been less “evolutionary pressure” in NLG against approaches that lead to poor BLEU scores.
If readers have other suggestions as to why BLEU is poorly suited to evaluating NLG systems, please let me know (or add a comment to this blog); I’m very interested in knowing other people’s thoughts on this!