The BLEU metric was introduced in 2002. In the 18 years since, Machine Translation (MT) and other aspects of NLP have changed radically; MT systems in 2020 work differently from MT systems in 2002, and produce much better translations. But MT evaluation has not changed; people still use the 18-year-old BLEU metric.
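For readers unfamiliar with how BLEU works, the core idea is clipped n-gram precision combined with a brevity penalty. Below is a simplified sketch of that idea, not the full metric: it uses a single reference, caps n-grams at bigrams rather than the standard 4-grams, and applies no smoothing. The function name `simple_bleu` is illustrative, not from any library.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    multiplied by a brevity penalty. Real BLEU uses 4-grams and
    multiple references; this sketch is for illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # "Clipped" counts: a hypothesis n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(simple_bleu(ref, ref))                        # identical sentences score 1.0
print(simple_bleu(ref, "the cat sat on a mat".split()))  # partial overlap scores between 0 and 1
```

Note that the score rewards surface n-gram overlap with the reference, which is exactly why BLEU's correlation with human judgements of translation quality has to be established empirically rather than assumed.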
So why has the technology evolved while evaluation techniques have not? Plenty of alternatives to BLEU have been proposed. WMT has a yearly “metrics” challenge where people propose new evaluation metrics, and every year many metrics are proposed which correlate better with human judgements than BLEU does. In recent years, WMT has also had a “quality estimation” track, which has produced many interesting ideas for evaluation. But outwith these tracks, everyone still uses BLEU.
Similarly, the ROUGE metric for summarisation was proposed in 2003, and it still dominates summarisation evaluation despite the fact that the evidence that it is meaningful is much weaker than for BLEU.
So why is the NLP research community so reluctant to change its evaluation techniques, even when new techniques seem to produce more meaningful results?
Science: Strong evidence base?
Perhaps we continue to use BLEU because there is a strong evidence base for its validity in predicting human judgements. My 2018 meta-analysis of BLEU’s validity summarised results from 284 correlations between BLEU and human evaluations, reported across 34 papers. This extensive evidence base allows us to use BLEU where it is appropriate and to avoid it where it is not. So perhaps researchers are reluctant to switch to different metrics and evaluation techniques because these other techniques lack this kind of evidence base?
If this were the reason for BLEU’s continued dominance, then the research community would need to discuss how to solve the “chicken-and-egg” problem: we don’t build up evidence of the validity of new metrics because people refuse to use them.
However, I personally don’t believe the existence of a good evidence base is why people continue to use BLEU. This is because (A) many researchers ignore the evidence base and use BLEU in contexts, such as NLG and German-English MT, where the evidence base says that BLEU should not be used; and (B) ROUGE is dominant in summarisation evaluation despite the fact that its supporting evidence base is much weaker than BLEU’s.
Gamesmanship: We want a scoring function, not evaluation of utility?
Perhaps the real reason for the continued use of BLEU is that the research community isn’t actually very interested in evaluating (or predicting) how useful an NLP system or model is. Instead, what it wants is a simple “scoring function” which enables researchers to publish endless papers about how their system does 1% better on a data set.
To put it crudely, perhaps NLP researchers focus on winning contests where they show that their model gets a better score than other models. They don’t really care whether the score is meaningful or not; they just take the scoring function as a given (like the scoring function in a computer game) and try to beat the other players in the contest.
If this is the case, then we would expect that whenever a new series of contests opens up, whatever scoring function is chosen for the initial contests will persist and become established as the proper way to keep score in such contests. Indeed, attempts to change the scoring function on the grounds that it is not “realistic” will be resisted by established players who have invested a lot of time and energy in understanding the original scoring function.
In other words, BLEU is used because it was the first plausible metric for MT, and researchers don’t want to change to other metrics because they understand and are comfortable with BLEU.
I suspect there is a lot of truth to this perspective. It certainly explains why it is so difficult to dislodge established metrics, even when new ones are proposed which are much better at predicting utility or human judgements.
Implications for Evaluation Research
I’m involved in a few different initiatives about evaluation, and I must admit that the above worries me. After all, there isn’t much point in developing great new evaluation techniques if the research community will refuse to use them!
Most of my efforts are focused on human evaluation, and I think people who conduct human evaluations are interested in new ideas and in understanding what constitutes best practice (eg, van der Lee et al 2019). But I am beginning to wonder if it is almost impossible to get the NLP community to use a new BLEU-like metric for an established task.