Last week we read an excellent paper in our reading group: Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics (https://www.aclweb.org/anthology/2020.acl-main.448.pdf), by Mathur, Baldwin, and Cohn. The paper makes a lot of good points, but the one that really struck me was that small differences in evaluation metrics such as BLEU are probably meaningless. This is striking because ACL and other “selective” and “prestigious” venues are happy to accept papers on the basis of small improvements in metric scores.
Only big differences in metric scores are meaningful in MT
Mathur et al. use data from WMT, a long-running annual machine translation event where (amongst other things) a bunch of MT systems are evaluated using both human evaluations and metrics (including BLEU). Overall, metric scores usually correlate reasonably well with human evaluations in MT (with some caveats), which supports the use of metrics as proxies for human evaluation in machine translation (not in NLG!!).
Anyway, the authors look at WMT data and point out that WMT evaluations include systems with very different quality levels. They then point out that the Pearson correlation used to compare human evaluations to metric evaluations is primarily driven by large differences and outliers. That is, as long as a metric such as BLEU can reliably distinguish MT systems which people think are excellent from MT systems which people think are dreadful, the metric will show a high Pearson correlation with human evaluations.
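This effect is easy to reproduce with simulated data. The sketch below (hypothetical numbers, not data from the paper or from WMT) scores a set of imaginary MT systems with a noisy metric: over the full range from dreadful to excellent the Pearson correlation looks impressive, but restricted to systems of similar quality it collapses.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)

# 200 hypothetical MT systems: "human" quality spread from dreadful (0)
# to excellent (100); the "metric" tracks quality plus substantial noise.
human = [random.uniform(0, 100) for _ in range(200)]
metric = [h + random.gauss(0, 8) for h in human]

r_all = pearson(human, metric)

# Restrict to systems of similar quality (human score 50-60), which is
# the regime academic comparisons usually live in.
close = [(h, m) for h, m in zip(human, metric) if 50 <= h <= 60]
hs, ms = zip(*close)
r_close = pearson(hs, ms)

print(f"all systems:     r = {r_all:.2f}")   # high
print(f"similar systems: r = {r_close:.2f}") # much lower
```

The same noisy metric produces both numbers; only the spread of system quality changes. A high headline correlation therefore tells you little about how reliable the metric is for the small gaps that papers typically report.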
However, in academic contexts we are usually interested in small differences in quality (e.g., is a proposed model slightly better than the state of the art), and Mathur et al. show that BLEU is **not** good at predicting the result of human evaluations when the difference in BLEU scores is small. They essentially compute how well differences in the BLEU scores of two systems predict differences in human evaluation, and conclude that
- If System A has a BLEU score that is 1-2 points higher than System B (common in academic papers), then there is only a 50% chance that human evaluators will prefer System A over System B.
- If System A has a BLEU score that is 3-5 points higher than System B, there is a 75% chance that human evaluators will prefer A over B.
- In order to get a 95% chance that human evaluators will prefer A over B, we need something like a 10-point improvement in BLEU (they don't state this; I am guessing it by eyeballing their graphs).
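The findings above can be codified as a rough rule of thumb. The thresholds and probabilities in this sketch are approximations eyeballed from Mathur et al.'s graphs (including my own guessed 10-point figure), not exact values from the paper, and gaps the paper does not cover (e.g. 5-10 BLEU points) are interpolated assumptions.

```python
def chance_humans_prefer(bleu_delta):
    """Very rough chance that human evaluators prefer the higher-BLEU
    system, given the BLEU gap between two MT systems.

    All thresholds/probabilities are eyeballed approximations of Mathur
    et al.'s WMT analysis, not exact figures from the paper.
    """
    d = abs(bleu_delta)
    if d < 3:
        return 0.50  # 1-2 BLEU points: essentially a coin flip
    if d < 10:
        return 0.75  # 3-5 points (5-10 is my interpolation): still far from certain
    return 0.95      # ~10+ points: a difference users would likely notice

print(chance_humans_prefer(1.5))  # 0.5
print(chance_humans_prefer(4))    # 0.75
print(chance_humans_prefer(12))   # 0.95
```

Seen this way, the typical 1-2 point improvement reported in papers carries no more evidence than a coin flip.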
Mathur et al. look at several other metrics as well, and find the same pattern. Across the board, a large difference in metric score between two systems is probably meaningful (i.e., if MT system A has a much higher metric score than MT system B, human evaluators will probably rate A higher than B), but a small difference is not.
Inappropriate use of BLEU and other metrics
The reason this is a problem is that a lot (most?) of academic papers in NLP justify the claim that a proposed model or algorithm is better than the state of the art on the basis of quite small differences in metric scores. It is very rare, at least in my experience, to see a paper which shows a 10-point improvement in BLEU over the state of the art, which (as above) seems to be what you need in order to be 95% confident that your proposed model would genuinely be seen by users as an improvement.
In short, there are contexts in machine translation where BLEU and other metrics can serve as plausible proxies for human evaluation. However, the typical academic use of metrics (as above) is ***NOT*** one of these contexts; it is not scientifically valid to claim that a new model is better than state-of-the-art because of a small difference in metric score.
Wish list for the future
What I would love to see in the future is the following.
1. All metrics are carefully characterised, so that we know when they reliably predict human evaluations and when they do not. In particular, there is clear guidance about how much of a difference in metric score is needed to give confidence that the systems being compared are truly different.
2. Researchers and paper authors only use metrics to justify claims when the criteria in (1) are met. Reviewers reject papers which make claims that are not justified under those criteria.
3. High-quality human evaluations are common, and indeed expected, for papers at top venues. In MT, the expectation is that such human evaluations will be at least as good as those done in WMT.
Perhaps I am an optimist, but I do think that we are slowly moving in the above direction. It will take time, but hopefully we will see real change over the next 5-10 years.