I recently wrote a paper presenting a structured review of the validity of BLEU, which brought together evidence from previously published studies on how well BLEU correlates with human evaluations. In my first draft of this paper, I included an analysis of how well BLEU correlated with human evaluations in different languages. By this I mean the language the output texts are written in; i.e., the target language (not the source language) in an MT context. Note that while BLEU was originally developed for English, it is often used for other languages as well.
I ended up dropping this analysis from the published paper, because space was tight and the analysis comes with a lot of caveats. So instead I am presenting some thoughts on this topic in my blog.
What does the data say?
I present below the median values for BLEU-human correlation in different languages, for system-level evaluations of MT systems (the setting in which BLEU performs best), in the validation studies I surveyed. I only report this for languages with at least 10 data points (BLEU-human correlations) in my survey.
Language   Median correlation   Num correlations in survey
Czech      0.85                 10
English    0.79                 107
French     0.76                 21
German     0.44                 23
Spanish    0.66                 19
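For readers who want to reproduce this kind of summary, the medians above are straightforward to compute from the raw correlation lists. Here is a minimal sketch in Python; the per-language correlation values are made-up placeholders, not the actual survey data points.

```python
from statistics import median

# Placeholder BLEU-human correlations per target language; the real
# survey data points would replace these made-up values.
correlations = {
    "Czech":  [0.91, 0.85, 0.78],
    "German": [0.30, 0.44, 0.58],
}

# Only summarise languages with enough data points (the survey used a
# threshold of 10; lowered here to fit the toy data).
MIN_POINTS = 3
for lang, vals in sorted(correlations.items()):
    if len(vals) >= MIN_POINTS:
        print(f"{lang}: median r = {median(vals):.2f} (n = {len(vals)})")
```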
The most striking finding is that BLEU-human correlations are much worse for German than for any other language. There are some caveats (see below), but this does suggest that BLEU should not be used for evaluating German texts produced by an NLP system.
It is tempting to interpret the above data as telling us something about the language itself; i.e., that there is something about German which makes it less suitable for BLEU than Czech, English, French, or Spanish. Perhaps, for example, BLEU does worse on German because German has relatively free word order compared to English and the Romance languages; this could hurt the reliability of BLEU, since BLEU is based on shared n-grams. But this explanation is weakened by Czech, which also has freer word order than English yet shows the highest median correlation in the table.
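The word-order argument is easy to illustrate with a toy example. The sketch below is not real BLEU (no clipping, smoothing, brevity penalty, or multiple n-gram orders); it just computes bigram precision, using a made-up pair of German sentences, to show that a legitimate reordering of the same words loses most of its n-gram overlap with the reference.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_precision(hypothesis, reference):
    """Fraction of hypothesis bigrams that also occur in the reference
    (a simplified stand-in for one component of BLEU)."""
    ref_bigrams = set(ngrams(reference.split(), 2))
    hyp_bigrams = ngrams(hypothesis.split(), 2)
    return sum(bg in ref_bigrams for bg in hyp_bigrams) / len(hyp_bigrams)

reference = "gestern habe ich das Buch gelesen"
# Same words, equally grammatical German word order:
reordered = "ich habe gestern das Buch gelesen"

print(bigram_precision(reference, reference))  # 1.0: identical order
print(bigram_precision(reordered, reference))  # 0.4: only 2 of 5 bigrams match
```

Both sentences mean "I read the book yesterday" and both are acceptable German, yet the reordered one keeps only two of its five bigrams, so an n-gram metric scores it much lower than a human would.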
However, we need to be careful, because there are confounding factors. The above basically says that the correlations between BLEU and human evaluations in the validation studies I surveyed were worse for German than for the other languages I mentioned. This is a key caveat, because different MT systems are used in the validation studies for different languages. For example, if we look at the results of WMT 15 (one of the papers in my survey), we can see visually in Figs 4 and 5 that the correlation between BLEU and human scores is indeed worse for English-German than for English-Czech, English-French, etc. But this is partially because the correlations cover different sets of systems. For example, there is a big mismatch between the human evaluation and the BLEU score for the ProMT rule-based translator, which is expected since we know that BLEU is biased against rule-based systems. But ProMT is included only in English-German (and English-Russian); it is not included in English-French or English-Czech. So perhaps one reason for the lower overall BLEU-human correlation for English-German in this paper is simply the presence of ProMT among the systems being evaluated.
Of course, this is just one paper. The problem only arises if validation studies for German systematically include more rule-based systems than validation studies for the other languages, which could happen if, for example, rule-based systems are more popular for MT into German. I don't know whether this is the case (my survey did not go into this level of detail).
There are other confounding factors as well. For example, the validation studies of translation into Czech are more recent (median year 2013) than the validation studies of translation into German (median year 2008). Since the overall correlation between BLEU and human evaluations has been increasing over time, could the median correlation for German be worse partially because those studies were done earlier?
Overall, it is clear that BLEU-human correlations are considerably worse for German than for Czech, English, French, or Spanish. But we don't know how much of this is intrinsic to the German language and how much is due to the above confounds. Perhaps a more detailed and comprehensive survey than mine could shed light on this issue.
Although qualitative linguistic analyses have fallen out of favour in the NLP community, I personally think such analyses could help here. In medicine, surrogate endpoints need a sound theoretical justification as well as high correlation with the primary outcome, in part because this gives a better understanding of when we expect the endpoints to work and when we think they may not. A comparable theoretical understanding of NLP's "surrogate endpoints" (BLEU and other such metrics) would help us understand where they should work and where they might not, for example for different languages.
Overall, the numbers are clear: BLEU-human correlations in my survey are worse for German than for many other languages. We do not know why this is the case; indeed, perhaps it is just due to the above confounding factors. But until this is clarified, BLEU should not be used to evaluate German texts.