My structured survey of BLEU suggests that BLEU-human correlations are worse in German than in many other languages. But there are many caveats, so we need to be cautious in interpreting this result.
The correlation between BLEU and human evaluations of MT systems seems to be increasing over time. Since BLEU has not changed, how is this possible, and what does it mean?
The first phase of my systematic review of BLEU shows that BLEU-human correlations are all over the place, and that none of the studies in my review have correlated BLEU with real-world utility or user satisfaction.
I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!