I was very impressed by a paper we recently read in our reading group, which showed that small differences in BLEU scores for MT usually dont mean anything. Since lots of academic papers justify a new model on the basis of such small differences, this is a real problem for NLP.
NLP technology has changed and advanced over the past two decades, but it often seems that NLG evaluation has not. Why is the 18-year old BLEU metric still so dominant?
I was recently asked by someone if it was possible to easily determine whether an NLP system was good enough for a specific use case. Currently this is very hard. Making it easy could be a “grand challenge” for evaluation!
Some musings on principled and theoretically sound techniques for automatically evaluating NLG systems.
My advice on how to perform a high-quality validation study, which assesses whether a metric (such as BLEU) correlates well with human evaluations.
My structured survey of BLEU suggests that BLEU-human correlations are worse in German than in many other languages. But there are many caveats, so we need to be cautious in interpreting this result.
The correlation between BLEU and human evaluations of MT systems seems to be increasing over time. Since BLEU has not changed, how is this possible, and what does it mean?
I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.