I recently wrote a paper presenting a structured review of the validity of BLEU, in which I reviewed and summarised previously published “validation” studies which measured how well BLEU correlated with human evaluations of NLP systems (or texts). In such studies, a number of NLP systems (or individual texts generated by NLP systems) are evaluated both using a metric and a “gold standard” human evaluation, and the correlation between the metric and the human evaluation is published. A high correlation gives us confidence that the metric is meaningful.
It was clear from my review that some validation studies were better and more meaningful than others. In this blog I look at some of the criteria that distinguish a good validation study from a poor one. The goal is to educate both people who do validation studies (so they do them better) and people who read validation studies (so that they can distinguish the good ones from the bad ones).
I realise this blog is quite technical in places; feel free to contact me if you have questions!
Output of NLP Systems: The study should assess correlations between metric and human evaluations of NLP systems. Some papers, especially in the image captioning community, assess the correlation between metric and human evaluations of human-written texts (often from a crowd-sourcing platform such as Mechanical Turk). This is not good practice, since it’s not clear that correlations (between metrics and human evaluators) on human-written texts will be the same as correlations on texts produced by NLP systems. Sometimes validation studies are based on a mix of computer-generated and human-written texts (ie, human texts are one of the “systems” being evaluated). This isn’t ideal, but is probably acceptable provided that most of the texts being evaluated are computer generated.
Varied Systems: We know that BLEU is biased against some technologies (eg, rule-based systems), and it seems likely that other metrics are biased as well. For this reason, it’s really useful if the NLP systems being evaluated use a variety of different technologies. If a validation study only looks at LSTM systems, for example, then its validity results only apply to evaluations of LSTM systems.
Size: The study needs to be large enough to show statistically significant results. At an absolute minimum, you should have at least 5 things being correlated (because 5 data points are the minimum needed to produce a statistically significant Spearman correlation). Ie, at least 5 NLP systems if the correlation is at the system level, or at least 5 different texts if the correlation is at the text level. Of course, more than 5 is better!
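To see why five is the floor, here is a small stdlib-only sketch (the function names are my own) which computes the exact permutation distribution of Spearman’s rho. Even a perfect monotonic relationship cannot reach p < 0.05 with four data points, but it just can with five:

```python
from itertools import permutations
from math import factorial

def spearman_rho(x, y):
    """Spearman's rho for two sequences of distinct ranks."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def exact_min_p(n):
    """Smallest achievable two-sided p-value with n points:
    the exact permutation p-value of a perfect monotonic relationship."""
    base = list(range(1, n + 1))
    extreme = sum(1 for perm in permutations(base)
                  if abs(spearman_rho(base, perm)) >= 1 - 1e-9)
    return extreme / factorial(n)

print(exact_min_p(4))  # 0.0833... -- cannot reach p < 0.05
print(exact_min_p(5))  # 0.0166... -- just below 0.05
```

Only the identity and reversed orderings give |rho| = 1, so the minimum p-value is 2/n!, which first drops below 0.05 at n = 5.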
Good human evaluations: The human evaluation in a validation study is supposed to be a high-quality “gold standard” evaluation. One of my biggest frustrations with existing validation studies is that the great majority use human evaluations based on human ratings in artificial context, which is the weakest type of human evaluation. I don’t think I’ve ever seen a validation study which used extrinsic (task) performance in real-world context, which is the strongest type of human evaluation. And even ignoring this issue, many of the human evaluations are not well executed (see my recommendations on how to do human ratings evaluations).
Correlation: There are many ways of assessing how well metrics agree with human studies. Regardless of the extrinsic merits of these techniques, the “standard practice” in the field is to assess agreement with some type of correlation (Pearson, Spearman, Kendall). So if you publish a validation study, please include one of these correlations. If you think there is a better way of assessing agreement, you can include this as well, but this should be in addition to (not instead of) a correlation. I personally prefer Spearman correlation, but other people prefer Pearson or Kendall; good arguments can be made for all of these (and of course it depends in part on the experimental design).
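As a concrete illustration, all three correlations are one-liners in scipy; the system-level scores below are made-up numbers for five hypothetical NLP systems, purely for illustration:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Illustrative (invented) system-level scores:
# a metric's score and a mean human rating for five NLP systems.
metric = [0.42, 0.35, 0.61, 0.55, 0.70]
human  = [3.1,  2.8,  4.0,  4.1,  4.4]

r, p_r = pearsonr(metric, human)
rho, p_rho = spearmanr(metric, human)
tau, p_tau = kendalltau(metric, human)
print(f"Pearson {r:.3f} (p={p_r:.3f}), "
      f"Spearman {rho:.3f} (p={p_rho:.3f}), "
      f"Kendall {tau:.3f} (p={p_tau:.3f})")
```

Note that each call also returns the p-value, which is worth reporting alongside the coefficient.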
Inter-annotator agreement: It is really useful to report inter-annotator agreement in the human evaluation, if possible. In other words, if multiple people evaluated the same text, how consistent were their evaluations? This is usually done by presenting a kappa score, although again this depends on the experimental design; eg if we are measuring a numerical outcome such as reading speed, it probably makes more sense to report standard deviation.
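For categorical judgements, Cohen’s kappa is straightforward to compute. A minimal stdlib sketch (the helper function is my own, not from any particular library):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items the annotators label the same.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating five texts as "good" (1) or "bad" (0)
print(cohen_kappa([1, 1, 0, 1, 0], [1, 1, 0, 0, 0]))  # ~0.615
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a simple percent-agreement figure.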
Statistical significance: I strongly encourage people to report statistical significance of correlations. Almost all statistical packages produce a statistical significance value as well as a correlation coefficient; please at least report this. If you are comparing different metrics to see which correlates best with human judgement, Graham and Baldwin recommend using a Williams test.
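For completeness, here is a sketch of the Williams test as I understand it from Graham and Baldwin’s description (the function and variable names are my own); it tests whether metric 1’s correlation with the human evaluation (r12) is significantly higher than metric 2’s (r13), taking into account how strongly the two metrics correlate with each other (r23):

```python
import math
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    """One-sided Williams test: is r12 significantly greater than r13?
    r12, r13: each metric's correlation with the human evaluation;
    r23: correlation between the two metrics; n: number of data points."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar2 = ((r12 + r13) / 2) ** 2
    num = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    den = math.sqrt(2 * K * (n - 1) / (n - 3) + rbar2 * (1 - r23) ** 3)
    t_stat = num / den
    return t_stat, t_dist.sf(t_stat, n - 3)  # p from t distribution, n-3 df

# Illustrative (invented) correlations for two metrics over 20 data points
t_stat, p = williams_test(0.9, 0.8, 0.85, n=20)
print(f"t = {t_stat:.3f}, one-sided p = {p:.3f}")
```

The key point is that the two correlations being compared are not independent (both metrics score the same texts), which is exactly what the r23 term accounts for; a naive comparison of two independent correlations would be the wrong test here.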
Clear reporting of material, procedure, analysis: You should clearly report material, procedure, and analysis in your writeup. This should include everything mentioned above, and also everything mentioned in Figure 3 of my review paper. And please give numbers, don’t just show graphs! If you don’t have enough space in your paper to describe the above, then write and archive (eg on arXiv) a technical report which gives these details.
Archive data: Please create and publish an archive of the data from your validation study. This will really help people (like me) who want to do meta-analyses of validation experiments.
Originality: A fundamental rule of meta-analyses (including structured surveys) is that results should not be counted twice. So if your paper in places re-presents results published in earlier work, please make this very clear! Likewise if your paper presents an improved or extended analysis of data which has already been presented elsewhere, please make this very clear.
The validation studies conducted as part of WMT, such as Bojar et al 2016, are generally very good, and meet almost all of the above criteria.
An example of a recent validation study which is more problematic (it only meets around half of the above criteria) is Sharma et al 2017 (which was not in my survey, since it is not in the ACL Anthology).