Study Design for Systematic Review of BLEU Validity: Comments Welcome!

[26-June: Archival Status added as an issue, at end of blog]

[20-June: Human Agreement added as an issue, at end of blog]

[16-June: Representative Systems added as an issue, at end of blog]

[14-June: Sentence Level Correlation added as an issue, at end of blog]

One of my projects over the summer is to do a systematic review of the validity of BLEU.  In other words, I want to find published peer-reviewed papers which test whether BLEU scores correlate with human evaluations (I discuss this in a previous blog, How to do an NLG Evaluation: Metrics), and present their findings in a structured fashion.  This kind of exercise is common in clinical medicine, where it is used to guide policy making by summarising evidence from a number of studies.  However, I don’t believe any systematic reviews have been done in NLP.  I did a search on “systematic review” in the ACL Anthology, and the only hit was a paper on using NLP to help conduct systematic reviews; there were no papers presenting systematic reviews of NLP.

A systematic review is different from, and weaker than, a meta-analysis.  In a meta-analysis, we combine the results of several small experiments and treat them as if they were one large experiment.  Eg, a meta-analysis would analyse 5 experiments with 100 subjects each as if these were part of a single experiment with 500 subjects.  A meta-analysis is only possible if the component experiments have a very similar design, and I doubt this will be the case with validation studies of BLEU.  Hence I will attempt a systematic review, following the PRISMA flow diagram.

Below I give my study design.  I am doing this mostly to get feedback and suggestions for improving the design (since this is new in NLP), and also because “best practice” in such studies is to formally publish the design before undertaking the study.

Comments and suggestions are very welcome (by email), especially before I start the actual study!

Research Question

Can BLEU be used to determine which NLP systems are most useful to (or most preferred by) human users?

Database searching (ie, finding potentially relevant papers)

I intend to do a title search on the ACL Anthology, looking for the following keywords:

  • BLEU
  • NIST (this is to find papers about NIST’s variant of BLEU)
  • automatic human
  • metric human
  • automatic validity
  • metric validity
  • intrinsic extrinsic
  • evaluating evaluation

Title search on the ACL Anthology most certainly does not pick up all of the relevant papers (ie, recall is a lot lower than 100%).  I investigated using full-text search instead of title search, and Google Scholar instead of ACL Anthology, but both of these returned a huge number of false positives (ie, precision was very low).  For example, a title search on “BLEU” in ACL Anthology returns 31 papers, of which at least 5 should be relevant to my review.  A Google Scholar title search on BLEU returns 4370 papers, and an ACL Anthology full-text search on BLEU returns 3810 papers, of which I suspect at most 10 would be relevant to my study.
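To make the precision gap concrete, here is a small back-of-the-envelope sketch in Python.  The hit counts are the ones quoted above; the "relevant" counts are my rough estimates (a lower bound for title search, an upper bound for full-text search), so the outputs are bounds rather than measurements.  I am assuming the "at most 10" estimate applies to the ACL Anthology full-text search.

```python
# Hit counts quoted in the text; the "relevant" figures are rough estimates.
title_hits, title_relevant_at_least = 31, 5
fulltext_hits, fulltext_relevant_at_most = 3810, 10

# Precision = relevant papers / papers returned.
title_precision_floor = title_relevant_at_least / title_hits
fulltext_precision_ceiling = fulltext_relevant_at_most / fulltext_hits

print(f"title search precision     >= {title_precision_floor:.1%}")
print(f"full-text search precision <= {fulltext_precision_ceiling:.2%}")
```

Even on these rough numbers, title search is at least an order of magnitude more precise than full-text search.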

It is a pity that the ACL Anthology does not support abstract and keyword search, as this would raise recall in this kind of endeavour without totally destroying precision.

Question: are there other keywords I should search on and/or other databases?  Also let me know about individual papers which you think might be relevant.  Papers must be properly archived, see Archival Status below in the Issues section.

Screening/exclusion Criteria

I will exclude papers not written in English, since this is the only language I know.

In order to be included in my study, the paper must present both BLEU and human evaluation scores for a number of NLP systems, and calculate a numerical correlation.  The evaluation must be of systems, not individual texts, since the research question is about whether BLEU correlates with human evaluations of systems.

I will also exclude papers that correlate evaluation scores amongst 4 or fewer systems.  Ideally this criterion would be based on a power calculation, but it is very rare for an ACL paper to present a power calculation.  5 pairs of points (points are [BLEU score, human score] for an NLP system) are the minimum needed to have a chance of showing a statistically significant (p<0.05, two-tailed) Spearman correlation, so I will exclude studies with fewer than 5 systems.  For this purpose, I will allow papers to include human translations as “systems”, provided these human translations are not also used as reference texts.  For example, the original BLEU ACL paper evaluated 3 MT systems and 2 human translators; I will accept this.
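The 5-system threshold can be checked with a short calculation.  The sketch below assumes an exact permutation test with no tied ranks: under the null hypothesis all n! rank orderings are equally likely, exactly one gives a Spearman correlation of +1 and one gives -1, so the smallest achievable two-tailed p-value is 2/n!.  The function name is just for illustration.

```python
from math import factorial


def min_spearman_p(n_systems: int, two_tailed: bool = True) -> float:
    """Smallest exact p-value Spearman's rho can reach with n untied points.

    Under the null hypothesis each of the n! rank orderings is equally
    likely; one ordering gives rho = +1 and one gives rho = -1.
    """
    tails = 2 if two_tailed else 1
    return min(1.0, tails / factorial(n_systems))


for n in range(2, 7):
    p = min_spearman_p(n)
    verdict = "can reach p < 0.05" if p < 0.05 else "cannot reach p < 0.05"
    print(f"{n} systems: smallest two-tailed p = {p:.4f} ({verdict})")
```

With 4 systems even a perfect rank correlation gives a two-tailed p of 2/24 (about 0.083), whereas with 5 systems it gives 2/120 (about 0.017).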

Question: are these exclusion criteria too strict?  Also see Sentence Level Correlation below in the Issues section.

Full-text eligibility criteria

I may also exclude a small number of papers for other reasons, such as presenting results which have already been presented elsewhere (eg, journal version of a conference paper).  Each experiment should only be reported once.  All such exclusions will be documented in my write-up.

If a paper presents multiple validation experiments, these will be shown separately in the systematic review.  In other words, the review is of experiments, not of papers.


For each eligible paper/experiment, I will extract the following information.

[NLP systems in the study]

  • Type of system: MT, NLG, …
  • Domain: newswire, weather forecast, …
  • How many systems in study: 5, 10, …
    • NOTE: As above, studies with fewer than 5 systems will be excluded
  • Technologies used in systems: statistical, rule-based, …
    • NOTE: included because previous research suggests BLEU validity is highest when comparing statistical systems to other statistical systems. See Representative Systems below in the Issues section.

[BLEU details]

  • Type of BLEU used: BLEU-4, NIST, …
  • Number of reference texts/translations for each input: 1, 2, 4, …
  • Source of reference texts/translations: naturally occurring, Mechanical Turk, …

[Human Evaluations]

  • Type of human study: task/outcome based; ratings where subjects see both input and output; ratings where subjects see output but do not see or cannot read input
    • NOTE: Last category includes MT evaluations where subjects are shown the MT output and a reference translation, but not the source language text; and MT evals where subjects are shown source language text but cannot read it because they do not know the source language.
  • Type of rating: fluency, adequacy, overall, n/a
  • Scale of rating: 5-pt Likert, …
  • Do subjects understand the texts they are reading (eg, have relevant domain knowledge if required): yes, no
  • Interannotator agreement between humans: unknown, 0.3, …
    • See Human Agreement below

[Calculation of Correlations]

  • Type of correlation: Pearson, Spearman, …
    • NOTE: Spearman is more statistically robust, and a better fit to the research question
  • Significance analysis: none, 1-tailed p value, 2-tailed p value
  • Multiple-hypothesis correction applied: none, Bonferroni, …
    • NOTE: For example, a paper which looks at the correlation of BLEU-4 and NIST with human adequacy and fluency judgements would be testing 4 hypotheses (BLEU-4 vs human adequacy, BLEU-4 vs fluency, NIST vs human adequacy, NIST vs fluency).  Are significance results adjusted for this?
  • Potential bias: developer of metric being validated, developer of competitor metric


  • Correlation: eg, BLEU-4 has 0.85 correlation with human fluency ratings
  • Statistical significance of correlation: not calculated, p < .05, p > .05,  p < .01, …
  • Comment:
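As an illustration of the multiple-hypothesis item above, here is a minimal sketch of a Bonferroni correction for the 4-hypothesis example (two metrics crossed with two human ratings).  Bonferroni is only one possible correction, and the 0.05 starting threshold is an assumption.

```python
def bonferroni_threshold(alpha: float, n_hypotheses: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_hypotheses


metrics = ["BLEU-4", "NIST"]
human_ratings = ["adequacy", "fluency"]

# Every metric is correlated with every human rating: 2 x 2 = 4 hypotheses.
hypotheses = [(m, h) for m in metrics for h in human_ratings]

threshold = bonferroni_threshold(0.05, len(hypotheses))
print(f"{len(hypotheses)} hypotheses -> each test needs p < {threshold:.4f}")
```

So a paper in this situation that reports p < .05 for one of the four correlations, with no correction, has not shown significance at the 0.05 level overall.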


Question: is there other information I should try to extract from the papers?


I will present the above analysis data in a table or spreadsheet, and discuss what I think we can learn from this.  Ideally I would conclude by saying “BLEU is a good way of assessing system utility” or “BLEU is a poor way of assessing system utility”, but I strongly suspect that the picture will be considerably more nuanced.  At any rate, the systematic analysis should clarify what we do (and do not) know about this topic.

Appendix: Example Analysis

Below is how I would analyse the validation experiment presented in the original BLEU paper, Papineni et al 2002:

  • Type of system: MT
  • Domain: newswire
  • How many systems in study: 5; 2 “systems” are actually human translations
  • Technologies used in systems: not specified
  • Type of BLEU used: BLEU-4
  • Number of reference texts/translations for each input: 2 (in validation study)
  • Source of reference texts/translations: not specified
  • Type of human study: Monolingual Group: ratings where subjects see output but cannot read input; Bilingual Group: ratings where subjects see both input and output
  • Type of rating: overall
  • Scale of rating: [1,2,3,4,5]
  • Do subjects understand the texts they are reading (eg, have relevant domain knowledge if required): yes
  • Interannotator agreement between humans: unknown
  • Type of correlation: Pearson
  • Significance analysis: none
  • Multiple-hypothesis correction applied: none
  • Potential bias: developer of metric being validated
  • Correlation: 0.99 (Monolingual Group), 0.96 (Bilingual Group)
  • Statistical significance of correlation: not calculated
  • Comment: If the human translations were excluded from the validation study, Spearman rank correlation would certainly not be statistically significant
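Since the plan is to present the extracted data in a table or spreadsheet, the record above could be held in a simple structure and written out as one spreadsheet row.  This is only a sketch: the class name and field names are made up for illustration, and the values are copied from the example analysis above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ValidationExperiment:
    """One row of the review; one validation experiment (hypothetical schema)."""
    system_type: str
    domain: str
    n_systems: int
    technologies: str
    bleu_type: str
    n_references: int
    reference_source: str
    human_study_type: str
    rating_type: str
    rating_scale: str
    subjects_understand_texts: str
    interannotator_agreement: str
    correlation_type: str
    significance_analysis: str
    multiple_hypothesis_correction: str
    potential_bias: str
    correlation: str
    significance: str
    comment: str = ""


papineni_2002 = ValidationExperiment(
    system_type="MT",
    domain="newswire",
    n_systems=5,  # 2 of the "systems" are human translations
    technologies="not specified",
    bleu_type="BLEU-4",
    n_references=2,
    reference_source="not specified",
    human_study_type="monolingual: output only; bilingual: input and output",
    rating_type="overall",
    rating_scale="[1,2,3,4,5]",
    subjects_understand_texts="yes",
    interannotator_agreement="unknown",
    correlation_type="Pearson",
    significance_analysis="none",
    multiple_hypothesis_correction="none",
    potential_bias="developer of metric being validated",
    correlation="0.99 (Monolingual Group), 0.96 (Bilingual Group)",
    significance="not calculated",
)

# Write a header line plus one data row, as it would appear in a CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=[f.name for f in fields(ValidationExperiment)]
)
writer.writeheader()
writer.writerow(asdict(papineni_2002))
print(buffer.getvalue())
```

A paper reporting several experiments would simply contribute several such rows, in line with the review being of experiments rather than papers.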

Issues Raised

Sentence-level correlation: It has been pointed out to me that some validation studies use fewer than 5 systems, but can nevertheless achieve statistical significance because they correlate BLEU/human scores at the text level instead of the system level.  For example, assume we have two MT systems, MT1 and MT2, which are evaluated on 100 source-language texts.  If we compute BLEU scores for MT1 and MT2 and correlate with human evals of MT1 and MT2, then significance is impossible.  However, if we compute a BLEU score for each of the 200 translations (MT1-text1, MT2-text1, MT1-text2, … MT2-text100) and correlate with a human evaluation of each of these 200 translations, then we can certainly achieve statistical significance.

This is definitely a valid point, but I feel uneasy with this approach/design (sentence-level correlation).  BLEU was designed to evaluate systems, not individual translations, so I think it should be assessed on how well it evaluates systems (which is also the research question).  It is arguably unfair to BLEU to assess it on the basis of how well it evaluates individual translations (or NLG texts), since this is not what it was designed to do.
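To give a feel for how easy significance becomes with 200 data points, here is a rough sketch.  It assumes the standard large-sample approximation that, under the null hypothesis, Spearman's rho times sqrt(n - 1) is approximately standard normal, and it treats the 200 translations as independent points (which is itself questionable, since each source text is translated by both systems).

```python
from math import sqrt
from statistics import NormalDist


def approx_critical_rho(n_points: int, alpha: float = 0.05) -> float:
    """Approximate |rho| needed for two-tailed significance at level alpha.

    Uses the large-sample normal approximation for Spearman's rho, so it
    is only meaningful when n_points is reasonably large.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(n_points - 1)


print(f"200 translations: |rho| above about {approx_critical_rho(200):.3f}")
```

Under these assumptions a correlation of roughly 0.14 is already enough at the text level, which is far weaker than anything a system-level study could report as significant.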

Representative Systems: Someone pointed out to me that proper testing of the research question requires that the systems being evaluated in the validation study are representative.  In other words, if we want to show that BLEU correlates with human evaluations when evaluating MT systems, then we should randomly select MT systems to participate in the validation study, or otherwise ensure that they are representative.  As far as I know this has never been done or indeed even been seriously considered.

This matters because using unrepresentative systems may lead to false conclusions about correlation between BLEU and human evaluations.  For example, we know that BLEU is biased against rule-based systems; eg if BLEU is used to evaluate a rule-based and a statistical system which are of equivalent utility to human users, the statistical system will almost certainly get a much higher BLEU score.  Therefore a validation exercise conducted purely on statistical systems would probably show a much higher BLEU-human correlation than a validation exercise which was conducted on a representative mix of systems built using different technologies.

Anyways, this is again a very valid point, but the purpose of my systematic review is to gather information (including technologies used in systems) from existing validation studies, not to conduct a new validation study.  Hopefully the information I gather will give the community a better understanding of how BLEU-human correlation changes in different contexts, including the type of system evaluated and also the domain.

Human Agreement: I have started reading some of the ACL SIGMT workshop proceedings.   Most of these report how well human raters/evaluators agreed with each other, typically as a kappa, and I was somewhat shocked at how low many kappa agreement values were.  Kappa below 0.4 was very common, and I even saw kappa below 0.2 in some cases.  Low kappa values raise concerns about the robustness of the human evaluations (eg, would we get the same human evaluations with different evaluators).

Anyways, I have decided this information is important in understanding what we can learn from metric validation studies, so I will include it when available.
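For readers unfamiliar with kappa, here is a minimal sketch of Cohen's kappa for two raters.  I am assuming Cohen's variant purely for illustration (the workshop papers may use other variants), and the ratings below are made-up toy data, not taken from any paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)

    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgements (1 = acceptable, 0 = not) on 10 texts.
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

In this toy example the raters agree on 7 of 10 texts, but half of that agreement would be expected by chance, so kappa is only 0.40.  This is why raw agreement percentages can look much healthier than kappa values.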

Archival Status: I have asked for suggestions of relevant papers; many thanks to everyone who has contributed.  However, I am struggling because a few very relevant studies are not properly published in an archival venue such as the ACL Anthology.  For example, the NIST MetricsMaTr08 workshop is certainly relevant, but it’s hard to find, because it’s just a URL, which has changed over the past 10 years (eg, a lot of links to MetricsMaTr08 from published papers are now dead).

This is a problem because a systematic review is supposed to be reproducible, which means that the underlying papers need to be easy to find in a stable archive, in 2022 and indeed 2027 as well as 2017.  Unfortunately MetricsMaTr08 does not meet this criterion, so I think I will exclude it.
