Ehud Reiter's Blog

Ehud's thoughts and observations about Natural Language Generation

Tag: BLEU

Uncategorized

Small differences in BLEU are meaningless

Jul 28, 2020, by ehudreiter, 6 Comments

I was very impressed by a paper we recently read in our reading group, which showed that small differences in BLEU scores for MT usually don't mean anything. Since lots of academic papers justify a new model on the basis of such small differences, this is a real problem for NLP.

Why do we still use 18-year-old BLEU?

Mar 2, 2020, by ehudreiter, 3 Comments

NLP technology has changed and advanced over the past two decades, but it often seems that NLG evaluation has not. Why is the 18-year-old BLEU metric still so dominant?

Evaluation Grand Challenge: Is NLP System Good Enough for a Use Case?

Feb 21, 2019, by ehudreiter

I was recently asked by someone if it was possible to easily determine whether an NLP system was good enough for a specific use case. Currently this is very hard. Making it easy could be a “grand challenge” for evaluation!

Hallucination in Neural NLG

Nov 12, 2018, by ehudreiter, 17 Comments

Many neural NLG systems “hallucinate” non-existent or incorrect content. This is a major problem, since such hallucination is unacceptable in many (most?) NLG use cases. Also, BLEU and related metrics do not detect hallucination well, so researchers who rely on such metrics may be misled about the quality of their systems.

How to Validate Metrics

Jul 10, 2018 (updated Aug 7, 2018), by ehudreiter, 3 Comments

My advice on how to perform a high-quality validation study, which assesses whether a metric (such as BLEU) correlates well with human evaluations.

Why doesn't BLEU work for NLG?

Jul 2, 2018 (updated Aug 7, 2018), by ehudreiter, 7 Comments

BLEU works much better for MT systems than for NLG systems. In this blog post I present some speculations as to why this is the case.

BLEU in Different Languages: Don't use it for German

Jun 20, 2018 (updated Aug 7, 2018), by ehudreiter, 1 Comment

My structured survey of BLEU suggests that BLEU-human correlations are worse in German than in many other languages. But there are many caveats, so we need to be cautious in interpreting this result.

BLEU-Human Correlation is Increasing: What does this Mean?

Jun 14, 2018 (updated Aug 7, 2018), by ehudreiter, 6 Comments

The correlation between BLEU and human evaluations of MT systems seems to be increasing over time. Since BLEU has not changed, how is this possible, and what does it mean?

Is BLEU valid? First observations and concerns

Aug 8, 2017, by ehudreiter, 2 Comments

The first phase of my systematic review of BLEU shows that BLEU-human correlations are all over the place, and that none of the studies in my review have correlated BLEU with real-world utility or user satisfaction.

Study Design for Systematic Review of BLEU Validity: Comments Welcome!

Jun 13, 2017 (updated Jun 13, 2018), by ehudreiter, 5 Comments

I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!
