Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Tag: metrics

Uncategorized

Small differences in BLEU are meaningless

Jul 28, 2020 ehudreiter 6 Comments

I was very impressed by a paper we recently read in our reading group, which showed that small differences in BLEU scores for MT usually don't mean anything. Since lots of academic papers justify a new model on the basis of such small differences, this is a real problem for NLP.

Uncategorized

Why do we still use 18-year-old BLEU?

Mar 2, 2020 ehudreiter 6 Comments

NLP technology has changed and advanced over the past two decades, but it often seems that NLG evaluation has not. Why is the 18-year-old BLEU metric still so dominant?

Uncategorized

Evaluation Grand Challenge: Is NLP System Good Enough for a Use Case?

Feb 21, 2019 ehudreiter

I was recently asked by someone if it was possible to easily determine whether an NLP system was good enough for a specific use case. Currently this is very hard. Making it easy could be a “grand challenge” for evaluation!

Uncategorized

How Would I Automatically Evaluate NLG Systems?

Jul 25, 2018 (updated Aug 7, 2018) ehudreiter 1 Comment

Some musings on principled and theoretically sound techniques for automatically evaluating NLG systems.

Uncategorized

How to Validate Metrics

Jul 10, 2018 (updated Aug 7, 2018) ehudreiter 4 Comments

My advice on how to perform a high-quality validation study, which assesses whether a metric (such as BLEU) correlates well with human evaluations.

Uncategorized

BLEU in Different Languages: Don't use it for German

Jun 20, 2018 (updated Aug 7, 2018) ehudreiter 1 Comment

My structured survey of BLEU suggests that BLEU-human correlations are worse in German than in many other languages. But there are many caveats, so we need to be cautious in interpreting this result.

Uncategorized

BLEU-Human Correlation is Increasing: What does this Mean?

Jun 14, 2018 (updated Aug 7, 2018) ehudreiter 6 Comments

The correlation between BLEU and human evaluations of MT systems seems to be increasing over time. Since BLEU has not changed, how is this possible, and what does it mean?

Uncategorized

Learning does not require evaluation metrics

May 30, 2018 ehudreiter 3 Comments

I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.
