Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Tag: evaluation

Uncategorized

Many Papers on Machine Learning in NLP are Scientifically Dubious

Jun 6, 2018 ehudreiter 1 Comment

In response to a previous blog, many people expressed concerns to me about the quality of many papers they saw on ML in NLP. I summarise some of these concerns, which are worrying.

Uncategorized

Learning does not require evaluation metrics

May 30, 2018 ehudreiter 3 Comments

I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.

Uncategorized

Please Use Two-Tailed P Values!

Jan 29, 2018 ehudreiter 1 Comment

If you are writing a scientific paper which presents statistics, please use two-tailed p values unless you **really** know what you are doing.

Uncategorized

My Guidelines for Evaluating AI Systems

Nov 21, 2017 ehudreiter

Ehud’s guidelines for evaluating AI systems: keep it simple, keep it ethical, be careful, do proper stats, and be skeptical.

Uncategorized

Testing Multiple Hypotheses

Oct 27, 2017 ehudreiter 3 Comments

The NLP/AI community needs to do a better job of dealing with multiple hypotheses; otherwise, a lot of our results will be garbage.

Uncategorized

Is BLEU valid? First observations and concerns

Aug 8, 2017 ehudreiter 2 Comments

The first phase of my systematic review of BLEU shows that BLEU-human correlations are all over the place, and that none of the studies in my review have correlated BLEU with real-world utility or user satisfaction.

Uncategorized

How do Users React to NLG?

Jul 21, 2017 ehudreiter

Some observations on how people react to NLG systems (which is a very different issue from scientific evaluation).

Uncategorized

Study Design for Systematic Review of BLEU Validity: Comments Welcome!

Jun 13, 2017 (updated Jun 13, 2018) ehudreiter 5 Comments

I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!

Uncategorized

Regression to Mean

May 18, 2017 ehudreiter 1 Comment

Some explanation and advice about regression to the mean, which is a statistical phenomenon that can affect NLG evaluations.

Uncategorized

How to do an NLG Evaluation: Metrics

May 3, 2017 (updated May 5, 2017) ehudreiter 6 Comments

I am really dubious about evaluations based on BLEU and other metrics. I explain why, and also give advice on best practice for people who are committed to using metrics.

News: I am likely to retire in summer 2026. Looking for interesting things to do afterwards.
