Many Papers on Machine Learning in NLP are Scientifically Dubious

Jun 6, 2018 | ehudreiter | 1 Comment

In response to a previous blog post, many people told me they were concerned about the quality of many of the papers they see on ML in NLP. I summarise some of these concerns, which are worrying.


Learning does not require evaluation metrics

May 30, 2018 | ehudreiter | 3 Comments

I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.


Please Use Two-Tailed P Values!

Jan 29, 2018 | ehudreiter | 1 Comment

If you are writing a scientific paper which presents statistics, please use two-tailed p values unless you **really** know what you are doing.
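As a rough illustration of the difference (this sketch is not taken from the post itself), suppose two systems are compared on 30 items and system A is preferred on 20 of them. The example below assumes an exact sign test under the null hypothesis that each system is equally likely to win; the function names are hypothetical and the counts are made up for illustration.

```python
from math import comb

def binomial_tail_ge(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def sign_test_p_values(wins, n):
    """Return (one_tailed, two_tailed) p values for an exact sign test.

    one_tailed: probability of seeing at least `wins` wins by chance.
    two_tailed: probability of a result at least this extreme in
                either direction (twice the smaller tail, capped at 1).
    """
    upper = binomial_tail_ge(wins, n)            # P(X >= wins)
    lower = 1.0 - binomial_tail_ge(wins + 1, n)  # P(X <= wins)
    one_tailed = upper
    two_tailed = min(1.0, 2 * min(upper, lower))
    return one_tailed, two_tailed

# Illustrative counts only: A preferred on 20 of 30 items.
one_tailed, two_tailed = sign_test_p_values(20, 30)
print(f"one-tailed p = {one_tailed:.3f}")  # about 0.049
print(f"two-tailed p = {two_tailed:.3f}")  # about 0.099
```

With these counts the one-tailed value falls just under 0.05 while the two-tailed value does not, which is the kind of borderline case where the choice of test changes the reported conclusion.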


My Guidelines for Evaluating AI Systems

Nov 21, 2017 | ehudreiter

Ehud’s guidelines for evaluating AI systems: keep it simple, keep it ethical, be careful, do proper stats, and be skeptical.


Testing Multiple Hypotheses

Oct 27, 2017 | ehudreiter | 3 Comments

The NLP/AI community needs to do a better job of dealing with multiple hypotheses; otherwise, a lot of our results will be garbage.
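One common way to handle several hypotheses tested together is the Holm-Bonferroni procedure. The sketch below is an assumption about what such a correction could look like, not necessarily the approach the post recommends; the function name and the p values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    The p values are visited from smallest to largest. The smallest is
    compared against alpha / m, the next against alpha / (m - 1), and
    so on. Testing stops at the first p value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p value is also not rejected
    return reject

# Made-up p values from four comparisons, purely for illustration.
p_values = [0.01, 0.04, 0.03, 0.005]
print([p < 0.05 for p in p_values])  # [True, True, True, True]
print(holm_bonferroni(p_values))     # [True, False, False, True]
```

Without correction all four comparisons look significant at 0.05; after the correction only two survive.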


Is BLEU valid? First observations and concerns

Aug 8, 2017 | ehudreiter | 2 Comments

The first phase of my systematic review of BLEU shows that BLEU-human correlations are all over the place, and that none of the studies in my review have correlated BLEU with real-world utility or user satisfaction.
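For readers unfamiliar with how a metric-human correlation is computed, here is a minimal sketch. It assumes each system (or text) has one metric score and one human rating, and it computes Pearson and Spearman coefficients by hand; the helper names and all numbers are hypothetical and are not drawn from the review.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up scores for five systems, purely for illustration.
metric_scores = [0.21, 0.25, 0.30, 0.34, 0.41]
human_ratings = [3.1, 2.8, 3.9, 3.5, 4.2]
print(round(pearson(metric_scores, human_ratings), 2))
print(round(spearman(metric_scores, human_ratings), 2))
```

A high coefficient on one data set says nothing about real-world utility or user satisfaction, which is the gap the excerpt points out.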


How do Users React to NLG?

Jul 21, 2017 | ehudreiter

Some observations on how people react to NLG systems (which is a very different issue from scientific evaluation).


Study Design for Systematic Review of BLEU Validity: Comments Welcome!

Jun 13, 2017 (updated Jun 13, 2018) | ehudreiter | 5 Comments

I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!


Regression to Mean

May 18, 2017 | ehudreiter | 1 Comment

Some explanation and advice about regression to the mean, a statistical phenomenon that can affect NLG evaluations.
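A small simulation can make the phenomenon concrete. The sketch below assumes each system has a fixed true quality and that every evaluation adds independent Gaussian noise; it then picks the systems that scored best in a first evaluation and looks at how the same systems score in a second one. The function name, distributions and parameters are all illustrative assumptions.

```python
import random
from statistics import mean

def simulate_regression_to_mean(n_systems=10_000, top_fraction=0.1, seed=0):
    """Return (first_mean, second_mean) for the top scorers of round one."""
    rng = random.Random(seed)
    # Assumed model: observed score = true quality + independent noise.
    true_quality = [rng.gauss(0, 1) for _ in range(n_systems)]
    first = [q + rng.gauss(0, 1) for q in true_quality]
    second = [q + rng.gauss(0, 1) for q in true_quality]

    # Select the systems that looked best in the first evaluation.
    n_top = int(n_systems * top_fraction)
    top = sorted(range(n_systems), key=lambda i: first[i], reverse=True)[:n_top]

    first_mean = mean([first[i] for i in top])
    second_mean = mean([second[i] for i in top])
    return first_mean, second_mean

first_mean, second_mean = simulate_regression_to_mean()
print(f"top group, first evaluation:  {first_mean:.2f}")
print(f"top group, second evaluation: {second_mean:.2f}")
```

The selected group scores noticeably lower the second time even though nothing about the systems changed, because part of what put them in the top group was favourable noise.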


How to do an NLG Evaluation: Metrics

May 3, 2017 (updated May 5, 2017) | ehudreiter | 6 Comments

I am really dubious about evaluations based on BLEU and other metrics. I explain why, and also give advice on best practice for people who are committed to using metrics.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.

Tag: evaluation
