Many Papers on Machine Learning in NLP are Scientifically Dubious
In response to a previous blog, many people expressed concerns to me about the quality of papers they had seen on ML in NLP. I summarise some of these concerns, which are worrying.
I was recently asked whether machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.
If you are writing a scientific paper which presents statistics, please use two-tailed p values unless you **really** know what you are doing.
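To make this concrete, here is a minimal sketch (with made-up scores, assuming scipy is installed) of how one- and two-tailed p values differ for a simple t-test:

```python
# A minimal sketch of one- vs two-tailed p values when comparing two
# (hypothetical) sets of system scores; assumes scipy is installed.
from scipy import stats

scores_a = [0.71, 0.68, 0.74, 0.69, 0.72]  # hypothetical scores for system A
scores_b = [0.66, 0.70, 0.65, 0.67, 0.69]  # hypothetical scores for system B

# Two-tailed test: the alternative hypothesis is simply "the means differ".
# This is the safe default.
t, p_two_tailed = stats.ttest_ind(scores_a, scores_b)

# A one-tailed p value is half of this, but it is only valid if you fixed
# the direction of the hypothesis *before* looking at the data.
p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2

print(f"t = {t:.3f}, two-tailed p = {p_two_tailed:.4f}, "
      f"one-tailed p = {p_one_tailed:.4f}")
```

The one-tailed value is half the two-tailed one, which is exactly why quoting it without justification makes results look stronger than they are.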
Ehud’s guidelines for evaluating AI systems: keep it simple, keep it ethical, be careful, do proper stats, and be sceptical.
The NLP/AI community needs to do a better job of dealing with multiple hypotheses; otherwise a lot of our results will be garbage.
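One standard fix is a Bonferroni correction; here is a small sketch (the p values are hypothetical):

```python
# A sketch of a Bonferroni correction over several hypothesis tests;
# the p values below are hypothetical.
p_values = [0.04, 0.005, 0.03, 0.20, 0.049]
alpha = 0.05

# Testing 5 hypotheses at alpha = 0.05 each gives roughly a 23% chance of
# at least one false positive (1 - 0.95**5). Bonferroni divides alpha by
# the number of tests to keep the family-wise error rate at 0.05.
corrected_alpha = alpha / len(p_values)

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"hypothesis {i}: p = {p:.3f} -> {verdict} "
          f"(corrected alpha = {corrected_alpha:.3f})")
```

Note what happens here: uncorrected, four of the five tests look significant at 0.05; after correction, only one survives.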
The first phase of my systematic review of BLEU shows that BLEU-human correlations are all over the place, and that none of the studies in my review have correlated BLEU with real-world utility or user satisfaction.
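For readers unfamiliar with what is being measured, here is a small sketch of the underlying computation, Pearson's correlation between per-system BLEU scores and human ratings (all numbers hypothetical, assuming scipy):

```python
# A sketch of the correlation computation behind such studies: Pearson's r
# between per-system BLEU scores and mean human ratings (values hypothetical).
from scipy.stats import pearsonr

bleu_scores   = [0.31, 0.25, 0.40, 0.28, 0.35]  # hypothetical, one per system
human_ratings = [3.8, 3.1, 4.2, 3.9, 3.5]       # hypothetical mean ratings

r, p = pearsonr(bleu_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

"All over the place" means that r values like this range from strongly positive to near zero (or even negative) across the studies in the review, depending on task and dataset.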
Some observations on how people react to NLG systems (which is a very different issue from scientific evaluation).
I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!
Some explanation and advice about regression to the mean, a statistical phenomenon that can affect NLG evaluations.
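A small simulation (all parameters made up) shows the effect: if observed scores are true quality plus noise, the system that does best on one test set usually looks worse when re-evaluated:

```python
# A small simulation of regression to the mean (all parameters hypothetical):
# observed score = true quality + noise, so the top scorer on one test set
# tends to score closer to the mean when re-evaluated on a fresh one.
import random

random.seed(0)
true_quality = [random.gauss(0.70, 0.02) for _ in range(50)]  # 50 systems

def evaluate(quality):
    """Observed scores on one test set: true quality plus measurement noise."""
    return [q + random.gauss(0, 0.03) for q in quality]

run1 = evaluate(true_quality)
run2 = evaluate(true_quality)
best = max(range(len(run1)), key=lambda i: run1[i])  # winner of run 1

print(f"best system on run 1: {run1[best]:.3f}")
print(f"same system on run 2: {run2[best]:.3f}  (typically closer to the mean)")
```

This matters for NLG evaluation because the "winning" system was partly selected for lucky noise, so its advantage tends to shrink on fresh data.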
I am really dubious about evaluations based on BLEU and other metrics. I explain why, and also give advice on best practice for people who are committed to using metrics.
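For those who do use BLEU despite this, here is a minimal sketch (with hypothetical texts, assuming the sacrebleu package) of computing a corpus-level score:

```python
# A minimal sketch of computing a corpus-level BLEU score with the
# sacrebleu package; the hypotheses and references are hypothetical.
import sacrebleu

hypotheses = ["the cat sat on the mat",
              "there is a dog in the garden"]
# One reference stream, aligned line-by-line with the hypotheses.
references = [["the cat is on the mat",
               "a dog is in the garden"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```

If you do report BLEU, using a standard tool and stating its settings at least makes the number reproducible; it does not make it a substitute for human evaluation.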