Unfortunately, I see many students (and indeed other people) make some basic mistakes when evaluating machine learning, for classifiers as well as NLG.
An important difference between different approaches to building NLG systems is the skills needed to use these approaches to build systems. Machine learning requires the most skills, smart templating the least, and simplenlg-type programmatic approaches are in the middle.
In both NLG and MT contexts, deep learning approaches can result in texts which are fluent and readable but also incorrect and misleading. This is problematical if accuracy is more important than readability, as is the case in most NLG contexts.
Many neural NLG systems “hallucinate” non-existent or incorrect content. This is a major problem, since such hallucination is unacceptable in many (most?) NLG use cases. Also BLEU and related metrics do not detect hallucination well, so researchers who rely on such metrics may be misled about the quality of their system.
Unfortunately I suspect many researchers make their results looks better by using poor baselines. I give some thoughts on this, based on a recent discussion with a PhD student.
In response to a previous blog, many people expressed concerns to me about the quality of many papers they saw on ML in NLP. I summarise some of these concerns, which are worrying.
I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.