Many neural NLG systems “hallucinate” non-existent or incorrect content. This is a major problem, since such hallucination is unacceptable in many (most?) NLG use cases. Also BLEU and related metrics do not detect hallucination well, so researchers who rely on such metrics may be misled about the quality of their system.
Unfortunately I suspect many researchers make their results look better by using poor baselines. I give some thoughts on this, based on a recent discussion with a PhD student.
In response to a previous blog, many people expressed concerns to me about the quality of many papers they saw on ML in NLP. I summarise some of these concerns, which are worrying.
I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.
Lexical choice is an area of NLG which really needs machine-learning and data-based techniques.
I went to my first developers conference last week and was impressed, not least by the sensible attitude towards deep learning and other trendy AI technology.
There is a lot of hype around deep learning, especially at business-oriented AI events. I suggest some questions to think about for companies who are considering using DL.
I think we should use rules to make simple high-value decisions, and learning to make complex low-value decisions, within an architecture where ML decision makers are embedded in a rules-based framework.
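As a minimal sketch of this hybrid architecture (all names and rules here are hypothetical, not from the post): explicit rules handle the simple high-value decisions, and an ML model is only consulted for the remaining complex low-value decisions, inside a rules-first framework.

```python
# Hypothetical sketch: rules make simple high-value decisions;
# ML is embedded in the rules framework for everything else.

def rule_based_decision(case):
    """Explicit rules for simple, high-value decisions (assumed domain logic)."""
    if case.get("risk") == "high":
        return "escalate"   # high-value decision: always handled by a rule
    if case.get("amount", 0) == 0:
        return "ignore"     # trivial case: a rule suffices
    return None             # no rule applies; fall through to ML

def ml_decision(case):
    """Stand-in for a learned classifier (e.g. a trained model's predict)."""
    return "approve" if case.get("amount", 0) < 100 else "review"

def decide(case):
    """Rules-first framework: the ML component is only consulted
    when no rule fires, so high-value behaviour stays predictable."""
    decision = rule_based_decision(case)
    if decision is not None:
        return decision
    return ml_decision(case)

print(decide({"risk": "high", "amount": 500}))  # rule fires: escalate
print(decide({"risk": "low", "amount": 50}))    # falls through to ML: approve
```

The design point is that the rule layer owns the final say: the learned component can never override a high-value rule, which keeps the system's critical behaviour auditable.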
I am concerned that some people seem to ignore quality issues in training data.
My response to Goldberg’s “adversarial review” of some research on using deep learning in NLG.