I was very impressed by a recent paper from a team at Facebook about a production-ready end-to-end neural NLG system. Especially interesting to me was the “engineering” approach to key issues such as accuracy, data collection, and latency.
I was shocked when a PhD student recently told me that he thought he had to focus on end-to-end neural approaches, because these dominate the conferences he wants to publish in. I’m all for research on end-to-end neural approaches, but fixating on them to the exclusion of everything else is a mistake, especially since they do not currently work very well.
Many people have asked me if OpenAI’s GPT-3 will have a big impact on NLG. I suspect its overall impact will be limited (outside of a few niches), but of course time will tell.
A colleague asked me if it was true that building neural NLG systems was faster than building rule-based NLG systems. The answer is that we don’t know, because we don’t have good data on this question. However, the weak evidence we do have suggests that building rule-based NLG is no slower, and may be faster, than building neural NLG, at least for data-to-text systems.
Accuracy errors in NLG texts go far beyond simple factual mistakes; for example, they also include misleading use of words and incorrect context/discourse inferences. All of these types of errors are unacceptable in most data-to-text NLG use cases.
The BBC used Arria NLG to generate stories about the recent UK election. In this application, texts communicated a meaning, there was no corpus, accuracy was paramount, and domain experts wanted to control the system. Most applied NLG systems I have worked on have had similar constraints.
I’ve been shocked by the fact that many neural NLG researchers don’t seem to care that their systems produce texts which contain many factual mistakes and hallucinations. NLG users expect accurate texts, and will not use systems which produce inaccurate texts, no matter how well the texts are written.
Many neural NLG systems “hallucinate” non-existent or incorrect content. This is a major problem, since such hallucination is unacceptable in many (most?) NLG use cases. Also, BLEU and related metrics do not detect hallucination well, so researchers who rely on such metrics may be misled about the quality of their system.
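To make the point about overlap metrics concrete, here is a minimal sketch. It is not real BLEU: it computes only clipped unigram precision against a single reference, which is the basic ingredient of BLEU-style metrics. The function name `ngram_precision` and the three example sentences are made up for illustration and do not come from any real system or corpus.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of a candidate against one reference.

    This is a simplified stand-in for BLEU (no brevity penalty, no
    averaging over n-gram orders), used here only for illustration.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# Hypothetical example texts (invented for this sketch).
reference = "the temperature will rise to 25 degrees on monday"
hallucinated = "the temperature will rise to 35 degrees on monday"  # wrong number
paraphrase = "monday will be warmer with a high of 25 degrees"      # accurate, reworded

score_hallucinated = ngram_precision(hallucinated, reference)
score_paraphrase = ngram_precision(paraphrase, reference)

print(f"hallucinated: {score_hallucinated:.2f}")
print(f"paraphrase:   {score_paraphrase:.2f}")
```

In this toy case the text with the wrong temperature shares eight of its nine words with the reference, while the accurate paraphrase shares only four of ten, so the inaccurate text gets the higher score. Real BLEU adds higher-order n-grams and a brevity penalty, but it rewards surface overlap in the same way.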