In addition to the half-dozen PhD students I supervise at Aberdeen, I also try to help a number of PhD students at other institutions. One of these students is working in a healthcare domain, and has been spending a lot of time reading about end-to-end neural approaches. We had a chat recently, and I told him that I doubted such approaches would work in his domain: not enough training data, and major safety issues if the system hallucinated or otherwise produced inaccurate texts. He responded that he felt he had to go down this route, because he needs to publish, and this is the only kind of paper which he sees at ACL-type conferences.
I was a bit shocked by this, but I do understand why he got this impression. And this really bugs me. I’m all for researchers working on end-to-end neural if it’s their passion, but people (especially early-stage researchers) should not think that this is the only option! Especially because, to be honest, I am unimpressed by what I have seen to date from end-to-end neural.
End-to-end neural NLG does not work in 2020
Last week I read a 2020 end-to-end neural paper where the author was proud that 50% of the sentences produced by his system were factually accurate! And this was with a fairly narrow definition of accuracy. A few months ago we looked at an end-to-end neural system which won a best paper award, and discovered that it produced some texts which were completely nonsensical; they weren’t just wrong, they didn’t make any sense from a semantic perspective. Again I was not impressed.
My view is that, at least in 2020, end-to-end neural NLG techniques do not generate texts of acceptable quality if the texts are longer than 10-20 words. This is certainly true for data-to-text, which is what I focus on. I suspect it’s also true of other NLG applications. For example, I am now seeing a lot of work on generating fiction, presumably since factual accuracy is not a problem in fiction. Consistency, however, is important. For example, in a computer game, a non-player-character (NPC) cannot say on one occasion that he was born in Swordville, and on another occasion that he was born in Magictown. Although this is not my speciality, what I have seen of fictional material produced by neural end-to-end suggests that these systems struggle to maintain consistency in longer texts.
Of course many researchers are working on these problems, and I wish them the best of luck! However, I also see a lot of neural end-to-end researchers who either ignore accuracy completely, or evaluate accuracy in ways which are not very meaningful. Craig Thomson and I have proposed a way of evaluating accuracy which we believe is rigorous, but it is expensive and time-consuming; we’ll see whether researchers (and indeed reviewers) prefer “cheap-but-dubious” or “rigorous-but-expensive” evaluations.
Use ML for NLG Tasks and Components
Another approach, which to me seems much more successful, is to use machine learning (of all types, not just neural deep-learning) to build NLG components and perform NLG tasks. For example, I think ML could be hugely useful in content selection, where the NLG system decides which insights to include in its text; note that as long as all potential insights are accurate, insight selection will not lead to hallucinated inaccurate texts. I also think ML would be a huge help in lexical choice, although here I would recommend a transparent learning framework instead of neural, so that we can quality-check the model for accuracy problems. I’m also seeing growing interest, commercially as well as academically, in using ML paraphrasing techniques to add variation to NLG texts.
I could go on, but the point is that if I’m trying to build a useful NLG system in 2020 and want to use ML, using ML in tasks and components makes a lot more sense (at least in the kinds of NLG that I do) than end-to-end approaches. Indeed, Castro Ferreira et al. (2019) showed that using ML within NLG components resulted in better texts than end-to-end approaches.
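To make the content-selection point concrete, here is a minimal sketch in Python. It is an illustration, not a description of any real system: the `Insight` class, the two numeric features, the training examples and the helper names are all hypothetical. The learned model only ranks candidate insights whose text was produced (and assumed checked) upstream by rules or analytics, so whatever it selects, it cannot introduce a statement that was not already in the candidate pool. The scorer is a small logistic regression, chosen because its weights can be inspected directly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Insight:
    text: str        # written by rules/analytics; assumed already verified as accurate
    features: tuple  # hypothetical features, e.g. (magnitude_of_change, is_recent)


def train_selector(examples, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression scorer with plain SGD.

    `examples` is a list of (features, label) pairs, where label is 1 if a
    human writer included that kind of insight in a reference text, else 0.
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            pred = 1.0 / (1.0 + math.exp(-z))
            err = label - pred
            weights = [w + lr * err * x for w, x in zip(weights, features)]
            bias += lr * err
    return weights, bias


def select_insights(candidates, model, k=2):
    """Return the k highest-scoring candidates; the texts themselves are never altered."""
    weights, bias = model

    def score(insight):
        return bias + sum(w * x for w, x in zip(weights, insight.features))

    return sorted(candidates, key=score, reverse=True)[:k]


# Made-up training data: insights with a large magnitude tended to be included.
training = [((0.9, 1.0), 1), ((0.8, 0.0), 1), ((0.1, 1.0), 0), ((0.2, 0.0), 0)]
model = train_selector(training)

candidates = [
    Insight("Heart rate rose sharply overnight.", (0.95, 1.0)),
    Insight("Temperature was essentially unchanged.", (0.05, 1.0)),
    Insight("Blood pressure fell steadily during the day.", (0.70, 1.0)),
]
for insight in select_insights(candidates, model, k=2):
    print(insight.text)
```

Because the model is just a weight per feature plus a bias, a developer can read off why an insight was ranked highly, which is the kind of quality-checking a transparent learning framework allows.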
Another intriguing possibility is to use ML within authoring tools, i.e., to help NLG developers create content for a rules/template-based NLG system. Academics seem to have zero interest in this kind of thing, which is a shame, since I think there is real potential here.
I’m not saying that people should stop working on end-to-end neural! But I do strongly believe that young researchers should not feel forced to use this approach. In general I think academic research is more useful to society if academics explore many approaches instead of all doing similar things. Perhaps this is less true if one approach is clearly better than others, but as above I do not see any evidence that end-to-end neural is the best way to build NLG systems.