I was shocked when a PhD student recently told me that he thought he had to focus on end-to-end neural approaches, because these dominate the conferences he wants to publish in. I’m all for research in end-to-end neural approaches, but fixating on them to the exclusion of everything else is a mistake, especially since end-to-end neural approaches do not currently work very well.
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.
Accuracy errors in NLG texts go far beyond simple factual mistakes; for example, they also include misleading use of words and incorrect context/discourse inferences. All of these types of errors are unacceptable in most data-to-text NLG use cases.
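As a rough illustration of what marked-up errors could look like once they are collected, here is a minimal Python sketch. The category labels, the `ErrorAnnotation` class, the `summarise` function and the example text are all hypothetical; they are loosely based on the error types mentioned above and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category labels, loosely based on the error types named above.
CATEGORIES = {"factual", "misleading_word", "context_inference"}


@dataclass(frozen=True)
class ErrorAnnotation:
    start: int     # character offset where the marked span begins
    end: int       # character offset just past the marked span
    category: str  # one of CATEGORIES


def summarise(text: str, annotations: list[ErrorAnnotation]) -> dict:
    """Count marked errors in total, per category, and per 100 words."""
    for a in annotations:
        if a.category not in CATEGORIES:
            raise ValueError(f"unknown category: {a.category}")
        if not (0 <= a.start < a.end <= len(text)):
            raise ValueError(f"span out of range: {a.start}-{a.end}")
    counts = Counter(a.category for a in annotations)
    n_words = len(text.split())
    total = len(annotations)
    rate = 100 * total / n_words if n_words else 0.0
    return {
        "total": total,
        "by_category": dict(counts),
        "errors_per_100_words": rate,
    }


def mark(text: str, phrase: str, category: str) -> ErrorAnnotation:
    """Build an annotation for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return ErrorAnnotation(start, start + len(phrase), category)


# Invented example: an annotator marks a wrong day and a misleading verb.
text = "Rain is expected on Monday. Temperatures will soar to 12C."
annotations = [
    mark(text, "Monday", "factual"),
    mark(text, "soar", "misleading_word"),
]
summary = summarise(text, annotations)
print(summary)
```

Storing character spans rather than bare counts keeps each judgement traceable to the words the annotator marked, which should make it easier to compare annotators or revisit disputed cases.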
We’re thinking of organising a shared task on evaluating the accuracy of texts produced by NLG systems. Comments are welcome; also, let me know if you might participate.
I’ve been shocked by the fact that many neural NLG researchers don’t seem to care that their systems produce texts which contain many factual mistakes and hallucinations. NLG users expect accurate texts, and will not use systems which produce inaccurate texts, no matter how well the texts are written.