Texts produced by NLG systems can be evaluated in terms of accuracy (content is correct), fluency (text is readable), and utility (text is useful). I discuss these three “dimensions” of NLG evaluation.
I’ve been shocked by the fact that many neural NLG researchers don’t seem to care that their systems produce texts which contain many factual mistakes and hallucinations. NLG users expect accurate texts, and will not use systems which produce inaccurate texts, no matter how well the texts are written.
Some thoughts on key NLG challenges in explainable AI: evaluation, conceptual alignment, narrative. Comments are welcome!
Unfortunately, I see many students (and indeed other people) make some basic mistakes when evaluating machine learning, for classifiers as well as NLG.
I was recently asked by someone if it was possible to easily determine whether an NLP system was good enough for a specific use case. Currently this is very hard. Making it easy could be a “grand challenge” for evaluation!
In both NLG and MT contexts, deep learning approaches can result in texts which are fluent and readable but also incorrect and misleading. This is a problem whenever accuracy is more important than readability, as is the case in most NLG contexts.
Many neural NLG systems “hallucinate” non-existent or incorrect content. This is a major problem, since such hallucination is unacceptable in many (most?) NLG use cases. Also, BLEU and related metrics do not detect hallucination well, so researchers who rely on such metrics may be misled about the quality of their system.
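To make the point about overlap metrics concrete, here is a minimal sketch in Python. It computes clipped n-gram precision, which is a building block of BLEU but not full BLEU (there is no brevity penalty and no geometric mean over n-gram orders). The function name `ngram_precision` and the example sentences are invented for illustration and are not taken from any real system or dataset.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of hypothesis against a single reference.

    A simplified stand-in for BLEU, used here only for illustration.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


# Hypothetical example sentences.
reference = "the team scored 3 goals in the second half"
wrong_fact = "the team scored 5 goals in the second half"  # hallucinated number
paraphrase = "in the second half the side got 3 goals"     # accurate, reworded

for name, text in [("wrong_fact", wrong_fact), ("paraphrase", paraphrase)]:
    unigram = round(ngram_precision(text, reference, 1), 3)
    bigram = round(ngram_precision(text, reference, 2), 3)
    print(name, unigram, bigram)
# prints:
# wrong_fact 0.889 0.75
# paraphrase 0.778 0.5
```

In this toy case the text with the wrong number scores higher than the accurate paraphrase, because the metric counts overlapping words and not whether the content is correct.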