NLP technology has changed and advanced over the past two decades, but it often seems that NLG evaluation has not. Why is the 18-year old BLEU metric still so dominant?
We’re thinking of organising a shared task on evaluating the accuracy of texts produced by NLG systems. Comments welcome, also let me know if you might participate.
I really liked Grishman’s recent paper on 25 years of research in information extraction, and summarise a few of the key insights here, about relative progress in different areas of NLP, reluctance of researchers to use complex evaluation techniques, and corpus creation vs rule-writing.
I’m just back from INLG 2019 in Tokyo, where I was very happy to see an increased emphasis on evaluation (and other methodological issues), including several papers on improving human evaluations.
Texts produced by NLG systems can be evaluated in terms of accuracy (content is correct), fluency (text is readable), and utility (text is useful). I discuss these three “dimensions” of NLG evaluation.
I’ve been shocked by the fact that many neural NLG researchers dont seem to care that their systems produce texts which contain many factual mistakes and hallucinations. NLG users expect accurate texts, and will not use systems which produce inaccurate texts, not matter how well the texts are written,
Some thoughts on key NLG challenges in explainable AI: evaluation, conceptual alignment, narrative. Comments are welcome!