I am excited by the idea of using error annotation to evaluate NLG systems, where domain experts or other knowledgeable people mark up individual errors in generated texts. I think this is usually more meaningful and gives better insights than asking crowdworkers to rate or rank texts, which is how most human evaluations are currently done.
The most meaningful evaluation is to test whether an NLG system actually achieves its communicative goal, e.g. helps people make better decisions or write documents faster. Unfortunately, such "extrinsic" or "task" evaluations are rare in NLP in 2020; we need to see more of them!
The ROUGE metric dominates evaluation of summarisation, and I do not understand why. I am not aware of good evidence that ROUGE predicts utility, and recent work by one of my students shows that character-level edit (Levenshtein) distance against a reference text is a better predictor of utility than ROUGE.
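To make the comparison concrete, here is a minimal sketch of the kind of character-level edit distance involved, computed between a generated text and a reference. The two example texts are invented placeholders, not data from the student's study; a real comparison would also compute ROUGE (e.g. with an off-the-shelf ROUGE library) and correlate both scores with a utility measure.

```python
# Character-level Levenshtein (edit) distance between a generated text and a
# reference, plus a length-normalised version. Example texts are made up.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

generated = "the patient's blood pressure fell sharply overnight"
reference = "the patient's blood pressure dropped sharply during the night"

dist = levenshtein(generated, reference)
print(dist, dist / len(reference))  # raw and length-normalised distance
```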
Some of my PhD students have recently looked at how many mistakes people (professionals, not Turkers) make when they do NLG-like tasks. The number of mistakes is considerably higher than we expected (although still much lower than the number of mistakes made by current neural NLG systems).
I teach an MSc course on Evaluating AI, which several people have asked me about. In this blog post I give an overview of what is in the course. Hopefully this will be useful to people who are interested in learning about (or teaching) evaluation.
The real-world usefulness of NLG systems depends on many different factors, not just the accuracy and fluency of generated texts. We should evaluate the real-world utility of our systems, and check how well existing evaluation techniques (metrics and Turker-based human evaluation) correlate with real-world utility.
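As a minimal sketch of what such a check could look like, the snippet below correlates per-system metric scores with per-system utility measurements. All numbers are invented placeholders; in practice the utility figures would come from an extrinsic study (task success, time saved, etc.).

```python
# Sketch: how well does an automatic metric track measured real-world utility?
# Scores and utility values below are hypothetical examples.

from scipy.stats import spearmanr

metric_scores = [0.42, 0.55, 0.61, 0.48, 0.70]   # e.g. metric score per system
utility       = [0.30, 0.58, 0.52, 0.41, 0.75]   # e.g. task-success rate per system

rho, p_value = spearmanr(metric_scores, utility)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```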
I encourage students to do "exercises" where they critically read an academic paper, looking for problems in its evaluation. This helps develop skills for writing papers as well as reading them. So give it a go!
I’m a strong proponent of human evaluations, but they need to be high quality in order to give meaningful results; a quick/cheap/sloppy human evaluation may not be very useful.
I was impressed by a recent paper by Läubli et al., which experimentally compared the results of different human evaluations in MT (e.g. how results differ between expert and non-expert human raters), in the context of understanding when MT systems are "better" than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!
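As a toy illustration of what such a comparison involves (not the design of the Läubli et al. study), the sketch below asks whether two rater pools reach the same verdict on the same documents. All data are invented.

```python
# Toy sketch (invented data): do expert and non-expert raters reach the same
# verdict on whether the MT output or the human translation is better?

expert_prefers_mt     = [True, False, False, True, False, False]
non_expert_prefers_mt = [True, True,  False, True, True,  False]

def preference_rate(prefs):
    """Fraction of documents on which this rater group prefers the MT output."""
    return sum(prefs) / len(prefs)

print(f"Experts prefer MT on {preference_rate(expert_prefers_mt):.0%} of documents")
print(f"Non-experts prefer MT on {preference_rate(non_expert_prefers_mt):.0%} of documents")
# If the two rates differ substantially, the choice of rater pool changes the
# conclusion about whether the MT system is "better" than the human translator.
```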
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.
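To give a flavour of how error annotations can be turned into a number (this is a simplified sketch, not the exact protocol of the paper), the snippet below takes hypothetical error spans marked by several annotators and keeps those agreed by a majority.

```python
# Minimal sketch: aggregate error annotations on one generated text by
# majority vote across annotators. All spans and categories are hypothetical.

from collections import Counter

annotations = [
    [(10, 14, "number"), (40, 52, "name")],                     # annotator A
    [(10, 14, "number"), (40, 52, "name"), (80, 90, "word")],   # annotator B
    [(10, 14, "number")],                                       # annotator C
]

# Keep a span as an error if more than half of the annotators marked it.
counts = Counter(span for ann in annotations for span in ann)
majority = len(annotations) / 2
agreed_errors = [span for span, n in counts.items() if n > majority]

print(f"{len(agreed_errors)} errors agreed by majority:", agreed_errors)
```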