The CSL journal has just published a paper, “Evaluating factual accuracy in complex data-to-text”, which summarises our work in this area. I strongly recommend the paper to anyone who is interested in evaluating the accuracy of texts produced by neural NLG systems.
An example from MedPaLM highlighted to me that generated texts can contain information which is factually accurate but still not appropriate, because (in this case) of its negative psychological impact. There are other such cases, and we should ensure that our evaluation criteria are sensitive to them.
I was very impressed by a recent paper that compared prompting-based MT to MT based on trained models. The results are very interesting: prompting-based MT generates fluent texts, which nonetheless have accuracy problems. The paper itself is also an excellent example of a high-quality NLP evaluation, and I recommend it to anyone who wants to do good NLP evaluations.
I am excited by the idea of using error annotation to evaluate NLG systems, where domain experts or other knowledgeable people mark up individual errors in generated texts. I think this is usually more meaningful and gives better insights than asking crowdworkers to rate or rank texts, which is how most human evaluations are currently done.
The most meaningful evaluation is when we test whether an NLG system actually achieves its communicative goal, e.g. helps people make better decisions or write documents faster. Unfortunately such “extrinsic” or “task” evaluation is rare in NLP in 2022. We need to see more such evaluations!
The ROUGE metric dominates evaluation of summarisation, and I do not understand why. I am not aware of good evidence that ROUGE predicts utility, and recent work by one of my students shows that character-level edit (Levenshtein) distance against a reference text is a better predictor of utility than ROUGE.
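For readers who have not met it, character-level edit distance is simple to compute. The following is a minimal sketch in Python, not the implementation used in the work mentioned above; the function names and the length normalisation are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + substitution_cost,   # substitute or match
            ))
        previous = current
    return previous[-1]


def normalised_distance(generated: str, reference: str) -> float:
    """Edit distance divided by the longer length, giving a value in [0, 1].
    This normalisation is an illustrative assumption, not taken from the post."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(generated, reference) / longest
```

A lower distance means the generated text is closer to the reference, so a system output would need fewer character edits to match it.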
Some of my PhD students have recently looked at how many mistakes people (professionals, not Turkers) make when they do NLG-like tasks. The number of mistakes is considerably higher than we expected (although still much lower than the number of mistakes made by current neural NLG systems).
I teach an MSc course on Evaluating AI, which several people have asked me about. In this blog I give an overview of what is in the course. Hopefully this will be useful to people who are interested in learning about (or teaching) evaluation.
The real-world usefulness of NLG systems depends on many different factors, not just the accuracy and fluency of generated texts. We should evaluate the real-world utility of our systems, and check how well existing evaluation techniques (metrics and Turker-based human evaluation) correlate with real-world utility.
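One common way to check this kind of correlation is a rank correlation between per-text metric scores and a measured utility outcome. The sketch below computes Spearman's rank correlation using only the Python standard library. It is an illustration under assumptions: the function names are hypothetical, and the choice of Spearman rather than another correlation measure is not taken from the post.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return covariance / sqrt(var_x * var_y)


def spearman(metric_scores, utility_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(utility_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(utility_scores))
```

Here `metric_scores` would hold one score per generated text from a metric or human rating, and `utility_scores` would hold the matching real-world measurement, for example the time a user needed to post-edit that text.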
I encourage students to have “exercises” where they critically read an academic paper, looking for problems in evaluations. This will help develop skills for writing as well as reading papers. So give it a go!