A few months ago, Sebastian Gehrmann and his colleagues published an excellent review of evaluation in text generation (Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text). This is by far the best survey of NLG evaluation I have seen to date, and I strongly recommend it to anyone interested in the topic! Sebastian also gave an excellent SICSA/SIGGEN webinar on evaluation.
One surprising thing about the survey, though, is that very little is said about extrinsic evaluation, i.e., evaluation which directly assesses whether the NLG system achieves its communicative goal. For example, if the goal of the system is decision support, does using it actually lead to better decisions? I asked Sebastian about this, and he said this was because he and his colleagues found very few papers on extrinsic evaluation of NLG systems. That is, the survey was supposed to summarise existing work, and in this case there wasn't much existing work to summarise…
I think this is a huge gap! If we genuinely want to know how effective our NLG systems are, we need to directly assess how successful they are at helping people or otherwise achieving their communicative goal. This point was strongly brought home to me last week when I talked to a research colleague from Aberdeen’s Medical School, who was surprised (and not in a positive way) at how little medical NLG research was evaluated based on real-world clinical or cost/productivity outcomes.
Different types of extrinsic evaluation
I wrote about the design of extrinsic evaluations in an earlier blog. Below are some high-level additional comments.
There are many types of extrinsic evaluation. At their best, they carefully measure real-world outcomes in "innovation" and "baseline" groups, and test whether there is a statistically significant difference. This was done, for example, by Mani et al (2002) for text summarisation, by Reiter et al (2003) for a smoking-cessation system, and by Di Eugenio et al (2002) for a tutoring system. The dates are telling: NLG researchers seemed keen on extrinsic evaluation 20 years ago, but (as Gehrmann pointed out to me) are much less interested in this in 2022.
We can also do extrinsic evaluations in artificial or simulated contexts; this can be especially useful in medical domains where real-world experiments have ethical risks. For example, Portet et al (2009) evaluated the impact of an NLG tool on clinical decision making in an artificial context, and Moramarco et al (2022) evaluated the impact of a text-summarisation tool on clinical report writing, again in an artificial context.
It's also possible to do extrinsic evaluations without a baseline group. For example, Braun et al (2018) evaluated a driving behaviour change system by measuring how users of the system changed their behaviour over time.
Comparing real-world outcomes against a control/baseline system is the ideal. But I welcome all kinds of extrinsic evaluation: a less-than-ideal extrinsic evaluation is much better than no extrinsic evaluation!
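To make the group-comparison design concrete, here is a minimal sketch of the statistical step: comparing an outcome measure between an "innovation" group (who used the NLG system) and a "baseline" group, using Welch's two-sample t-test. The outcome scores below are hypothetical illustrative data, not results from any of the studies mentioned above.

```python
# Sketch of comparing a real-world outcome measure between two groups,
# as in a controlled extrinsic evaluation. Scores are hypothetical.
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom
    (does not assume equal variances in the two groups)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical decision-quality scores (higher = better decisions)
innovation = [78, 85, 81, 90, 74, 88, 83, 79]
baseline   = [70, 72, 80, 68, 75, 71, 77, 69]

t, df = welch_t(innovation, baseline)
print(f"t = {t:.2f}, df = {df:.1f}")
# Convert t and df to a p-value with a t-distribution table or scipy.stats
```

In a real evaluation the hard part is not this calculation but everything around it: choosing an outcome measure that reflects the communicative goal, randomising participants properly, and getting enough of them for adequate statistical power.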
Probably the biggest challenge in extrinsic evaluations is that they are expensive and time-consuming. But the same was true 20 years ago, so it seems bizarre that the NLG/NLP field is doing less extrinsic evaluation than it did in 2002, considering how much larger the field is in 2022. It may be a coincidence, but I note that the decline of extrinsic evaluation after 2003 coincided with the rise of metric-based evaluation (BLEU was introduced in 2002 and ROUGE in 2003).
Incidentally, what really disappoints me about metrics is that they are not validated against high-quality extrinsic evaluations. Not everyone is going to do an extrinsic evaluation, but we could (and should!) insist that metrics are only used if they correlate well with careful extrinsic evaluations, as opposed to the current practice of ignoring the "inconvenient" fact that metrics such as ROUGE correlate very poorly with real-world utility.
Perhaps another challenge is that extrinsic evaluations need to be very carefully designed and executed in order to produce meaningful results. I suspect that experimental design and execution skills among NLP/NLG researchers may be weaker in 2022 than in 2002, again largely because so many people in 2022 equate evaluation with running a script to compute metrics.
I’m happy to help!
If anyone who reads this is thinking of doing an extrinsic evaluation and would like some advice and help, feel free to contact me! I suspect I’ve done more such evaluations than most NLP/NLG researchers, and I’m happy to help and encourage other researchers to go down this path.