Currently, most NLG evaluations are based either on subjective human judgements (e.g., ratings on a Likert scale) or on similarity to a reference text as computed by some metric. Wouldn’t it be nice if we could define objective criteria for text quality, which could be measured without relying on subjective opinions or reference texts? These criteria could then be used to define gold-standard evaluation methodologies for NLG systems.
Of course, once we have these gold-standard criteria and methodologies, we could compute how well subjective human judgements and reference-based metrics correlate with the “gold standard” in different contexts.
Last year I wrote a blog post saying that there are three general dimensions for evaluating texts produced by NLG systems: accuracy, fluency, and utility. As a starting point, let’s see if we can define these dimensions objectively.
Utility measures how useful a generated text is. We should be able to measure this objectively using extrinsic or task-based evaluation. In other words, we get people to use the system for real, and measure how much it improves their decision making (Portet et al 2009), changes their behaviour (Braun et al 2018), enhances their learning (diEugenio et al 2002), etc.
Task-based (extrinsic) evaluations in NLG are relatively rare, perhaps because they are expensive and time-consuming. They need to be carefully designed and executed, and ideally should be replicated. Nevertheless, I think they are the best candidate for objective gold-standard evaluations of the utility of NLG texts.
I believe it should be possible to define accuracy objectively via the number and severity of mistakes in a text. In other words, we can compute accuracy by finding all of the mistakes in a text, perhaps classifying how severe each mistake is, and then (if we want a single accuracy score for a text) combining this information into one score.
This is a much newer idea than using task-based evaluations to measure utility, and many aspects of this approach need to be worked out, including:
- What counts as a mistake? Is it just factual errors, or should we also consider cases where the text misleads readers because of incorrect contextual inferences? I think anything which misleads the reader should be considered as a mistake, but there will always be difficult boundary/edge cases.
- How severe are mistakes? If we want to weigh mistakes by severity, how could we do this in an objective manner? Or perhaps this is too difficult and we should just count the number of mistakes?
- How do we find mistakes? What is the best protocol for finding mistakes in a text? Can automatic techniques (based on fact-checking?) supplement human annotation of mistakes?
So lots of questions and issues need to be resolved, but I think in principle defining accuracy by identifying mistakes is workable, and would provide an objective measure of accuracy.
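To make the idea concrete, here is a minimal sketch of how annotated mistakes might be combined into a single accuracy score. The severity categories, weights, and scaling are all illustrative assumptions of mine, not an established standard; in practice these choices are exactly the open questions listed above.

```python
# Hypothetical severity categories and weights (illustrative only).
SEVERITY_WEIGHT = {"minor": 1, "major": 3, "critical": 10}

def accuracy_score(mistakes, word_count):
    """Combine a list of severity-labelled mistakes into one score.

    Computes a severity-weighted mistake rate per 100 words, then
    inverts it so that 1.0 means a mistake-free text; the score
    floors at 0.0. The /10 scaling factor is an arbitrary choice.
    """
    penalty = sum(SEVERITY_WEIGHT[m] for m in mistakes)
    rate = 100 * penalty / word_count  # weighted mistakes per 100 words
    return max(0.0, 1.0 - rate / 10)

# Example: a 200-word text with two minor mistakes and one major one.
print(accuracy_score(["minor", "minor", "major"], 200))  # 0.75
```

If we decide that severity weighting is too difficult to do objectively, the same sketch degrades gracefully: set every weight to 1 and the score simply reflects the raw mistake count.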
I find it much harder to define fluency objectively. Suggestions are welcome!
One approach could be to define fluency via reading speed and comprehension (or some other psycholinguistic factor such as memorability). In other words, we assume that a text is fluent if it is quick to read and easy to comprehend.
However, one problem with this approach is that reading speed varies enormously depending on the reader, context, and domain knowledge. For example, I read a text much more quickly if I skim-read it instead of reading it carefully; I also read more quickly if I am well-rested and not distracted. On the other hand, I read legal documents more slowly than a lawyer would, since I am not used to legal language and concepts. This variability makes it difficult to use reading speed (etc) to define fluency.
I also suspect that fluency may depend on other factors beyond reading speed and comprehension, such as how well-written a text is. From an applied perspective, if I ask NLG users what they would like to see in a generated text from a language perspective, they rarely talk about reading speed. They do want texts to be easy to understand, but they also want texts to be good narratives or stories. So perhaps criteria for fluency should include narrative quality; but I do not know how to objectively measure narrative quality.
I strongly believe that we need strong “gold standard” evaluations of NLG, and I would like these to be based on objective criteria of generated texts if possible. If I look at where we are today, we know in principle how to measure utility objectively, via task-based extrinsic evaluations, although we rarely do this in practice; hopefully this will change in the future. We have ideas on how to measure accuracy objectively, but there are many details and issues which need to be worked out. It is less clear how to measure fluency objectively; this remains an open question, and suggestions are welcome!