When we evaluate a text produced by an NLG system, there are in general three dimensions that we can look at:
- Linguistic quality, often called fluency, clarity, or readability. Is the text easy to read and understand?
- Accuracy, sometimes called correctness. Is everything in the text true and derivable from the input data?
- Utility, sometimes called usefulness or helpfulness. Does the text help a user do a task, or otherwise achieve its communicative goal?
Of course these dimensions are not independent. For example, a text with poor linguistic quality will not be useful, and may have unknown accuracy (if we don't understand the text, we cannot check whether it is accurate). We can also replace the above fairly coarse criteria with finer-grained ones, e.g. is the text easy for a 10-year-old to read? But in most contexts, at least in my experience, the above dimensions work well. I discuss them below.
Linguistic quality measures how well written a text is, ignoring content issues. It's usually measured by asking subjects to read the text and rate its linguistic quality on a Likert scale, or to rank a set of texts in order of linguistic quality. I.e., we ask human subjects to judge a text using their intrinsic notion of linguistic quality.
In principle, we could use psycholinguistic means to measure linguistic quality, such as measuring comprehension, reading speed, or recall (memory). These tests are hard to do, though, because of the need for careful control. For example, if we measure reading speed, we need to ensure that subjects are reading texts with similar levels of thoroughness (i.e., not skim-reading text 1 and carefully reading text 2). I have done such experiments (Williams and Reiter 2008), and in most circumstances I think they are overkill.
In an NLG context, automatic metrics such as BLEU are better at measuring linguistic quality than the other dimensions (Reiter and Belz 2009), although even here I personally would still not trust BLEU.
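To make concrete what BLEU is (and isn't) measuring, here is a stripped-down sketch: a geometric mean of clipped n-gram precisions against a reference text, scaled by a brevity penalty. This is a simplified illustration, not the official metric (real BLEU uses n-grams up to 4, multiple references, corpus-level counts, and smoothing); the example sentences are invented. Note that it measures surface overlap only, which is why it says nothing direct about accuracy or utility.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU-like score: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(1, sum(hyp_counts.values())))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * geo_mean

ref = "Team A beat Team B 3-1 at home"
print(simple_bleu(ref, "Team A beat Team B 3-1 at home"))  # identical: 1.0
print(simple_bleu(ref, "Team B lost away"))                # low overlap
```

The second hypothesis scores much lower purely because it shares few n-grams with the reference, regardless of whether its content is correct.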
It is dangerous to focus on linguistic quality and ignore other dimensions such as accuracy. This is partly because users care more about utility and accuracy than they do about linguistic quality, but also because silly baselines can do very well at fluency-only evaluations. For example, if you are generating a story about a sporting match and have a corpus of human-written sports stories, then a very effective strategy from a fluency-only perspective is to just retrieve a story about a previous match which is similar in data terms to the current match, and update team and player names and perhaps scores. The result will be a very fluent and well-written sports story which looks plausible but is completely bogus in content terms.
Accuracy in NLG means that at a content level, everything in the generated text is true and derivable from input data or general world knowledge. Note that this is a bit different from accuracy in Machine Translation, which measures whether a translated text has the same content as the source text. The difference is that in an NLG context, it's usually not possible for a generated text to communicate *everything* in the input data set; some selection is needed. So accuracy in NLG is based on precision (we want everything in the generated text to be present in or derivable from the input data), but not recall (usually only a small fraction of input data is in the text). Accuracy in MT, in contrast, involves both precision and recall; the translated text should ideally communicate everything in the source text, and nothing else.
Another subtle aspect of accuracy in NLG is that if the text says something which is true but not derivable from the input data (or world knowledge), then we count this as inaccurate even if it happens to be true in this instance. For example, as pointed out by Wiseman et al 2017, if a summary of a basketball game says that this was a home game for Team A (ie, played in Team A’s stadium), but the input data does not give the location of the game, then we should count this as incorrect even if it turns out that (by pure luck) this particular match was in fact played in Team A’s stadium.
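The precision-based view can be made concrete by representing facts as sets and intersecting them. This is only a toy sketch: the facts below are invented (entity, attribute, value) triples, and in practice deciding which facts a text expresses is exactly the hard part of fact-checking. The venue triple mirrors the Wiseman et al example: it is not supported by the input data, so it lowers precision even if it happens to be true.

```python
# Hypothetical input data set, as (entity, attribute, value) triples.
input_data = {
    ("TeamA", "score", 3),
    ("TeamB", "score", 1),
    ("match", "date", "2020-05-01"),
    ("TeamA", "shots", 12),
}
# Facts expressed in the generated text. The venue fact is not in the
# input data, so it counts as inaccurate even if it happens to be true.
text_facts = {
    ("TeamA", "score", 3),
    ("TeamB", "score", 1),
    ("match", "venue", "TeamA stadium"),
}

supported = text_facts & input_data
precision = len(supported) / len(text_facts)  # NLG accuracy ~ precision
recall = len(supported) / len(input_data)     # low recall is fine in NLG
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.67, recall = 0.50
```

Here the text is only two-thirds accurate, while the low recall against the full input is expected and unproblematic.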
Measuring accuracy is hard, and I've discussed this in detail elsewhere. Basically you need to get someone who understands the domain to carefully "fact-check" the generated text, against the input data as well as ground truth. Other ideas are being discussed in the community, but at the moment the only technique for measuring accuracy which I trust is the above-mentioned fact-checking by a knowledgeable person.
As with linguistic quality, accuracy-only evaluations are dubious because it's hard to beat silly baselines such as using templates to accurately and literally output ten facts from the input data.
Utility measures how useful a text is, which means that it is defined relative to some task or communicative goal. It is of course influenced by linguistic quality and accuracy, but it is also based on a very important additional element, which is whether the text communicates key insights. For example, if we have an input data set of 100,000 elements and derived insights, and we know that ten of these elements are essential to the user's task, then utility is strongly influenced by whether the generated text includes these essential elements. I.e., from a content perspective, utility is strongly dependent on recall against the "key insight" data set (the text should ideally mention all of the important insights) as well as precision against the complete input data set (everything said should be true).
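The precision/recall split described above can be sketched as a toy calculation. All the numbers here are invented stand-ins: integers play the role of data elements and insights, and real systems would of course need actual content comparison rather than set membership.

```python
# Invented stand-ins: a large input data set, the handful of insights
# essential to the user's task, and what the generated text says.
input_data = set(range(100_000))
key_insights = {7, 42, 99_000}     # the ten-ish essential elements
text_content = {7, 42, 123, 456}   # content of the generated text

# Accuracy side: precision against the full input data set.
precision = len(text_content & input_data) / len(text_content)
# Utility side (content): recall against the key insights.
insight_recall = len(text_content & key_insights) / len(key_insights)
print(f"precision = {precision:.2f}, insight recall = {insight_recall:.2f}")
# prints: precision = 1.00, insight recall = 0.67
```

This text is perfectly accurate (everything it says is in the input data) yet misses one of the three key insights, which is exactly the gap that accuracy-only evaluation cannot see.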
Utility can be measured directly; for example, we can measure how well texts improve decision making (Portet et al 2009) or change behaviour (Braun et al 2018). Such studies are really useful and indeed (at least from my perspective) are the "gold standard" of NLG evaluation. However they are also expensive and time-consuming. So it's more common to measure utility by asking real-world users (not Turkers or crowdsourced workers!!) to rate the utility of a text on a Likert scale (Hunter et al 2012). I think this can work as long as the subjects are genuine users and take the time to do this task carefully, but obviously it's not as good as directly measuring utility.
I am not aware of any automatic metric which provides a reliable measure of utility.
While linguistic-only or accuracy-only evaluations do not make sense, a utility-only evaluation is acceptable. After all, if the NLG system is producing useful texts, then it works! But regardless, I would definitely suggest measuring linguistic quality and accuracy as well as utility, because I think having all three measures gives better insight as to what is working and what is not working in the NLG system.
A Fourth Dimension of Quality?
It may be worth directly measuring recall against "key insights" in a generated text (as mentioned above), either manually (perhaps similar to the pyramid technique used in text summarisation?) or indeed via metrics which parse generated texts and compare against a "key insight" reference set (I'm beginning to see papers along these lines). The problem with measuring this is that which insights are important depends on what the user is trying to do, which is why this aspect of quality has traditionally been incorporated into utility judgements. But there may be contexts where it makes sense to try to measure recall of key insights directly, especially in areas such as automatic journalism where a generated text may be read by many people. At any rate, if this is a useful thing to do, then we will have a fourth dimension for evaluating NLG quality! Now we just need a name for this dimension! I'm really bad at names, but maybe something like "insightfulness"?