I’ve written many blog posts complaining about the use of BLEU and other automatic evaluation metrics in NLG; this is also one of the themes of my recent structured review of the validity of BLEU. Despite these concerns, I also recognise the appeal of automatic evaluation. So I present here some thoughts on what I think would be the best way to automatically evaluate NLG systems. I’m looking to the future, ie the tools I’d like to see being used in 5-10 years’ time. Unfortunately, I cannot recommend or support any existing metric for NLG evaluation.
As always, I am heavily influenced by the medical perspective on surrogate endpoints, which are things that are relatively easy to measure and that predict the outcomes we actually care about (eg, mortality). Two good presentations of this are Biomarkers and Surrogate Endpoints and the “Surrogate Endpoint” section of How to read a paper. Both present criteria for a good surrogate endpoint. The first two bullets from How to read a paper are:
- The surrogate end point should be reliable, reproducible, clinically available, easily quantifiable, affordable, and show a “dose-response” effect (the higher the level of the surrogate end point, the greater the probability of disease)
- It should be a true predictor of disease (or risk of disease) and not merely express exposure to a covariable. The relation between the surrogate end point and the disease should have a biologically plausible explanation
I think NLP automatic evaluation metrics (which are essentially a type of surrogate endpoint) should also be expected to meet these criteria. Ie, metrics should
- be reliable predictors of the outcomes we care about (eg, real-world utility)
- be reproducible by other researchers
- be available to all researchers, preferably as open-source software
- be quantifiable and affordable (usually this is not a problem)
- be useful for relative comparisons (if system A gets a higher metric score than system B, then A should have better real-world utility than B)
- have a theoretically plausible explanation for why the metric predicts the outcome we care about.
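The relative-comparison criterion can be checked empirically once both metric scores and real-world outcome measurements are available for a set of systems. The following is only a minimal sketch: the function name `pairwise_agreement` and the scores are made up for illustration, and a real validation study would also need significance testing and many more systems.

```python
from itertools import combinations


def pairwise_agreement(metric_scores, outcome_scores):
    """Fraction of system pairs that the metric and the outcome rank the same way.

    Both arguments map a system name to a score (higher is better).
    Pairs tied on either score are skipped because they carry no ranking information.
    """
    systems = sorted(set(metric_scores) & set(outcome_scores))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_diff = metric_scores[a] - metric_scores[b]
        outcome_diff = outcome_scores[a] - outcome_scores[b]
        if metric_diff == 0 or outcome_diff == 0:
            continue
        total += 1
        if (metric_diff > 0) == (outcome_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores for three systems (illustration only, not real results).
metric = {"A": 0.42, "B": 0.35, "C": 0.50}
utility = {"A": 4.1, "B": 3.2, "C": 3.9}
print(pairwise_agreement(metric, utility))
```

Here the metric and the outcome disagree only on the pair (A, C), so two of the three pairs agree. A metric that satisfies the relative-comparison criterion should score close to 1.0 on this kind of check.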
To follow up on the last point, one reason why a good theoretical explanation is important is generalisability. Validation studies cannot look at all possible contexts and circumstances; they are experiments carried out on specific data sets with specific protocols. If we have a solid theoretical case for a surrogate, that gives us more confidence that the surrogate is likely to work in additional contexts which have not been explicitly tested by validation studies. Note that one consequence of requiring theoretical plausibility is that it is not acceptable to simply train a neural network (or whatever) to predict text or system quality, since such a model gives no explanation of why its predictions should hold outside the data it was trained and validated on.
So if we look at BLEU in NLG from the perspective of the above criteria, we see that
- Reliable predictor of human evaluation: No (see my paper)
- Reproducible: Yes
- Available: Yes
- Quantifiable and affordable: Yes
- Relative comparisons: No (especially when comparing rule-based and statistical/neural systems).
- Theoretical plausibility: No (why should shared ngrams with a reference text predict human evaluations?)
Can we do better than this?
Different metrics for different characteristics
I think one way forward is to look for a set of metrics which evaluate different aspects of a generated text, instead of insisting on a single metric. We evaluate information retrieval systems by precision and recall, which measure different things, so why not similarly measure different aspects of an NLG system? If someone insists on a single metric, we can combine these different aspects (like the F1 metric in information retrieval), but my focus is on the individual components.
When I evaluate an NLG system using human ratings, I usually ask people to separately assess the following aspects:
- understandability: Is the text easy to read? This measures the quality of linguistic processing in the NLG system.
- accuracy: Is everything in the text true? This (partially) measures the quality of data processing (analytics and interpretation) in the system.
- usefulness: Is the text useful? A useful text needs to be understandable and accurate, but it also needs to communicate the right information and relationships; this depends on the quality of data interpretation and document planning.
Metrics for understandability
Perhaps we can partially assess the understandability of a text by using standard proofing tools, such as spelling and grammar checkers, as well as readability estimators such as Flesch-Kincaid. Indeed, NLG researchers are starting to investigate this (Novikova et al 2017), although results to date have not been impressive. Part of the problem is that proofing tools designed to detect common mistakes by human writers do not do a good job of detecting common mistakes by NLG systems.
I think this direction makes sense, but we need better tools to assess the linguistic quality of texts. There is of course a huge psycholinguistic literature on readability, and I like to think that this could be utilised to develop a robust and theoretically-motivated algorithm to assess understandability, one which meets the above “medical” criteria.
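To make the readability-estimator idea concrete, here is a minimal sketch of the Flesch-Kincaid grade level, which is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sentence splitter and the vowel-group syllable counter below are crude assumptions made for illustration; real tools use pronunciation dictionaries or better heuristics.

```python
import re


def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level using naive sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


simple = "It will rain today."
dense = "Precipitation probability increases considerably overnight."
print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(dense))
```

Note that the formula only looks at word and sentence length, so it cannot detect the kinds of errors (odd word choice, broken agreement, incoherent ordering) that NLG systems tend to make.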
Metrics for accuracy
In theory, we should be able to check the semantics of generated text against the semantics of the input data and world knowledge. That is, we should be able to get a semantic representation of the NLG text (either directly from the semantic representation used by the NLG system, or indirectly by parsing the generated text) and then check whether this semantics is implied (in the logical sense) by the input data and background world knowledge.
Of course, doing this in practice is very hard. Many NLG systems don’t use logical semantic forms, we don’t have good KBs for world knowledge, and checking whether A implies B is a tough algorithmic problem which is NP-hard or worse (depending on the logic used).
So it isn’t going to be easy to get this to work… but it would be interesting to try and see what happens!
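As a toy illustration of the implication check, the sketch below restricts everything to propositional Horn rules, a deliberately simple fragment in which entailment can be decided by forward chaining. All proposition names, rules, and claims are hypothetical, and the sketch assumes the hard parts (extracting claims from the text and building the knowledge base) have already been done.

```python
def forward_chain(facts, rules):
    """Return every proposition derivable from facts using Horn rules.

    Each rule is a pair (premises, conclusion): if all premises are known,
    the conclusion is added. Repeats until nothing new can be derived.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


def unsupported_claims(claims, facts, rules):
    """Claims made by the text that are not implied by the data plus rules."""
    known = forward_chain(facts, rules)
    return [c for c in claims if c not in known]


# Hypothetical weather example: input data, background knowledge, text claims.
facts = {"rain_prob_high", "temp_below_zero"}
rules = [
    (("rain_prob_high", "temp_below_zero"), "snow_likely"),
    (("snow_likely",), "roads_slippery"),
]
claims = ["snow_likely", "roads_slippery", "sunny"]
print(unsupported_claims(claims, facts, rules))  # ['sunny']
```

An accuracy score could then be the fraction of claims that are supported. Richer logics would be needed for real texts, which is where the computational difficulty comes in.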
Metrics for utility
Automatically assessing utility is the hardest challenge. Partially this reflects the fact that document planning (and associated data interpretation) is the least well understood NLG task. It also reflects the fact that utility depends on task and context; eg, an NLG weather forecast which really helps someone plan a picnic may not be of much use to a gardener.
Perhaps this is one area where we can look at the overlap between generated texts and reference texts. But I suggest measuring this overlap at the conceptual level, not at the level of ngrams, since choosing the right content is the aspect that is unique to utility (linguistic quality should be assessed by the understandability metric).
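One possible reading of concept-level overlap is precision and recall over sets of concepts rather than word sequences. The sketch below assumes that some earlier step (not shown, and itself a hard problem) has mapped each text to a set of concept identifiers; the identifiers and the function name `concept_overlap` are hypothetical.

```python
def concept_overlap(generated_concepts, reference_concepts):
    """Precision and recall of generated concepts against reference concepts.

    Precision: share of generated concepts that also appear in the reference.
    Recall: share of reference concepts that the generated text covers.
    """
    generated = set(generated_concepts)
    reference = set(reference_concepts)
    shared = generated & reference
    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical concept sets for a weather forecast (illustration only).
generated = {"rain_afternoon", "wind_strong", "uv_low"}
reference = {"rain_afternoon", "wind_strong", "frost_overnight", "temp_mild"}
precision, recall = concept_overlap(generated, reference)
print(precision, recall)
```

Because utility depends on task and context, the reference concept set would need to be chosen for a specific audience: a picnic planner and a gardener would have different reference sets for the same weather data.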
This is all very speculative, though!
If we are going to automatically evaluate NLG systems, we should aim for techniques which fit the medical criteria for surrogate endpoints, including being theoretically plausible. I suspect this will be easier to achieve if we separately evaluate understandability, accuracy, and utility. Researchers are already looking at using proofing tools to evaluate understandability; we have a ways to go before we have something which is truly useful, but at least we have started the journey. Accuracy is harder to measure; we can think of ways to do this in theory, but getting this to work in practice will be a major challenge. But it would be good to make the attempt! Utility is hardest of all to measure, in part because it is task and context dependent.
In short, this is a tough challenge. But probably a worthwhile one!