Although my career has focused on Natural Language Generation (NLG), I have also occasionally done some work in medical informatics. I have always been impressed by the rigour and seriousness of evaluation in medicine, and also by the constant questioning of “accepted wisdom” in evaluation methodology and whether this needs to change. At the same time, I have often been *depressed* by the lack of rigour, seriousness, and questioning in evaluation of NLG (and indeed NLP more generally). Evaluation of NLG is certainly much better than it was when I got my PhD in 1990, but on the other hand it has a long way to go before it reaches the standards of medicine.
Unfortunately, a lot of this difference seems to be a consequence of different attitudes. In medicine, the fundamental attitude is “evaluation is about testing scientific hypotheses, and we want to do this rigorously so that our results are meaningful”. In NLP, on the other hand, the attitude sometimes seems to be “evaluation is a game where we are trying to outscore the competition” and “we need to do this as quickly and cheaply as possible whilst keeping the game interesting and challenging”.
Around 10 years ago, I had a conversation with an academic I know (not someone at Aberdeen) who works in another area of NLP. I told my colleague that I was concerned that the “standard” evaluation methodology in his part of NLP was not very meaningful, and in particular there was little evidence that good evaluation scores correlated with good real-world effectiveness. My colleague told me that he agreed with much of what I said in the abstract, but on the other hand the standard evaluation techniques in his area were (A) quick and cheap to execute, (B) accepted by reviewers and funders, and (C) generally gave excellent scores to his systems. So “of course” he was going to use these techniques; it would be foolish to do anything else. I can't fault my colleague's logic, which was impeccable. But while such attitudes may be logical for individual researchers, they hurt the field as a whole; progress in science is difficult without rigorous hypothesis checking.
Lesson One from Medicine: Experimental Rigour
There has recently been a lot of concern in medicine and psychology because many attempts to replicate (important) previous studies have failed; e.g., when experiments which purported to show that a treatment was effective were rerun by other experimenters, the treatment (in the new study) was not actually effective (or perhaps was much less effective than originally claimed). This matters, because an experiment which cannot be repeated does not tell us much about the world.
In theory, statistical significance is supposed to guarantee replicability; a p value of 0.05, for example, is commonly taken to mean that there is a 5% (0.05) chance that the result is due to noise/chance, and a 95% chance that it is a genuine result which can be replicated. But empirical analyses of replication studies (where other researchers attempt to replicate an important experiment) show that this is not the case. For example, Ioannidis’s analysis of 49 highly-cited medical experiments showed that 9 out of 39 (23%) “significant” findings from randomised controlled trials could not be replicated; for studies with less rigorous experimental design, 5 out of 6 could not be replicated! In other words, a very rigorous experiment with a “significant” result had roughly an 80% chance of being replicated; for less rigorous experiments, the chance of replication was well below 50%. This is by no means an isolated finding; for example, Begley and Ellis report that only 6 out of 53 (11%!!) of landmark experiments in oncology could be replicated. Nor is the problem limited to medicine; an analysis of 100 major experiments in psychology showed that only 36% could be replicated.
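To make the arithmetic behind these figures explicit, here is a small Python sketch which turns the counts quoted above into replication rates. The function name `replication_rate` and the dictionary labels are my own illustrative choices, and the psychology entry assumes that “36%” of 100 experiments means 36 experiments.

```python
def replication_rate(replicated, total):
    """Fraction of findings that were successfully replicated."""
    return replicated / total


# (replicated, total) pairs, taken from the counts quoted in the text.
# Where the text gives the number that FAILED to replicate, we subtract.
findings = {
    "randomised controlled trials": (39 - 9, 39),
    "less rigorous designs": (6 - 5, 6),
    "landmark oncology experiments": (6, 53),
    "major psychology experiments": (36, 100),  # assumes 36% of 100 = 36
}

rates = {name: replication_rate(r, t) for name, (r, t) in findings.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} replicated")
```

None of these rates comes close to the 95% which a naive reading of p = 0.05 would suggest.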
Ioannidis, Begley, and others have analysed what differentiates replicable from non-replicable experiments. The most important differentiator is solid experimental and statistical design and full reporting of results, including negative results; it is also essential to avoid post-hoc tweaking of hypotheses and statistical analyses. None of this is rocket science; mostly it just means following established “best practice” in experimental design and statistical analysis. Partially in response to these findings, the medical research community is trying to fix some of these issues by insisting that experimenters register their experimental and statistical design on a website (such as clinicaltrials.gov) before the experiment starts; this makes it possible to detect post-hoc “tweaking” and unreported negative results.
Rigour in NLP/NLG?
So medical researchers and psychologists realise they have a problem, and the field as a whole has acknowledged the problem and is trying to do something about it. What is the situation in NLP/NLG?
It is depressing, in all honesty, as I mentioned above. What most angers me is the use of automatic evaluation metrics such as BLEU and ROUGE. These are what a medical researcher would call surrogate endpoints. In other words, they are relatively easy-to-compute metrics which are claimed to correlate with or predict the things we actually care about, such as real-world usefulness. If we look at summarisation, for example, directly assessing whether an algorithm/system is useful requires a complex and time-consuming experiment such as the Summac study. Using ROUGE to evaluate a summarisation system is **much** quicker and easier. But using ROUGE to evaluate a system only makes sense if we have evidence that ROUGE correlates with and predicts the result of a proper evaluation such as Summac; in other words, if ROUGE is a surrogate endpoint which predicts the real endpoint which we care about. If we don't have such evidence, then a good ROUGE score doesn't tell us anything about the real-world effectiveness or usefulness of an algorithm or system.
Unfortunately, the evidence that BLEU and ROUGE correlate with or predict endpoints which we care about is pretty weak, as many people (including me) have pointed out. In addition to formal published studies, incidentally, I am often approached informally by people who complain about cases where metrics such as BLEU and ROUGE gave inappropriate results. But despite this evidence, BLEU and ROUGE continue to be heavily used; there is some movement away from them, but it is slow. One of the saddest things for me is that so few summarisation researchers have done proper Summac-like evaluation studies; if I am feeling depressed, I sometimes wonder if there is something like Gresham’s law in operation, with bad evaluations driving out good evaluations…
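As a rough sketch of what “evidence that a surrogate endpoint predicts the real endpoint” could look like in practice, the Python below computes a Spearman rank correlation between metric scores and human-study outcomes for a set of systems. This is only an illustration under my own assumptions: the function names are hypothetical, the numbers are made-up placeholders (not results from any real study), and a proper validation would need many more systems plus a significance analysis.

```python
from math import sqrt


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up placeholder values, one entry per (hypothetical) system.
metric_scores = [0.31, 0.35, 0.28, 0.40, 0.33]   # surrogate endpoint
human_outcomes = [3.9, 3.1, 3.4, 3.6, 4.2]       # real endpoint from a user study

print(f"Spearman correlation: {spearman(metric_scores, human_outcomes):.2f}")
```

A correlation near 1 across many systems would support using the metric as a surrogate; a weak or negative correlation would mean that a high metric score says little about real-world usefulness.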
Lack of Knowledge?
I sometimes wonder if part of the problem is that many NLP/NLG researchers simply don’t know how to do proper evaluations. One advantage of BLEU and ROUGE is that you don’t need to know a lot about evaluation to use them; just download the software and reference texts, ensure your system produces output in the correct format, and “hit enter”. Whereas doing an effectiveness study (or even a simple “do human subjects like my system” study) requires an understanding of experimental design and statistical analysis which many people don’t have, in part because it often is not taught in NLP courses or textbooks.
Hum… For my next blog entry, I will write something practical about “how to do a good NLG evaluation”, rather than pontificate about the state of science…