Recently an MSc student, who is doing a project involving text summarisation, showed me ROUGE scores for some popular Huggingface summarisation models on the data set of interest. I pointed out to him that all of the summaries he showed me were deeply flawed (eg lots of hallucination, which is unacceptable in the target use case), but he struggled with this because the ROUGE scores were good. All the summarisation papers he had read used ROUGE for evaluation, so wasn't he just following established best practice when he did likewise?
The student is absolutely right that ROUGE dominates summarisation research. Gehrmann et al (2022) report that “100% of papers introducing new summarization models at *CL conferences in 2021 use ROUGE and 69% use only ROUGE.” Qualitatively, I have also been told by summarisation researchers that it is almost impossible to publish a new summarisation model if it has poor ROUGE scores, regardless of other evidence of effectiveness such as human evaluations. Other approaches to summarisation evaluation, such as Pyramid, seem to have faded into obscurity; ROUGE scores seem to be all that summarisation researchers care about.
This is a striking contrast with how machine translation researchers regard the BLEU metric. In recent years we have seen many careful analyses of the validity of BLEU in MT (eg Mathur et al 2020), many (better) alternatives proposed (discussed in Kocmi et al 2021), and also a lot of work on human evaluation in MT (eg Freitag et al 2021). Kocmi et al explicitly argue that overuse of BLEU has negatively impacted MT research, and my sense is that the MT community realises that it needs to move on, and indeed to use a variety of evaluation techniques. From a personal perspective, I have written many papers (eg Reiter 2018) and blogs (eg Why doesn't BLEU work for NLG?) criticising the overuse of BLEU, and I think other researchers have been interested and receptive.
Does ROUGE mean anything?
Given the dominance of ROUGE in summarisation, is there solid evidence that ROUGE is a good predictor of real-world utility? Well, if such evidence exists, I am not aware of it. Certainly ROUGE was a poor predictor of utility in the MSc project described above, and the limited previous work I am aware of (eg Dorr et al 2005) did not find good correlations between ROUGE scores and extrinsic utility measures.
Francesco Moramarco, one of my PhD students, is working on evaluation of computer-generated summaries of doctor-patient consultations, in a context where doctors post-edit the summaries (ie manually fix mistakes in them) before they are released. In a forthcoming ACL paper (arxiv link), Fran and his colleagues carry out a task-based evaluation of their summarisation systems, focusing on time needed for post-editing, and also on number of mistakes (hallucinations and omissions) in the summaries; post-edit time and mistakes are highly correlated, which is not surprising. They then explore how well metrics, including seven variants of ROUGE, correlate with these extrinsic task-based utility measures.
What they find is that simple character-level edit distance (Levenshtein distance) is a better predictor of mistakes and post-edit time than any of the ROUGE variants! That is, if we want to evaluate the quality of a summary against a reference text, we're better off throwing out ROUGE and instead just measuring edit distance. This is a pretty striking finding, considering the above-mentioned dominance of ROUGE in summarisation research.
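To make the metric concrete, here is a minimal sketch of character-level Levenshtein distance, computed with standard dynamic programming. The normalisation by reference length is my own assumption for making scores comparable across texts of different sizes; it is not necessarily the exact setup used in Fran's paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between strings a and b,
    using the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

def normalised_edit_distance(summary: str, reference: str) -> float:
    """Edit distance scaled by reference length (an illustrative choice,
    not necessarily the normalisation used in the paper)."""
    return levenshtein(summary, reference) / max(len(reference), 1)
```

A lower score means the generated summary is closer to the reference; unlike ROUGE, this penalises every inserted (possibly hallucinated) character as well as every omission.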
I should add that none of the metrics (including edit distance) showed what I would consider to be a high correlation. As mentioned in Reiter 2018, what I see in other fields of science suggests that useful metrics should have a correlation of at least 0.7 (ideally 0.85 or higher) with high-quality human evaluations. In Fran's work, Levenshtein edit distance had a correlation of 0.55 with post-edit time when using a human-written note as a reference text (all other metrics he looked at had lower correlations), which of course is below 0.7.
So logically, what we should see in summarisation is a lot more high-quality human-based evaluation (perhaps similar to what Fran did), supplemented by exploration of new ideas in automatic evaluation of summarisation. We should also take into account that there are many different use cases for summarisation (summarising doctor-patient consultations for the medical record is very different from summarising news articles for intelligence analysts), which probably means that we need a range of evaluation techniques!
Will this happen, or will summarisation researchers continue to fixate on ROUGE? I have had some very cynical conversations about this in the past with senior researchers in the field, which was pretty depressing. But I like to think that times are changing, not least because of the above-mentioned developments with BLEU, and that we will see a healthier attitude towards summarisation evaluation in the future.