I’ve seen several papers recently which criticise standard summarisation datasets (CNN/DailyMail, XSum) because they don’t contain actual summaries, which means that systems trained on these datasets are not actually generating summaries. I think this is a really important point. These datasets work great from a “leaderboard” perspective (and there are lots of papers about them) since it’s easy to set up contests based on them. But if our goal is to advance summarisation technology, would it not be better to use datasets which actually contain summaries?
CNN/DailyMail (https://huggingface.co/datasets/cnn_dailymail) is probably the most popular dataset in summarisation research. It contains news articles (published by CNN or the Daily Mail), together with “highlights” of each article. It is easiest to see this by looking at an example.
http://edition.cnn.com/2007/SHOWBIZ/Movies/07/23/potter.radcliffe.reut/index.html shows the source of the first row in the dataset on Huggingface. The dataset contains Article and Highlight fields. In this example, the Article is the text starting with “Harry Potter star Daniel Radcliffe gains access to a reported £20 million”. The Highlights are the “STORY HIGHLIGHTS” bullet items in the top right of the page; these are concatenated to produce the article summary when this dataset is used for summarisation.
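To make the construction concrete, here is a minimal Python sketch of how the reference summary is assembled from the highlight bullets. (The bullet text below is an abridged, illustrative stand-in for the “STORY HIGHLIGHTS” box, not the exact dataset row.)

```python
# Illustrative (abridged) highlight bullets, standing in for the
# "STORY HIGHLIGHTS" box of the Radcliffe article.
highlights = [
    "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18",
    "Young actor says he has no plans to fritter his cash away",
    "Earnings from first five Potter films have been held in a trust fund",
]

# In the dataset, the bullets are simply joined, one per line, to form
# the reference "summary" that models are trained to reproduce.
reference_summary = "\n".join(highlights)
print(reference_summary)
```

Nothing in this construction requires the bullets to cover the article’s main points; they are simply whatever the editors chose to put in the highlights box.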
As pointed out by Gehrmann et al. and others, these Highlights are not intended to be summaries of the Article; their role is primarily to encourage readers to read the article. Generating “clickbait” for news articles may be a real-world NLP task, but it is not summarisation!
XSum (https://huggingface.co/datasets/xsum) is the second most popular dataset in summarisation, and is based on articles from the BBC News website. The dataset contains Document and Summary fields. Essentially, the Summary is the first line of the article as published on the website, and the Document is everything that follows it. Again, this is easier to see with an example.
https://www.bbc.co.uk/news/uk-scotland-south-scotland-35232142 shows the source of the first row in the dataset on Huggingface. The sentence in bold beneath the picture (“Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.”) is the Summary recorded in XSum. The XSum Document is the text beneath this sentence (starting with “The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.”).
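The construction can be sketched in a few lines of Python, using the two sentences quoted above (the real dataset was of course extracted from the article HTML, and the Document here is abridged to its first sentence):

```python
# The bold lead sentence of the BBC article becomes the XSum "Summary"...
summary = (
    "Clean-up operations are continuing across the Scottish Borders and "
    "Dumfries and Galloway after flooding caused by Storm Frank."
)

# ...and the body text that follows it becomes the "Document"
# (abridged to its first sentence here).
document = (
    "The full cost of damage in Newton Stewart, one of the areas worst "
    "affected, is still being assessed."
)

# XSum asks a model to generate `summary` from `document` alone -- but
# note that "Storm Frank" appears only in the lead sentence, not the body.
print("Storm Frank" in summary, "Storm Frank" in document)  # prints: True False
```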
The Summary is clearly not a summary of the Document, not least because it contains information not present in the Document; for example, in this case the Summary says “caused by Storm Frank”, but the Document never mentions Storm Frank. Note that this is not a hallucination in the context of the source article; it’s simply a fact which was mentioned in the first sentence but not repeated afterwards. The XSum “Summary” is what journalists call a lead sentence, whose goal is partly to grab the reader’s interest and encourage them to keep reading; crucially, it is an integral part of the article, not a stand-alone summary.
It’s possible that generating lead sentences is a useful task, although it’s not something I have heard journalists request (usually they want NLG to automate writing the boring details in the article body, and are keen to write the lead sentence themselves). But in any case, lead sentences are not summaries.
Of course there are many summarisation datasets which do contain real summaries! For example, Francesco Moramarco (one of my PhD students) and his colleagues created the PriMock57 dataset (https://aclanthology.org/2022.acl-short.65/). Francesco is working on generating summaries of doctor-patient consultations which can be entered into the patient’s medical record (after the summaries are checked and edited by a clinician). This is an important real-world summarisation task. Public datasets are difficult to create for this task because medical data about patients is confidential, but Francesco and his colleagues managed to do this by having actors do mock consultations with doctors, which means that PriMock57 does not contain real personal data. PriMock57 includes evaluations (by doctors) as well as the consultations and summaries.
Unfortunately, PriMock57 is poorly suited to leaderboards, since it is small, uses specialised medical language, and requires that high-quality evaluations be carried out by clinicians. So perhaps it is not surprising that, despite the fact that PriMock57 is a real summarisation dataset for an important real-world task, very few people use it. Being somewhat cynical, I wonder if summarisation researchers on the whole want “leaderboard-friendly” datasets more than they want datasets which include genuine summaries for real-world tasks…
I’m not saying that CNN/DailyMail and XSum are useless; it may well be that working on these problems leads to approaches which also work on genuine summarisation tasks. But this needs to be demonstrated, i.e. we need to see evidence that models developed on CNN/DailyMail and XSum can indeed produce high-quality summaries. Showing that models developed on these datasets work well *on these datasets* is not sufficient! And people working on CNN/DailyMail and XSum should honestly describe their work as generating highlights or lead sentences; they should not say they are generating summaries.
More generally, given the increasing number of papers I see on summarisation, I don’t understand why the community doesn’t switch to actual summarisation datasets. Is this not the best way to make progress in summarisation?