Challenging NLG datasets and tasks

Several times over the past few months I’ve been a bit annoyed at papers which use heavyweight deep learning technology to tackle a fairly easy NLG task such as E2E (generating short sentences which summarise features of a restaurant). I should say that I have huge respect for the 2017 E2E challenge! It was a milestone in neural NLG which highlighted and explored many key issues such as hallucination. But from the perspective of 2021, I wish people interested in neural NLG would focus on tasks and datasets which are more challenging for rule- and template-based approaches, in order to show that neural approaches “add value.” This is hard to do with E2E, since we can build a decent rule-based E2E system in a day using a tool such as Arria NLG Studio or indeed just by writing Python code.

In other words, if I put on my “commercial” hat, I can imagine a discussion as follows with a client who wants an E2E system:

Mr Smith, there are two ways we can build your NLG system:

  • Rule-based: It will take us a day to build the system, plus another few days for quality assurance, integration, and deployment. The system should always produce decent-quality, 100% accurate texts. If it doesn’t, file a bug report and we can easily fix the system. The system is also easy to change if you want to tweak its language or behaviour.
  • Neural: We can build the model in an hour, but it will probably take a few days to clean and prepare the data (several weeks if we have to ask humans to write training texts). Plus a few days for quality assurance, integration, and deployment. The system will produce some really nice texts, but unfortunately it will also sometimes produce low-quality or inaccurate texts. Fixing bugs and modifying/tweaking behaviour or language will be difficult.

I can tell you that when presented with the above choice, Mr Smith will opt for the rule-based system 99% of the time! So I would like to see neural NLG researchers focusing on tasks and datasets which are harder (or impossible) to do with rules and templates.
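To give a feel for what “just writing Python code” can look like, here is a minimal sketch of a template-based generator for E2E-style inputs. It is only an illustration under stated assumptions: the attribute names are modelled on the E2E meaning representation as I understand it, while the function names (`realise`, `join_clauses`), the `AREA_PHRASES` lookup, and all of the wording templates are hypothetical and are not taken from Arria NLG Studio or from any actual E2E system.

```python
from typing import Dict, List

# Hypothetical phrasings for two area values assumed for this sketch.
AREA_PHRASES = {"riverside": "by the riverside", "city centre": "in the city centre"}


def join_clauses(clauses: List[str]) -> str:
    """Join clauses as 'a, b and c'."""
    if len(clauses) <= 1:
        return "".join(clauses)
    return ", ".join(clauses[:-1]) + " and " + clauses[-1]


def realise(mr: Dict[str, str]) -> str:
    """Turn an E2E-style meaning representation into one or two sentences."""
    name = mr["name"]  # assumption: the name is always present
    eat_type = mr.get("eatType", "restaurant")
    kind = f"{mr['food']} {eat_type}" if "food" in mr else eat_type
    article = "an" if kind[0].lower() in "aeiou" else "a"

    # First sentence: what the venue is and where it is.
    first = f"{name} is {article} {kind}"
    if "area" in mr:
        area = mr["area"]
        first += " " + AREA_PHRASES.get(area, f"in the {area} area")
    if "near" in mr:
        first += f" near {mr['near']}"
    sentences = [first + "."]

    # Optional second sentence: price, rating, family-friendliness.
    details = []
    if "priceRange" in mr:
        details.append(f"its price range is {mr['priceRange']}")
    if "customer rating" in mr:
        details.append(f"its customer rating is {mr['customer rating']}")
    if "familyFriendly" in mr:
        friendly = mr["familyFriendly"] == "yes"
        details.append("it is family-friendly" if friendly else "it is not family-friendly")
    if details:
        second = join_clauses(details)
        sentences.append(second[0].upper() + second[1:] + ".")
    return " ".join(sentences)


example = {
    "name": "The Eagle",
    "eatType": "coffee shop",
    "food": "Italian",
    "priceRange": "cheap",
    "customer rating": "average",
    "area": "riverside",
    "familyFriendly": "no",
    "near": "Burger King",
}
print(realise(example))
```

Every word of the output comes from either the input or a fixed template, so a sketch like this cannot hallucinate, and changing the wording is a one-line edit. A production system would of course need more varied phrasing and proper testing, which is where the remaining days of effort go.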

Below are some suggestions for more challenging datasets and tasks. I focus on tasks which I have encountered in a commercial context, because I have a better understanding of what’s involved in these. For example, I won’t discuss WebNLG and ToTTo, since these are very different from any commercial projects I have worked on.


Weather forecasts

Generating weather forecasts is one of the oldest applications of NLG. It is possible to build very good rule-based NLG systems to generate weather forecasts; indeed, in one evaluation, forecast readers preferred texts generated by our SumTime forecast generator over texts written by human forecasters. However, the effort required is non-trivial (person-months of effort, hundreds or thousands of rules and templates), especially if different types of forecasts are required, or if forecasts must be tailored for individual users or dialogue contexts. So while this task can be done by rule-based systems, neural approaches could make sense if they reduced development time and effort while still producing excellent forecasts. Facebook has developed a neural weather forecast generator which may be deployed.

In terms of datasets, Facebook has released its dataset; note that its texts were written by annotators specifically to train neural NLG models, so they are not “naturally occurring” weather forecasts. People who want to train on actual weather forecasts written by human forecasters are welcome to use the dataset from our SumTime project. But please do NOT use the “weathergov” dataset, since its texts are the output of a template-based system.

Automatic journalism

There is a lot of interest in using NLG to produce news and sports stories. To take one small example, the BBC used Arria Studio to generate election reports. Machine learning techniques can also be used here; indeed the first commercial application of ML-based NLG that I am aware of is Kondadadi et al 2013, who generated short financial news stories.

In any case, rule-based NLG systems do reasonably well at generating news stories. However, they often lack flexibility, i.e. the ability to adapt the structure and content of the narrative based on the specific circumstances of the story. If neural systems could do this well while still producing accurate and readable stories, they could have advantages over rule-based NLG systems. This requires being able to generate readable narratives which are hundreds of words long and contain no hallucinations.

Probably the best known journalism dataset in NLG is the Rotowire dataset of basketball summaries. If people are interested in this, I recommend they look at Craig Thomson’s SportSett dataset, which fixes many of the problems in the original Rotowire dataset.

Discharge summaries

I was trying to think of a really challenging dataset and task which is hard to do with rule-based NLG and has arisen in commercial discussions, and one possibility is generating discharge summaries, which summarise what happened to a patient during a hospital stay. These are difficult for rule-based systems because they need to cover an enormous range of potential clinical data and interventions, for patients with extremely diverse problems, in a short narrative describing a hospital stay which may last days or weeks. Also, the clinical data is noisy, and human-written discharge summaries may contain abbreviated language and in some cases are incorrect. So if neural NLG can reliably produce high-quality discharge summaries from clinical data, I will be impressed!

One potential dataset is MIMIC (I have never used this myself, but I believe it contains both clinical data and human-written discharge summaries).

5 thoughts on “Challenging NLG datasets and tasks”

  1. I would add one comment on the Rule-based vs Neural:

    Rule-based uses exclusively structured data as an input.
    While Neural can use text as an input (e.g. GPT-x), and sometimes structured data.

    When structured data is not available, rule-based approach is not possible.

    Please amend if I’m wrong.


  2. Hi, I’m focusing on data-to-text (which is my interest) here, not text-to-text. One can certainly do some text analytics within data-to-text, such as sentiment analysis or information extraction, via an analytics module which analyses texts in the input and effectively produces structured data which goes into the D2T system (which can be either rule-based or neural). But text summarisation, for example, is a very different task which has its own datasets (which are very different from the ones above).


  3. LOL. I was working on a system to summarize press releases when I ran into your name again and again. I said to myself, ~this guy might know something.~ Google, then here.
    I skimmed through the titles in your blog; this one caught my eye. The first sentence made me laugh out loud. I thought the same thing, but the answer to your question is simple.
    All the “cloud platforms” – AWS, Google Borg, M$ Azure, and others – need customers, so they have to provide “free” services to people who use their platforms and publish. Some do not even require credit – as their own (respective) PR will deal with it.
    Since the days of the movie “The Fifth Element”, I have been on-again off-again on Robots and AI. I’m now back. I have a plan. I know this will work. However, I need to read more of your work. I think we are on parallel paths.
    ALL the Best
    On Twitter as @borderObserver

