Several times over the past few months I’ve been a bit annoyed at papers which use heavyweight deep learning technology to tackle a fairly easy NLG task such as E2E (generating short sentences which summarise features of a restaurant). I should say that I have huge respect for the 2017 E2E challenge! It was a milestone in neural NLG which highlighted and explored many key issues such as hallucination. But from the perspective of 2021, I wish people interested in neural NLG would focus on tasks and datasets which are more challenging for rule- and template-based approaches, in order to show that neural approaches “add value.” This is hard to do with E2E, since we can build a decent rule-based E2E system in a day using a tool such as Arria NLG Studio, or indeed by just writing Python code.
In other words, if I put on my “commercial” hat, I can imagine a discussion as follows with a client who wants an E2E system:
Mr Smith, there are two ways we can build your NLG system:
- Rule-based: It will take us a day to build the system, plus another few days for quality assurance, integration, and deployment. The system should always produce decent-quality, 100% accurate texts. If it doesn’t, file a bug report and we can easily fix the system. The system is also easy to change if you want to tweak its language or behaviour.
- Neural: We can build the model in an hour, but it will probably take a few days to clean and prepare the data (several weeks if we have to ask humans to write training texts). Plus a few days for quality assurance, integration, and deployment. The system will produce some really nice texts, but unfortunately it will also sometimes produce low-quality or inaccurate texts. Fixing bugs and modifying/tweaking behaviour or language will be difficult.
I can tell you that, when presented with the above choice, Mr Smith will opt for the rule-based system 99% of the time! So I would like to see neural NLG researchers focusing on tasks and datasets which are harder (or impossible) to do with rules and templates.
Below are some suggestions for more challenging datasets and tasks. I focus on tasks which I have encountered in a commercial context, because I have a better understanding of what’s involved in these. For example, I won’t discuss WebNLG and ToTTo, since these are very different from any commercial projects I have worked on.
Generating weather forecasts is one of the oldest applications of NLG. It is possible to build very good rule-based NLG systems to generate weather forecasts; indeed, in an evaluation, forecast readers preferred texts generated by our SumTime forecast generator over texts written by human forecasters. However, the effort required is non-trivial (person-months of effort, hundreds or thousands of rules and templates), especially if different types of forecasts are required, or if forecasts must be tailored for individual users or dialogue contexts. So while this task can be done by rule-based systems, neural approaches could make sense if they reduced development time and effort while still producing excellent forecasts. Facebook has developed a neural weather forecast generator which may be deployed.
In terms of datasets, Facebook has released its dataset; note that its texts were written by annotators specifically to train neural NLG models, so they are not “naturally occurring” weather forecasts. People who want to train on actual weather forecasts written by human forecasters are welcome to use the dataset from our SumTime project. But please do NOT use the “weathergov” dataset, since its texts are the output of a template-based system.
There is a lot of interest in using NLG to produce news and sports stories. To take one small example, the BBC used Arria Studio to generate election reports. Machine learning techniques can also be used here; indeed, the first commercial application of ML-based NLG that I am aware of is Kondadadi et al 2013, who generated short financial news stories.
In any case, rule-based NLG systems do reasonably well at generating news stories. However, they often lack flexibility, i.e. the ability to adapt the structure and content of the narrative based on the specific circumstances of the story. If neural systems could do this well while still producing accurate and readable stories, they could have advantages over rule-based NLG systems. This requires being able to generate readable narratives which are hundreds of words long and contain no hallucinations.
Probably the best known journalism dataset in NLG is the Rotowire dataset of basketball summaries. If people are interested in this, I recommend they look at Craig Thomson’s SportSett dataset, which fixes many of the problems in the original Rotowire dataset.
I was trying to think of a really challenging dataset and task which is hard to do with rule-based NLG and has arisen in commercial discussions, and one possibility is generating discharge summaries, which summarise what happened to a patient during a hospital stay. These are difficult for rule-based systems because they need to cover an enormous range of potential clinical data and interventions, for patients with extremely diverse problems, in a short narrative about a hospital stay which lasts days or weeks. Also, the clinical data is noisy, and human-written discharge summaries may contain abbreviated language and indeed in some cases are incorrect. So if neural NLG can reliably produce high-quality discharge summaries from clinical data, I will be impressed!
One potential dataset is MIMIC (https://mimic.physionet.org/). I have never used this myself, but I believe it contains both clinical data and human-written discharge summaries.