It is essential that NLG texts be accurate, and this is a challenge for neural systems in particular. But accuracy is not sufficient: NLG texts must also communicate the key information and insights that users need to know. Choosing this information is called content selection (or content determination), and it can be hard for rules-based systems to do robustly.
This came up a few weeks ago when I gave an invited talk at NLDB on “Natural Language Generation and Business Intelligence”, which is a very exciting area in commercial NLG. Since I’m limited in what I can say publicly about Arria (although I have discussed one Arria NLG+BI project in another blog), I focused my talk instead on the work we did in the (rules-based) Babytalk project, where we compared the effectiveness of visualisations and text narratives for presenting data to clinicians who were treating a baby in a neonatal intensive care unit. We ran experiments where we presented information in different ways, asked clinicians to make a decision, and checked whether they made good decisions.
I mentioned in my talk that human-written narratives were more effective (led to better decisions) than computer-generated ones, and someone asked me why this was the case. I actually wrote a paper about this years ago, focusing on the fact that humans wrote better narratives. But thinking about this again in 2020, I think this is not the full story. What also makes a big difference is that human domain experts are much better at selecting appropriate content in a wide variety of situations. I give some examples below.
Common cases must be covered
Babytalk was designed to help clinicians make decisions about looking after and treating babies in neonatal ICU. We designed the summaries to give helpful information about the baby’s physiological state, and put a lot of effort into reducing sensor noise and artefacts.
In some contexts, though, the best clinical decision is to re-attach sensors to reduce noise and get better data (since babies kick and move, sensors often become partially detached which means they return a lot of noise). Not surprisingly, Babytalk texts were terrible when “re-attach sensors” was the best decision, since Babytalk texts focused on the baby’s true physiological state, and said nothing about sensor noise.
We knew when we designed Babytalk that “re-attach sensor” was a possible clinical decision, but we put very little thought into supporting this action, or indeed the “do nothing” action, despite the fact that “re-attach sensor” and “do nothing” were common clinical decisions.
In other words, we focused on helping clinicians make complex and difficult decisions about medication and surgery, and ignored the fact that often the best decision is to do nothing or try to get better sensor data. Our lack of interest in “boring” cases hurt the effectiveness of Babytalk’s texts, especially from a content-selection perspective.
Cover as many unusual cases as possible
I’ve written about edge cases in a previous blog. Babies in a neonatal unit suffer from all sorts of problems and conditions. Some are common, but there is a large “long tail” of problems which individually are rare, but collectively impact a large number of babies.
In Babytalk, we had a corpus which consisted of roughly 100 data sets, each with a corresponding summary written by a clinician. This gave us some coverage of common cases, but very little coverage of long-tail cases.
An experienced clinician, in contrast, has decades of experience, during which they may have personally worked with thousands of babies and discussed thousands more with colleagues. This means that when such a clinician writes a text, they have a much better understanding of unusual problems than Babytalk (based on 100 cases) did! Again, this had a major impact on content selection (less so on microplanning and realisation), because Babytalk did not know what information was relevant in unusual cases.
Of course, the ideal way to deal with the above is to train systems on large data sets which show what really happens in a hospital (or elsewhere), including “boring” as well as interesting cases; and which also are large enough to cover a large number of unusual cases.
Unfortunately, in most contexts where NLG systems are explaining data to decision makers, we will *not* have access to a large corpus of human-written explanations, since people do not normally write such things. So end-to-end neural approaches are unlikely to work.
However, I think there is still a lot we can do with data-driven approaches. Usually (at least in my experience), NLG developers do have access to large collections of input data, and it should be possible to analyse these to find common cases (boring as well as interesting) and develop some kind of profile of unusual cases. Data analysts already do this kind of thing, and we can take advantage of their work.
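As a minimal sketch of what such an analysis might look like: given a collection of input data sets, we can flag the long-tail cases as statistical outliers and treat everything else as the common (possibly “boring”) cases that also need coverage. The field names and values below are invented for illustration; real Babytalk inputs were far richer.

```python
# Sketch: profiling a collection of NLG input data sets to separate
# common cases from long-tail (unusual) ones, using a simple z-score.
# All field names and thresholds here are illustrative assumptions.
from statistics import mean, stdev

def profile_cases(cases, field):
    """Split cases into common vs unusual by |z-score| > 2 on `field`."""
    values = [c[field] for c in cases]
    mu, sigma = mean(values), stdev(values)
    common, unusual = [], []
    for c in cases:
        z = (c[field] - mu) / sigma if sigma else 0.0
        (unusual if abs(z) > 2 else common).append(c)
    return common, unusual

# 30 routine cases plus one long-tail case (hypothetical heart rates)
cases = [{"id": i, "heart_rate": 140 + (i % 7)} for i in range(30)]
cases.append({"id": 99, "heart_rate": 230})
common, unusual = profile_cases(cases, "heart_rate")
print(len(common), len(unusual))  # → 30 1
```

In practice one would profile many fields jointly (and use proper anomaly-detection tooling), but even a crude split like this tells developers which situations their content-selection rules must cover well because they happen all the time, and which rare situations need explicit attention.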
More ambitiously, I wonder if we can learn content selection rules from other types of documents? Again, in my experience, developers often have access to documents which are not the sort of thing we want the NLG system to produce, but which nonetheless contain very useful information about what is important in particular cases; for example, discharge letters in clinical contexts. Perhaps we can extract content-selection models from these documents, and use them in our NLG systems? I’ll leave this as a challenge to my readers; progress here could certainly make a big difference to NLG!
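One crude way to start on this challenge: count how often each candidate content item is actually mentioned in human-written documents of each case type, and use those frequencies as importance scores for content selection. The content items, case types, and letter snippets below are invented assumptions, not real clinical data, and simple word matching stands in for proper information extraction.

```python
# Sketch: mining content-selection priorities from existing documents
# (e.g. discharge letters), by counting which candidate content items
# human authors chose to mention for each case type. Terms are invented.
from collections import Counter, defaultdict

def mention_counts(letters, candidates):
    """letters: list of (case_type, text). Returns, per case type, a
    Counter of which candidate content items appear in the text."""
    counts = defaultdict(Counter)
    for case_type, text in letters:
        words = set(text.lower().split())
        for item in candidates & words:
            counts[case_type][item] += 1
    return counts

candidates = {"bradycardia", "desaturation", "intubation", "feeding"}
letters = [
    ("preterm", "episodes of bradycardia and desaturation overnight"),
    ("preterm", "bradycardia resolved feeding established"),
    ("surgical", "intubation required before theatre"),
]
counts = mention_counts(letters, candidates)
print(counts["preterm"].most_common(1))  # → [('bradycardia', 2)]
```

The resulting per-case-type frequencies could then act as a prior: content items that human experts reliably mention for a given case type get selected first by the NLG system.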
Data-to-text NLG systems must produce accurate texts, and these texts need to communicate what users want to know. Rules-based NLG systems such as Babytalk do well at producing accurate texts, but robustly choosing the most appropriate content can be challenging, and I think data-based approaches could really help here.
3 thoughts on “We Need Robust Ways to Select Content of NLG Texts”
If we take baby steps 🙂 to advance this: use (supervised) machine learning for content selection, where the selection is made from a pre-defined set of content and commentary options.
This might generate some less relevant commentary (due to statistical errors in content selection), but at least it will ensure accurate content reporting.
Hi, if we are trying to learn a simple content-selection model (as you suggest), this will reduce the amount of training data needed, but we’ll still need some, which could be a problem. Although maybe, if we are just learning to select between pre-defined options, we could ask human domain experts (SMEs) to do this task themselves and create training data that way?
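To make the idea in this exchange concrete: with SME-labelled examples mapping case features to a chosen pre-defined content option, even a tiny nearest-neighbour rule can act as the selector. The features, option labels, and toy numbers below are all invented for illustration; this is a sketch of the approach, not a proposed production design.

```python
# Sketch: supervised selection between pre-defined content options,
# using a minimal 1-nearest-neighbour rule over SME-labelled cases.
# Feature names and option labels are hypothetical.

def select_content(labelled, case):
    """labelled: list of (feature_vector, option). Returns the option
    of the nearest labelled case (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labelled, key=lambda ex: dist(ex[0], case))[1]

# SME-labelled cases: (sensor_noise_level, heart_rate_trend) -> option
labelled = [
    ((0.9, 0.0), "report-sensor-noise"),
    ((0.1, -0.8), "report-deterioration"),
    ((0.1, 0.0), "report-stable"),
]
choice = select_content(labelled, (0.8, 0.1))  # a noisy-sensor case
print(choice)  # → report-sensor-noise
```

Because the output space is a small fixed set of options, labelling is fast for SMEs, and relatively few examples are needed compared to learning free-form generation end to end.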