
Challenges are Same for Neural and Rule NLG

I recently had a chat with a PhD student (not from Aberdeen), who told me that a few years ago she had presented a paper at a conference, and afterwards started talking to another researcher who seemed very interested in her work. That lasted until the student said that she was using rule-based NLG, at which point the other researcher lost all interest and indeed made some derogatory comments about rule-based NLG.

I find this sort of attitude bizarre. Partly because in the kind of data-to-text NLG I do, it's a lot easier to build production systems using rule-based NLG than neural NLG. But mostly because I think the fundamental challenges of applied NLG are the same, regardless of whether we build systems with rules or deep learning, and we should welcome all research on these challenges regardless of the technology used to build the system.

I am also disappointed that only a minority of neural NLG researchers seem to engage with the challenges below. I suspect this is partially because of the leaderboard fixation, which steers people to tweaking neural models rather than investigating deeper issues.

When is NLG Useful

One of the most important challenges in NLG is figuring out when it is useful. For example, when do people want to see information presented as texts, and when would they prefer to see data visualisations? How much information should be presented in a text? How should users interact with an NLG system in order to get additional information, explanations, etc (dialogue? clicking on words in a text?).

I suspect these are fundamentally HCI questions more than AI questions. Either way, they apply to all NLG systems, regardless of how the systems work. However, investigating this issue requires moving away from a fixation on beating the state of the art on a leaderboard, and instead asking whether the task the leaderboard is based on is actually sensible and useful in the real world.


Content

Content is king in data-to-text NLG. Texts need to communicate useful insights about data, otherwise they are not useful no matter how well they are written. Hence choosing insights/content is an extremely important challenge in NLG. In my experience, once texts have an acceptable level of readability, users are more interested in better content than in better expression.

We can choose content (insights) using rules or machine-learning (or a combination). I personally have worked on a number of NLG systems which used ML for content and rules for expression.

Anyway, content matters regardless of whether we use rules or deep learning! However, I believe that it's a lot easier to develop techniques for content determination (and understand the underlying principles of “articulate analytics”) if we treat this as a separate task from other parts of NLG, i.e., do not take an “end-to-end” ML approach to building an NLG system.
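To make the separation concrete, here is a minimal sketch of a pipeline in which content determination and expression are distinct steps. Everything in it is hypothetical: the `Insight` class, the field names, and the importance scores are made up for illustration, and the `score` values simply stand in for whatever an ML model or rule set would produce in a real system.

```python
from dataclasses import dataclass


@dataclass
class Insight:
    field: str     # which data field the insight is about (hypothetical)
    change: float  # relative change, e.g. 0.12 means +12%
    score: float   # importance; a stand-in for an ML model's or rule's output


def select_content(insights, max_items=2):
    """Content determination: keep the highest-scoring insights."""
    return sorted(insights, key=lambda i: i.score, reverse=True)[:max_items]


def realise(insight):
    """Expression: a simple rule-based template, independent of selection."""
    name = insight.field.capitalize()
    if insight.change == 0:
        return f"{name} was unchanged."
    direction = "rose" if insight.change > 0 else "fell"
    return f"{name} {direction} by {abs(insight.change):.0%}."


def generate(insights, max_items=2):
    """Run content determination first, then express each chosen insight."""
    return " ".join(realise(i) for i in select_content(insights, max_items))


insights = [
    Insight("revenue", 0.12, 0.9),
    Insight("headcount", 0.0, 0.1),
    Insight("costs", -0.05, 0.7),
]
print(generate(insights))
```

Because the two steps only communicate through the list of selected insights, `select_content` could be swapped for a learned model without touching `realise`, and vice versa.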

Edge cases and robustness

Whenever I talk to anyone building a production rule-based NLG system, one of their biggest headaches is ensuring that good quality texts are generated under all circumstances, especially in edge cases. Indeed, a lot of the research in rule-based NLG is essentially trying to automate processing to make it more robust in edge cases; e.g., building realisers which are guaranteed to produce grammatically correct sentences under all conditions.
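As a small illustration of the kind of edge case involved, consider a rule that reports a count. The naive template "There were N alerts." goes wrong for zero and one. The sketch below uses a hypothetical `realise_count` function (not taken from any real realiser library) and handles those cases explicitly.

```python
def realise_count(n, singular, plural=None):
    """Report a count in a sentence that stays grammatical for 0, 1 and many."""
    if n < 0:
        # Refuse impossible input instead of generating a misleading text.
        raise ValueError("count must be non-negative")
    if plural is None:
        plural = singular + "s"  # assumption: regular plural unless one is given
    if n == 0:
        return f"There were no {plural}."
    if n == 1:
        return f"There was one {singular}."
    return f"There were {n} {plural}."


for n in (0, 1, 3):
    print(realise_count(n, "alert"))
```

Real realisers have to deal with many more interactions (agreement, irregular morphology, missing data), which is why guaranteeing correct output under all conditions is a research problem rather than a few `if` statements.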

When I look at neural NLG, again robustness is one of the biggest challenges I see for production usage. How can we guarantee that a neural system will always produce accurate, readable, and correct texts? It's encouraging to see some good work on this issue from companies such as Facebook. Unfortunately academics working on neural NLG seem to have little interest in this question, perhaps because of the emphasis in academic evaluations on average-case performance instead of on worst-case performance.
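The difference between average-case and worst-case evaluation can be shown with a toy calculation. The per-text quality scores below are invented purely for illustration (they are not results from any real system or evaluation), and `summarise` is a hypothetical helper.

```python
from statistics import mean


def summarise(scores):
    """Report both the average-case and the worst-case quality of a system."""
    return {"mean": mean(scores), "worst": min(scores)}


# Made-up per-text quality scores in [0, 1] for two hypothetical systems.
system_a = [0.95, 0.95, 0.95, 0.30]  # usually excellent, one very bad text
system_b = [0.75, 0.75, 0.75, 0.70]  # less impressive, but never bad

a, b = summarise(system_a), summarise(system_b)
print("A:", a)
print("B:", b)
```

In this toy example system A wins on the mean while system B wins on the worst case. A leaderboard that only reports the mean would prefer A, whereas someone deploying a production system may well prefer B.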

Configuration and control

At the recent INLG panel on what users want from real-world NLG, one topic that came up a lot was the need for users (including both end users and companies purchasing NLG solutions for their clients) to be able to control and configure the NLG system. That is, if a user doesn't like the content or wording of an NLG text, she should have some ability to easily change this, in a manner which does not require knowledge of programming, linguistics, or ML. Similarly, if users need to update NLG systems because of changes in the world (e.g., domain shift, such as frequently-changing Covid rules), we want this to be straightforward.

This is a real challenge for all NLG systems, regardless of how they are built. For rule-based NLG, it's tempting to let users edit rules, but this is hard for people who don’t have NLG development skills, and also raises quality assurance problems. For neural NLG, we need to abandon the idea that the goal is to replicate training data, which of course is central to much of ML. One general issue here is evaluation: how can we evaluate the effectiveness of approaches to configuration?


Evaluation

To follow up the last point, it is essential that we be able to properly evaluate NLG systems, regardless of how they are built. There has been a lot of good research on NLG evaluation recently, but little of it has addressed the above issues. We need good techniques for evaluating NLG usefulness, content, robustness, and configuration, and developing these is a major challenge.

The first step in addressing the evaluation challenge is to come up with high-quality “gold-standard” evaluations based on deep understanding of the challenges and discussions with real-world users about what is important to them. Metrics should only be developed after we have gold-standard evaluations. From this perspective, I am disappointed that many researchers are fixated on metrics and not interested in protocols for human evaluation.

Final thoughts

If we want data-to-text NLG to be a useful real-world technology, we need to address the above challenges! And the basic issues are the same regardless of whether we build NLG systems with rules or transformers. I’d love to see more research on these challenges from all sorts of perspectives (and less of a fixation on leaderboards); this is the path to genuine progress in NLG.

One thought on “Challenges are Same for Neural and Rule NLG”

  1. As you mentioned Content is the king of NLG, when you applied rule based to generate text, the king is in the ML algorithms that identify the content from the data. So it is not strange researchers lost their interests. Rule based NLG is a practical way, but not the core. People are interested in ML NLG also because it could generate something new, the content itself, not the grammar. But as you say, we could not control it well, and do not know how to evaluate it.

