Maintaining NLG Systems

I’m preparing supporting material for my new book on NLG, and I realised while doing this that I’ve written very little about a very important real-world NLG issue: software maintenance of NLG systems (bug fixes, adapting to new data sources, supporting changing user needs, etc). I have written some thoughts about this below; there will be more in the book.

Example: SumTime

In the early 2000s, we worked on NLG weather forecast generators and built a system, SumTime, which produced very good weather forecasts; indeed, forecast users sometimes preferred SumTime forecasts to human-written ones (Reiter et al 2005). We were working with a company, Weathernews, which liked SumTime and deployed it operationally, in a “human-in-the-loop” context where human forecasters checked and edited SumTime forecasts before they were released to clients (Sripada et al 2005).

All very exciting! But of course Weathernews wanted changes to SumTime: bug fixes, adaptation to new weather data sources, customisation options for clients, changes to accommodate evolving use cases, etc. This kind of thing is normal and expected; any software product which is used by real users needs to adapt and evolve in similar ways. Unfortunately, it was very difficult for us as academics to maintain the SumTime software. So after a while we stopped upgrading and maintaining SumTime, which of course meant that Weathernews subsequently stopped using it.

Something very similar happened with the Babytalk BT-Family system (Mahamood and Reiter 2011). The hospital we worked with was keen to keep on using this software after the research project ended, but it was difficult to maintain the system to accommodate changes to the hospital’s patient record and IT systems, new interventions and guidelines, etc. So again, after a few years the hospital stopped using it.

Maintainable NLG systems

Of course SumTime and Babytalk were research projects. Since we wanted to evaluate these systems in live usage, our research goals included making the systems sufficiently robust and usable that they could be deployed and used by real users during an evaluation period. But our research goals did not include evaluating long-term usage, so we did not prioritise making the systems maintainable.

Another issue is bugs. Babytalk (which was much more complex than SumTime) had a number of bugs which manifested during its evaluation and use; these would have needed to be fixed for continued real-world usage.

We could have built SumTime and Babytalk in ways which made it much easier to plug in new data sources, update functionality as the domain and user needs evolved, etc; and we could also have done extensive software testing and quality assurance. But we didn’t, because these require a lot of effort which was not justified for an academic research project. Of course, commercial NLG providers do put considerable effort into data source portability, configuration, quality assurance, etc.
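To make the “plug in new data sources” idea concrete, one common approach is an adapter interface: each data feed is wrapped in a class that maps its own format onto a single canonical schema, so the generation code never changes when a feed does. The sketch below is purely illustrative (the names, the canonical schema, and the stubbed CSV feed are all my invention, not how SumTime was actually built):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class WeatherDataSource(ABC):
    """Adapter interface: each new feed implements this, mapping its own
    format onto one canonical schema used by the rest of the generator."""

    @abstractmethod
    def read_forecast_data(self, location: str) -> Dict[str, Any]:
        """Return weather parameters in the canonical schema."""


class CsvFeed(WeatherDataSource):
    """Hypothetical CSV-based feed; parsing is stubbed out for brevity."""

    def __init__(self, path: str):
        self.path = path

    def read_forecast_data(self, location: str) -> Dict[str, Any]:
        # A real implementation would parse self.path and convert units;
        # here we return fixed values to illustrate the canonical schema.
        return {"location": location, "wind_speed_kt": 12, "wind_dir": "SW"}


def generate_forecast(source: WeatherDataSource, location: str) -> str:
    """Generation code depends only on the interface, not on any feed."""
    data = source.read_forecast_data(location)
    return f"{data['wind_dir']} {data['wind_speed_kt']} knots"
```

Supporting a new feed then means writing one new adapter class, with no changes to the generator itself; the maintenance cost is localised rather than spread through the system.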

Maintaining rule-based systems like SumTime and Babytalk is conceptually similar to maintaining other software artefacts; it is definitely doable, but requires a lot of effort. However, additional challenges arise with neural NLG systems, especially those using large language models such as GPT:

  • Configuration: Some configuration and domain adaptation can be done via prompt engineering, but this does not offer complete fine-grained control. Some LLMs can be fine-tuned, but we may not always have corpora for fine-tuning, especially if we are trying to adapt the system to a changing world (domain shift).
  • Testing and quality assurance: this is very difficult for complex, stochastic, black-box neural models.
  • Model change: another challenge is that closed-source models such as GPT evolve over time. While this is intended to improve performance, it can also break configurations and introduce new bugs.
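One partial mitigation for the testing and model-change problems is to regression-test *properties* of outputs rather than exact strings, since a stochastic generator (or a silently updated model) will not reproduce outputs verbatim. The sketch below is a minimal illustration of this idea; `generate` is a placeholder standing in for any LLM-backed generator, and the specific checks are invented examples of what a forecast client might require:

```python
import re


def generate(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned forecast here.
    return "Southwesterly winds, 10-15 knots, easing later."


def check_forecast(text: str) -> list:
    """Return a list of violated properties (empty list = pass)."""
    problems = []
    if not re.search(r"\d+", text):
        problems.append("no wind speed mentioned")
    if len(text) > 200:
        problems.append("forecast too long for client format")
    for banned in ("hurricane", "typhoon"):
        if banned in text.lower():
            problems.append("unsupported term: " + banned)
    return problems
```

Running `check_forecast` over a suite of generated outputs after every model or prompt change gives at least some protection against silent regressions, though it is much weaker than the exact-output testing possible with rule-based systems.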

The above challenges need to be solved if we want LLM-based NLG systems to succeed, especially in professional contexts.

Final Thoughts

Real-world NLG software must be maintained! It’s a shame that so little is known about this; I have *never* seen an academic paper on maintaining NLG systems. We once tried to publish a paper about maintaining SumTime, but reviews were very negative. In 2024, “Industry tracks” at xACL conferences in theory solicit this kind of paper, but I don’t see them in practice – either they are not submitted, or reviewers are still very negative about the topic.

I find this frustrating… As above, software maintenance for rule-based systems can be partially based on conventional techniques; this was the approach Arria took when I was there, and it worked OK. But maintaining LLM-based NLG systems seems far more challenging, and we need to understand how to do this better.

2 thoughts on “Maintaining NLG Systems”

  1. Two additional remarks from my end:

    The two examples of BabyTalk and SumTime each refer to only one user of the system. Certainly, things get much more complicated when several (or: many) users are involved. They almost certainly will want/need different configurations, which adds another dimension of complexity. The – in product managers’ lingo – “ilities” (i.e. configurability, serviceability, installability etc.) typically receive little to no attention in scientific projects, which creates another hurdle for transitioning research prototypes to real-life use.

    And yes, for LLM-based systems, maintaining the system (with possibly many configuration options) while the black box underneath evolves in an unpredictable and undocumented manner is not only a nightmare, but simply “mission impossible”.
