Over the past few months, I have recorded some videos for Arria on good methodology for building NLG systems (some of these are better than others). Doing this has really brought home to me how important good engineering methodology is when building an NLG system! Academics love to talk about algorithms, but I suspect methodology is more important when building systems.
In software engineering, I think it’s useful to distinguish between methodologies for requirements analysis, design, implementation, testing, and support. At any rate, that’s how I have organised the videos and indeed this blog. Below are some high-level comments on how I think these tasks should be done.
Requirements Analysis
The most important part of building a software system is deciding what the system will do: in other words, requirements analysis. If we get requirements wrong, we’ll waste a huge amount of time and money building something which may be very clever but which is useless. Indeed, I think most of the big disasters in IT have been due to flawed understanding of requirements.
Of course there are lots of generic techniques for requirements analysis which can be adapted to NLG, such as rapid prototyping. I highly recommend that these techniques be supplemented by corpus analysis, which is an NLG-specific requirements technique that I have used for decades.
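To give a flavour of what corpus analysis can look like in practice, here is a minimal sketch (in Python) that counts frequent word trigrams across a folder of human-written example texts. The folder name and file layout are purely hypothetical, and real corpus analysis involves far more manual reading and judgement than this; the point is simply that recurring content and phrasing in the corpus are a good starting point for requirements.

```python
from collections import Counter
from pathlib import Path

def frequent_trigrams(corpus_dir, top_n=20):
    """Count the most common word trigrams across a corpus of human-written
    example texts -- a crude first pass at spotting recurring content and
    phrasing that the NLG system may need to reproduce."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        words = path.read_text(encoding="utf-8").lower().split()
        counts.update(zip(words, words[1:], words[2:]))
    return counts.most_common(top_n)

if __name__ == "__main__":
    # "corpus" is a hypothetical folder of example human-written reports.
    for trigram, count in frequent_trigrams("corpus"):
        print(count, " ".join(trigram))
```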
In part because NLG is a new technology, it is common for clients to want to change requirements as a project progresses and they become more familiar with what NLG can and cannot do. Some of these proposed changes to requirements may be quick and easy, others may require massive changes to the code. There is definitely a project and relationship management challenge in deciding when to accept modified requirements, and when to reject them unless the project is given more time and resources.
Design
In my experience, the single most important design decision in NLG is designing the interface between analytics and language. The “data-to-text” pipeline includes data analysis and interpretation as well as the usual NLG tasks of document planning, microplanning, and surface realisation. The interface between data-side and language-side processing is “messages”, and getting these right is very important. A good message design will simplify and clarify both data-side processing and language-side processing; a poor message design will lead to the language-side trying to do analytics and/or the data-side trying to do linguistics, which can cause problems.
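To make the idea of a message concrete, here is a minimal sketch of what one might look like for a financial-reporting system; the class and field names are illustrative assumptions on my part, not a description of any particular product’s design.

```python
from dataclasses import dataclass

@dataclass
class TrendMessage:
    """A domain-level fact produced by the data-side and consumed by the
    language-side.  It records *what* happened, not how to phrase it."""
    series: str        # e.g. "FTSE 100"
    direction: str     # "rise" or "fall", decided by the analytics
    magnitude: float   # percentage change over the reporting period
    period: str        # e.g. "today", "this week"

# The data-side might emit:
#   TrendMessage(series="FTSE 100", direction="rise", magnitude=1.2, period="today")
# and the language-side might realise this as:
#   "The FTSE 100 rose by 1.2% today."
```

The key design choice is that the message holds domain content only; deciding between “rose” and “went up”, or whether to mention the period at all, stays firmly on the language side.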
Of course design is also important within the data-side and language-side components; for example, where is it appropriate to use machine learning techniques? But designing the interface between data-side and language-side processing is the first thing the developer needs to do, and I think the most important.
For better or for worse, designing this interface (messages) is as much an art as a science. People who do this well tend to have lots of experience building systems, a good knowledge of both NLG and analytics, and familiarity with the domain. I suspect that designing AI systems in general is often as much an “art” as a science, including in deep learning.
Testing
Testing and quality assurance are also very important, and in my experience can be a huge “pain point” for people trying to build production-quality NLG systems. I’ve written about testing and QA for NLG in another blog entry. I won’t go into details again here, but I will say that two of the biggest challenges I have seen are testing variation and testing analytics.
We often want our NLG systems to vary the texts they produce, for example perhaps alternating between “the stock market went up” and “the stock market rose”. This kind of variation definitely increases user satisfaction, especially amongst users who read many texts produced by the same NLG system (eg, daily financial market updates). It’s also usually not too hard to implement. However, variation can be a major headache in testing. Testers will probably ask to see all possible variants, which may be impossible (eg, if a text has 20 binary variation points, that means over a million variants in all). And even if it is possible, it may not be straightforward for the NLG system to “flush out” all the variants. It may make sense to limit the amount of variation in an NLG text purely because of testing concerns.
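As a rough illustration of both the arithmetic and one way to make variants reproducible for testers, here is a small sketch; the seeded-random approach is just one possible design choice, not a recommendation of how any specific system should work.

```python
import random

# 20 independent binary variation points give 2**20 possible texts.
print(2 ** 20)   # 1048576

def choose(rng, a, b):
    """Pick one of two phrasings at a variation point."""
    return a if rng.random() < 0.5 else b

def market_sentence(change, seed):
    """Generate one variant; fixing the seed lets a tester reproduce
    a specific variant on demand."""
    rng = random.Random(seed)
    verb = choose(rng, "went up", "rose")
    return f"The stock market {verb} by {change}%."

print(market_sentence(1.2, seed=0))
print(market_sentence(1.2, seed=1))
```

Seeding makes individual variants reproducible, but it does not change the underlying problem: the space of variants is far too large to inspect exhaustively.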
The other testing challenge I have frequently encountered is testing data-side (analytics) processing. It’s relatively straightforward for most testers to assess whether a generated text is well-written and easy to read, and indeed we can sometimes use spell, grammar, and style checkers to help with this task. But assessing whether the analytics is correct often requires a considerable amount of domain and subject-matter expertise, which many testers do not have.
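One mitigation is to test the data-side on its own, so that domain experts only need to check the computed messages rather than read generated prose. Below is a minimal pytest-style sketch; detect_trend() is a hypothetical analytics function invented purely for illustration.

```python
def detect_trend(prices):
    """Hypothetical analytics step: classify the overall movement of a
    price series and return a simple message-like dict (no wording yet)."""
    change = (prices[-1] - prices[0]) / prices[0] * 100
    return {"direction": "rise" if change >= 0 else "fall",
            "magnitude": round(abs(change), 1)}

def test_detect_trend_rise():
    # A domain expert can sign off on this expected output without
    # having to read or judge any generated prose.
    msg = detect_trend([7000.0, 7084.0])
    assert msg == {"direction": "rise", "magnitude": 1.2}
```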
Support
Software engineering tells us that most of the lifecycle cost of software is in support and maintenance. Although we don’t yet have decades of experience supporting production NLG systems, I strongly suspect that the same will be true for NLG. The world changes, and NLG systems need to evolve as the world changes. For example, medical NLG systems (such as Babytalk) need to evolve as new treatments and protocols become available, as regulations change, and as users switch to different providers for electronic health records and other IT systems.
I’m not aware of any research on supporting NLG systems, in either academic or commercial settings. But of course standard software engineering advice applies, such as making systems modular so that it’s relatively straightforward (for example) to slot in a data connector for a new electronic health record system.
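By way of illustration, here is a hedged sketch of the kind of modular connector interface I have in mind; the class and method names are made up, and a real connector would obviously be far more involved.

```python
from abc import ABC, abstractmethod

class PatientRecordConnector(ABC):
    """Abstract interface the NLG system depends on; switching EHR providers
    means writing a new connector, not changing the rest of the system."""

    @abstractmethod
    def fetch_observations(self, patient_id: str) -> list[dict]:
        """Return raw observations for one patient in a provider-neutral format."""

class VendorXConnector(PatientRecordConnector):
    """Illustrative connector for a hypothetical EHR provider."""

    def fetch_observations(self, patient_id: str) -> list[dict]:
        # A real connector would call the provider's API here;
        # this placeholder just returns a canned observation.
        return [{"patient": patient_id, "code": "heart_rate", "value": 142}]
```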
Final Thoughts
Very little has been written about how to do requirements analysis, design, testing, and support for NLG, which is a real shame. I’m happy that Arria is trying to create and pull together some material on this; I would encourage other people and companies to do likewise.