One of the biggest challenges in building a usable and useful NLG system is testing and quality assurance (QA). How do we ensure that an NLG system is robust and reliable? This is absolutely critical for commercial NLG systems, and definitely very desirable for many academic and research NLG systems.
A classic saying in military affairs is that “Amateurs worry about strategy, professionals worry about logistics”. I sometimes wonder if we should similarly say in the NLG world that “amateurs worry about algorithms, professionals worry about quality assurance”.
The BT-Nurse Experience
I cannot discuss Arria projects here, but a good example of the importance of testing and QA is Babytalk BT-Nurse. BT-Nurse generates shift handover reports for nurses working in a neonatal intensive care unit. That is, it produces a report which summarises what happened to a baby over a 12-hour shift (and indeed previously as well), in order to inform the nurse who will be looking after the baby in the next shift, and help her plan appropriate care.
We evaluated BT-Nurse by running it live in the neonatal ICU, and asking nurses to rate it and give comments. Somewhat embarrassingly, many of the negative comments were essentially due to software bugs which we had not fixed. These bugs significantly reduced nurse ratings and perceived usefulness of the system; they also of course would be completely unacceptable in a commercial version of BT-Nurse.
We had in fact spent a few months debugging BT-Nurse, which by academic standards is pretty good. But the debugging/testing was not as well organised and structured as it should have been. Also, we did the great majority of debugging/testing using a “test” data set; we didn't do nearly enough testing on the live deployed system.
So How do I Test an NLG System?
Of course there is a huge literature on software testing. However, I am not aware of any published paper explicitly on testing NLG systems, and I’m also not aware of many papers on quality assurance in artificial intelligence more generally. We want our AI systems to do amazing (magical) things; how do we test the magic from a quality assurance perspective?
Usually the practice in AI seems to be to run a system on a bunch of test cases and report what percentage of test cases the system got right. But this is not good enough for medical contexts such as BT-Nurse. One problem is that we need to understand and have confidence in worst-case behaviour as well as average-case behaviour; eg, will BT-Nurse ever say anything which could harm patient care (as opposed to simply not helping patient care)? Worst-case behaviour is also important if there is an adversary who is trying to defeat the AI system; for example, we don't want AI players in computer games which can be easily defeated by following a strategy which exploits their worst-case behaviour.
Another point is that performance on a test data set may not reflect performance in real-world deployment. This was certainly an issue in BT-Nurse, in part because the real-world hospital environment is always changing and evolving, so the environment at the time the system was used (eg, medication protocols) was not identical to the environment at the time the test data was gathered.
For these reasons, I am not a great fan of simply gathering performance statistics. I think good quality assurance for NLG (and perhaps for other types of AI) requires
- directly inspecting the code, for example by asking other developers to peer review code.
- extensive collection of unit and system tests, which test behaviour at different levels and also can be used for regression testing.
- automatic checking of all generated texts against quality criteria, for example checking for spelling mistakes and use of profanity.
- manual checking of NLG texts by domain experts as well as developers.
None of this is rocket science, but it is very important! It's also important that testing is done by dedicated testers, since it is often difficult for developers to properly test their own code. And last but not least, bugs should be recorded, prioritised, and tracked in a structured and organised fashion, probably using a bug-tracking tool.
It is very useful to get a second pair of eyes to look at any code which has been written, from the perspective of maintainability (can this code be maintained if the original developer leaves) as well as testing. The obvious way to do this is by asking developers to peer review code from other developers. Structured white-box testing can also be performed, although this can be a lot of work.
From an NLG perspective, it is worth keeping in mind that code inspection is difficult or impossible for NLG systems built using some machine learning techniques. For example, if we use deep learning or other neural network techniques, we can inspect the code used to learn and execute the neural network, but we can't inspect the network itself. So if code inspection is an important part of our QA process, we may wish to avoid such techniques.
Unit and System Tests
The heart of most testing activities is “black-box” tests, where we check that the expected output is produced from an input data set. We can do “unit tests” which check functionality at the module level; we can also do system tests which test the NLG system as a whole. Most such testing focuses on functionality, but we can certainly also test non-functional attributes such as speed and concurrency.
From an NLG perspective, one challenge in unit and system testing is that some NLG systems deliberately vary their output in order to make texts more interesting (this is sometimes called “elegant variation”). This means there may be many possible outputs for a given input. In principle, the best way to test such systems is to list all possible variations in the test case, but this may not always be possible.
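One way to handle elegant variation in a test is to assert that the output belongs to the set of acceptable variants, rather than comparing against a single gold string. The sketch below illustrates the idea with a hypothetical toy realiser (`realise_greeting` is invented for illustration; it is not part of any real NLG library):

```python
# A minimal sketch of a variation-tolerant unit test, assuming a
# hypothetical realiser that rotates through several phrasings.
def realise_greeting(name, seed=0):
    # Toy stand-in for an NLG module with elegant variation.
    templates = ["Hello, {}!", "Hi there, {}!", "Good day, {}!"]
    return templates[seed % len(templates)].format(name)

def test_greeting_variants():
    # List every acceptable variant instead of one expected string;
    # the test passes as long as the output is one of them.
    expected = {"Hello, Alice!", "Hi there, Alice!", "Good day, Alice!"}
    for seed in range(10):
        assert realise_greeting("Alice", seed) in expected

test_greeting_variants()
print("greeting tests passed")
```

The same pattern works at the system level, though enumerating every acceptable whole-document variant quickly becomes impractical, which is why automatic and manual checking (below) are also needed.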
You can see examples of NLG unit tests (for simplenlg) at the GitHub simplenlg test source.
Automatic Checking of Generated Texts
In many situations, it is useful to automatically check the output of an NLG system for problems, by running generated texts through a spelling, grammar, or style checker. This can be done during development, and also when the system is live (deployed). In some cases we may want to enforce application-specific constraints, such as ensuring that no profanity is used.
There are of course many proofing tools which can be used for this task, both commercial and open-source. Unfortunately none of these tools are 100% accurate and reliable. So such tools are best used to mark out potentially questionable texts which a person should examine.
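As a rough sketch of what such a checker might look like internally, the toy example below scans generated texts against a word list and a profanity blocklist, and flags anything suspicious for a person to review. The word lists here are illustrative only; a real checker would use a proper dictionary or an off-the-shelf proofing tool:

```python
import re

# Illustrative word lists -- a real system would use a full dictionary
# plus an application-specific blocklist.
LEXICON = {"the", "baby", "was", "stable", "overnight"}
BLOCKLIST = {"damn"}

def check_text(text):
    """Return a list of warnings; an empty list means the text passed."""
    warnings = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in BLOCKLIST:
            warnings.append("profanity: " + word)
        elif word not in LEXICON:
            warnings.append("unknown word (possible misspelling): " + word)
    return warnings

# Flagged texts are routed to a person for review, not rejected outright.
for text in ["The baby was stable overnight", "The baby was stble overnight"]:
    print(text, "->", check_text(text))
```

Note that the second text is flagged because "stble" is not in the lexicon; a human then decides whether it is a genuine misspelling or just a word the checker doesn't know.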
Manual Checking of Generated Texts
Ultimately, there is no substitute for manually checking NLG texts to see if they are accurate, readable, and useful. This is a time-consuming process, but it needs to be done.
Different aspects of texts may best be checked by different people. Proof-readers are very good at checking that texts are well-written and readable, but may not have enough domain knowledge to check accuracy and usefulness. Testers can check that texts are accurate, by comparing texts to source data; however they may not have the language skills for careful proof-reading, or the domain knowledge to assess usefulness. Subject matter experts are probably best placed to assess usefulness, but (at least in my experience) they are often pretty bad at checking language quality and readability, and may not have the time to carefully check accuracy.
As I hope the above makes clear, testing NLG systems is not a trivial activity; it requires a lot of time and organisation. Hopefully testing will become easier in the future, especially if special tools and techniques for NLG testing are developed. But regardless, testing and quality assurance are hugely important; a system which doesn't work is useless commercially, and may get poor evaluations in a research context.
From a practical perspective, we also need to consider the cost of testing when deciding how to build an NLG system. It is often worth spending a bit more on implementation and coding if this decreases testing and QA costs.