A few weeks ago Arun et al from Facebook presented a paper on “Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data” (ACL anthology) at Coling, where the paper won an Outstanding Paper Award. A really nice piece of work, which shows what’s involved in getting an end-to-end neural system to the point where it is ready to be used in a production environment! My understanding, by the way, is that “production ready” does not mean “currently deployed as part of a production system”, but regardless the system described in this paper seems much closer to real-world usage than previous end-to-end neural systems I have read about.
I’ve been cynical about end-to-end neural NLG in many of my previous blogs, but this paper shows that end-to-end neural NLG can work in real settings. And it’s really interesting to look at what was involved in creating such a system.
Domain and Task
The Arun et al system is designed to respond to four types of queries in a dialogue/chat context: Alarms, Times, Reminders, and Weather. A separate model was trained for each of these query types, and the authors state that it is “not trivial” to add new types of queries. The authors also state that expansion beyond English “may also prove difficult”, but do not expand upon this.
In other words, the authors did not dump a large corpus of dialogues into a neural network and magically come up with a dialogue system which could respond well to any query. Instead they built models for answering specific high-frequency questions. Which makes much more sense to me as an engineer!
I did not see any example outputs in the paper, but the authors do show some human-authored reference texts. These seem to be 10-20 words long, and in particular include discourse markers such as “but”. For example:
- Weather (reference text): Next weekend expect a low of 20 and a high of 45. It will be sunny on Saturday but it’ll rain on Sunday.
- Reminder (reference text): Yes, there are 3 reminders. The first two are, buy milk at 7 PM and tomorrow. There’s 1 other reminder.
If the reference texts are representative of the generated texts, then these are considerably more complicated than the simple descriptions in the E2E challenge, but are much simpler than the multi-paragraph data summary texts produced by Arria and its competitors.
I am of course very interested in the accuracy of generated texts! Arun et al make heavy use of the Tree Accuracy metric (paper). Craig Thomson and I proposed a classification of accuracy errors in a recent INLG paper; under this classification Tree Accuracy detects name and number errors and perhaps some word errors; it does not detect context errors or all word errors. I did not see any formal validation data about Tree Accuracy (i.e., recall/precision against gold-standard accuracy) in either paper, but the authors state that they believe Tree Accuracy has near-perfect precision at identifying accuracy errors but is less good at recall. Arun et al say they asked human subjects to evaluate accuracy on a subset of “challenging” texts, but they do not report accuracy data on its own; instead they just report overall human acceptability judgements, which include grammaticality as well as accuracy.
In any case, Tree Accuracy is of course used to develop the models. It’s also used at runtime to detect inaccurate texts from the neural NLG system. In such cases the system reverts to a “backup” template/rule-based NLG system, which presumably (??) is the one currently used in the deployed Facebook system.
In summary, accuracy is checked by an imperfect metric which focuses on name and number errors, and inaccurate texts are replaced with the output of a template/rule-based system. I don’t think this approach would work in the domains I work in (where high accuracy is of paramount importance), but it seems a sensible engineering approach for a consumer-oriented domain where accuracy is still very important but occasional lapses from perfect accuracy are acceptable.
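The runtime check-and-fallback idea can be sketched roughly as follows. This is purely my illustration, not the paper’s implementation: `neural_generate`, `template_generate`, and the crude slot-matching stand-in for Tree Accuracy are all hypothetical names and simplifications.

```python
# Illustrative sketch (my assumptions, not the paper's code): generate with the
# neural model, check the output with a Tree-Accuracy-style metric, and revert
# to a backup template system if the check fails.

def neural_generate(scenario):
    # Stand-in for the neural model; here it deliberately omits a slot,
    # to demonstrate the fallback path.
    return f"A low of {scenario['low']} is expected."

def template_generate(scenario):
    # Stand-in for the backup template/rule-based generator.
    return f"Expect a low of {scenario['low']} and a high of {scenario['high']}."

def tree_accurate(text, scenario):
    # Crude proxy for Tree Accuracy: every slot value must appear in the text.
    # (The real metric is much more sophisticated than this.)
    return all(str(value) in text for value in scenario.values())

def generate(scenario):
    text = neural_generate(scenario)
    if tree_accurate(text, scenario):
        return text
    return template_generate(scenario)  # revert to the backup system

print(generate({"low": 20, "high": 45}))
```

Here the neural output drops the “high” slot, so the check fails and the template text is returned instead.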
I have often complained in my blogs about low-quality datasets in neural NLG systems. Academics mostly don’t seem to care about this as long as they can report “leaderboard” scores, but people building real-world systems do need to take this seriously. Arun et al state that “models require much high-quality human-annotated data”, and describe a data collection process where
- user queries and scenarios are generated by engineers, in order to achieve “balanced” data collection (I assume this also helps to achieve good coverage of edge cases)
- annotated responses are created by human annotators following guidelines written by computational linguists.
- responses are verified by linguists to be grammatical and correct.
Arun et al grouped scenarios into buckets at different levels of granularity, and developed Dynamic Data Augmentation (DDA) techniques for augmenting human responses in a bucket. Together with sequence-level knowledge distillation, this enhanced the consistency and stability of the system, presumably by reducing the variability of the human training data.
In other words, we know that different people can respond in very different ways to the same query, so simply training on human data can lead to systems whose output texts change radically after a small change to the query. While some variation is good in NLG systems, Arun et al seem to believe that too much variation is bad (and can potentially lead to quality issues??), and so have taken steps to control variation. This is a really interesting point for me; we need a better understanding of the desirability of different levels of variation in different contexts.
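As a toy illustration of the bucketing idea behind this kind of data augmentation: group scenarios by their slot signature, then reuse one annotated (delexicalised) response across every scenario in the same bucket. All details here are my assumptions for illustration; the paper’s DDA method is considerably more sophisticated.

```python
# Toy sketch (my assumptions, not the paper's method): bucket scenarios by
# which slots they contain, then apply one human-written delexicalised
# response template to every scenario in its bucket.

from collections import defaultdict

def bucket_key(scenario):
    # Scenarios with the same set of slots land in the same bucket.
    return frozenset(scenario)

scenarios = [
    {"low": 20, "high": 45},
    {"low": 31, "high": 50},
    {"temp": 72},
]

buckets = defaultdict(list)
for s in scenarios:
    buckets[bucket_key(s)].append(s)

# One annotated response per bucket, written as a delexicalised template.
annotated = {
    frozenset({"low", "high"}): "Expect a low of {low} and a high of {high}.",
}

# Augment: fill the bucket's template with each scenario's slot values.
augmented = [
    annotated[key].format(**s)
    for key, members in buckets.items() if key in annotated
    for s in members
]
print(augmented)
```

This reduces variability across responses in a bucket, which (if my reading is right) is part of how consistency and stability were improved.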
In short, data collection was an engineering process with a lot of quality assurance; which makes sense to me as an engineer!
I was also very impressed by the frequent references in the paper to system attributes which are mostly ignored by academics, but which are really important in production systems, such as latency, model size and data requirements, development resources, and maintenance resources. They don’t go into detail (this is after all a paper for an academic conference), but these sorts of issues are of course extremely important when building real-world applications.
To me, the most important high-level message of this paper is that building a production end-to-end neural NLG system is an engineering task, which includes
- Finding a domain and use case where end-to-end neural NLG works and adds value (and where near-perfect accuracy is not required), and which provides significant business value to the organisation.
- Monitoring accuracy of generated texts at runtime, and reverting to a “back up” rule/template NLG system if there are doubts about accuracy (or other aspects of quality?).
- Putting a lot of effort into data collection, to ensure quality and coverage.
- Trying to understand how much variation is desirable.
- Seriously addressing engineering issues such as latency, model size, and maintainability.
It’s great to see a paper which focuses on these issues, and I hope to see more such papers in the future!