I am often asked about my views on the role of machine learning (ML) and statistical NLP more generally in the context of Natural Language Generation (NLG). I will get (some of) them off my chest here!
Some Things Really Annoy Me About ML and NLG
First off, there are some things which really annoy me about current work on ML in NLG.
Lack of awareness of previous (non-ML) work in NLG. For example, I was astonished at a paper presented at a major ACL conference in 2016, which described a neural network ML approach to generating point weather forecasts. This paper showed no awareness of the decades of work in the NLG community on generating point weather forecasts. The paper claimed its results were better than state of the art, but to me the forecasts looked considerably worse than what we were producing 15 years ago in the SumTime project (which of course the authors were not aware of).
Evaluation using BLEU and similar metrics. I have no problem with using BLEU to provide development feedback (which is what it was originally proposed for), but evaluating an NLG system using BLEU is meaningless: the correlation with human and task evaluations is very weak. I have written and spoken about this elsewhere, for example in a Computational Linguistics paper and an invited NAACL talk. In fairness, many ML-for-NLG papers do present human evaluations, but there are also many such papers which rely solely on BLEU.
Poor worst-case performance. My experience with ML is that it is brittle, ie ML systems do silly things in unusual cases. For example, for many years my applications for credit cards were usually rejected by the statistical/ML models used to approve such applicants, despite the fact that I was long-term employed in a stable and decently paying job, and had never missed a loan payment. I assume this is because in financial terms I am unusual (eg, I own neither a house nor a car), so the models didn't know what to do with me (rule-based systems also make mistakes, but these are straightforward to fix; in contrast, it's effectively impossible to fix undesirable behaviour in a “deep learning” system). My experience with ML systems for NLG is similar: even if they generate good texts in most cases, they occasionally generate bizarre and inappropriate texts (and the evaluations are always of average-case performance, not worst-case performance). This behaviour may be acceptable in some contexts (eg, image descriptions to support image search), but it is not appropriate for the NLG systems which summarise and explain data to help people make decisions, which is what I am interested in.
Lack of Corpora and Training Data. ML systems of course need training data, which for a data-to-text system means parallel text-data corpora. Unfortunately, for most of the applications I am interested in, such corpora and training data do not exist. I’m hearing a growing number of people make similar observations about ML in other areas of NLP and AI, in both the academic and commercial worlds. For example, at NAACL 2016, the other invited talk (not mine) was by Regina Barzilay, who described how she had to abandon corpus and ML techniques when trying to help a local hospital solve a real-world NLP problem, because there wasn’t enough training data for ML and statistical approaches.
Good Uses of ML in NLG
Now that I’ve gotten my “rants” off my chest, I’d like to say that I think there are many places where ML and statistical techniques can be really useful in NLG. The key is to put aside the view that a 100% ML approach is a “magic bullet” which will solve all of the (NLG) world’s problems, and instead regard ML/statistical techniques as useful tools for solving specific problems and making specific choices in NLG systems, in combination with non-ML techniques. This is especially true for linguistic “how to say” choices that affect readability (as opposed to “what to say” choices that affect content), since (A) we can use general corpora of English (or whatever language we are working on), so we don't need domain-specific parallel data-text corpora; and (B) in terms of worst-case performance, generating unreadable texts is less of a concern than generating incorrect and misleading texts.
To take a concrete example, consider the NLG task of deciding whether a or an should be used in front of a word (eg, an apple but a banana). Clearly we want to primarily take an ML/statistical/corpus approach, by counting how often a and an occur in front of a word in a corpus (eg, the relative frequency in the corpus of a apple and an apple). It would be madness to try to manually write rules to encode this information! On the other hand, there are some special cases, such as currencies (eg, an £80 meal) and quoted strings (eg, They hired an ‘engineer’ who was clueless), which are best handled by rules. Thus, statistical/corpus approaches should absolutely be used to make the a vs an decision, but best performance requires supplementing them with rules.
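This hybrid "counts plus rule overrides" idea can be sketched in a few lines of Python. Everything concrete here is an illustrative assumption on my part, not a reference implementation: the bigram counts are invented (a real system would derive them from a large general-English corpus), and the currency and quoted-string rules are deliberately crude.

```python
# Sketch of a hybrid a/an chooser: corpus counts first, rule-based
# overrides for special cases the corpus handles poorly.

import re

# Hypothetical ("a", word) / ("an", word) bigram counts; in practice
# these would come from a large general-English corpus.
BIGRAM_COUNTS = {
    ("a", "banana"): 950, ("an", "banana"): 2,
    ("a", "apple"): 3, ("an", "apple"): 1200,
    ("a", "hour"): 5, ("an", "hour"): 2100,
}

def choose_article(word: str) -> str:
    """Choose 'a' or 'an' before word, preferring corpus evidence."""
    # Rule: currencies are decided by how the number is pronounced,
    # eg "£80" is read "eighty pounds", so "an £80 meal".
    if word.startswith("£"):
        digits = word.lstrip("£")
        return "an" if re.match(r"^(8|11|18)", digits) else "a"
    # Rule: for quoted strings, decide on the inner word.
    if word.startswith(("'", '"', "\u2018", "\u201c")):
        word = word.strip("'\"\u2018\u2019\u201c\u201d")

    a_count = BIGRAM_COUNTS.get(("a", word), 0)
    an_count = BIGRAM_COUNTS.get(("an", word), 0)
    if a_count + an_count > 0:
        return "a" if a_count >= an_count else "an"
    # Back-off heuristic for unseen words: vowel first letter.
    return "an" if word[:1].lower() in "aeiou" else "a"

print(choose_article("apple"))        # an
print(choose_article("banana"))       # a
print(choose_article("£80"))          # an ("eighty pounds")
print(choose_article("'engineer'"))   # an (decided on the inner word)
```

Note that the corpus does the bulk of the work; the rules only fire for the special cases (currencies, quotes) where the surface token misleads the counts.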
This kind of approach makes sense for many other NLG linguistic decisions, including adjective ordering (eg, big red dog vs red big dog), synonym selection (eg, wind speed eased but not voting rates eased), and pronominalisation (eg, John went shopping vs He went shopping). And many other choices besides; this list is illustrative, not exhaustive! These tasks require a much more sophisticated approach than simply counting frequencies in a corpus (as with a vs an), but I think an ML/statistical approach makes a lot of sense, especially if augmented by rules.
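To give a flavour of the simplest of these, pairwise adjective ordering can also be driven by corpus counts. This is only a sketch under strong assumptions: the pair counts are invented, and a real system would need smoothing and a principled back-off for unseen pairs (here I just keep the input order).

```python
# Sketch of corpus-driven adjective ordering: prefer whichever order
# of a pair of adjectives occurs more often in a corpus.

# Hypothetical corpus counts of adjacent adjective pairs.
PAIR_COUNTS = {
    ("big", "red"): 480, ("red", "big"): 3,
    ("little", "old"): 210, ("old", "little"): 1,
}

def order_pair(adj1: str, adj2: str) -> tuple:
    """Return the two adjectives in their more frequent corpus order.

    Ties and unseen pairs fall back to the input order."""
    forward = PAIR_COUNTS.get((adj1, adj2), 0)
    backward = PAIR_COUNTS.get((adj2, adj1), 0)
    return (adj1, adj2) if forward >= backward else (adj2, adj1)

print(order_pair("red", "big"))     # ('big', 'red')
print(order_pair("little", "old"))  # ('little', 'old')
```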
There are some decisions which statistical/ML approaches should not be used for, either because rules are well understood (eg, adjective before head noun in English NP), or because the decision depends on a “house style”, and needs to be explicitly parametrisable (such as quote transposition: She said “hello.” vs She said “hello”.). So ML/statistical techniques are not a panacea. But they are certainly useful in many places!
I also think there is huge potential for using statistical/ML techniques for testing and quality assurance (which is really important in commercial NLG work). For example, we can check the correctness of inflected forms in a lexicon by seeing if they occur in a corpus, and (more ambitiously) we may be able to use language models to detect “unlikely” sentences which should be reviewed by a human.
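The lexicon check is the easier half of this, and can be sketched as below, assuming we have a word-frequency table built from a corpus. The tiny corpus and lexicon are invented for illustration (the deliberately wrong plural mouses shows what gets flagged); a real check would use a large corpus and a frequency threshold tuned to it.

```python
# Sketch of corpus-based QA for a lexicon of inflected forms: flag any
# form that never appears in the corpus, as it may be mis-generated.

from collections import Counter

# Toy stand-in for a large corpus.
corpus_text = ("the wind eased and the winds were easing while a child "
               "and two children tried to ease a mouse out of the house")
corpus_counts = Counter(corpus_text.lower().split())

# Toy lexicon mapping lemmas to their inflected forms;
# "mouses" is a deliberately wrong plural.
lexicon = {
    "child": ["child", "children"],
    "ease": ["ease", "eased", "easing"],
    "wind": ["wind", "winds"],
    "mouse": ["mouse", "mouses"],
}

def suspicious_forms(lexicon, corpus_counts, min_count=1):
    """Return (lemma, form) pairs seen fewer than min_count times."""
    flagged = []
    for lemma, forms in lexicon.items():
        for form in forms:
            if corpus_counts[form] < min_count:
                flagged.append((lemma, form))
    return flagged

print(suspicious_forms(lexicon, corpus_counts))
# [('mouse', 'mouses')]
```

The more ambitious language-model check would work the same way at the sentence level: score each generated sentence and queue the low-probability ones for human review.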
I think ML and statistical techniques can be very useful in NLG, but they are not a panacea. They are the best approach for solving many NLG problems, especially if we add rules when appropriate (ie, don't insist on 100% pure ML). But in many contexts they are inappropriate, and other techniques make more sense.
In other words, I take a very “pragmatic” (engineering?) approach to ML/statistical techniques, in NLP and AI generally as well as in NLG. Let’s choose the best tool for the job based on evidence of how well our tools work in different contexts. And sometimes the best tool is ML/statistical techniques, but sometimes a different tool is best.