There is of course a lot of energy and enthusiasm for using machine learning in NLP and NLG, but it seems almost entirely focused on building “black boxes” with deep learning in order to perform specific tasks or applications. There is another way to use ML, corpora, and empirical methods, however: to gain insights into NLP/NLG problems and tasks. I personally am much more excited by this use of ML, and I wish more people would take this approach.
To get a bit of perspective, let’s step back from NLP and look at how other fields use big data and data science. In medicine, for example, there is certainly interest in using deep learning to build automatic tools that analyse case notes or CAT scans. But there is even more interest in analysing clinical data in order to understand which interventions are successful, track the spread of new diseases, identify poorly performing hospitals, etc. Similarly, in finance there is interest in using deep learning “black boxes” in automatic trading, but probably more interest in using financial data to identify promising product lines, stock market trends, etc. And in engineering there is certainly interest in building neural control systems for complex machinery, but the bigger goal is analysing data in order to identify failing components, learn how best to operate machines, and improve designs.
In other words, in most of the scientific and technology worlds, I suspect there is some interest in automating low-level functions using neural “black boxes”, but most of the excitement and energy is around analysing data in order to provide insights which support decision making by doctors, finance experts, engineers, etc.
So what I would like to see is more use of corpora, empirical methods, data, and machine learning to come up with insights which guide the building of NLG systems, and less fixation on building neural “black boxes”.
Example: time phrases in weather forecasts
I’ll give a few examples from my own work.
The first is an analysis we did 15 years ago of word usage in weather forecasts, especially time phrases. This is described in Reiter et al 2005. Basically we aligned human-written weather forecasts with the underlying forecast data, and tried to learn classifiers (using decision trees) which predicted which time phrase, verb, etc. would be used. When we inspected the classifiers, we discovered (this is the insight!) that the choice of time phrase, etc., was highly idiosyncratic: different forecasters preferred different words. This was a very useful insight, which we discussed with forecast readers; they told us that they disliked idiosyncratic variation because it made forecasts difficult to interpret.
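To make the idea concrete, here is a minimal sketch of this kind of corpus analysis. The records below are invented toy data (the real corpus and feature set in Reiter et al 2005 were much richer); the point is that tabulating phrase choice per author for the same underlying data exposes idiosyncratic preferences:

```python
from collections import Counter, defaultdict

# Hypothetical aligned records: (forecaster, data time slot, time phrase used).
# Invented for illustration; not the actual corpus from Reiter et al 2005.
records = [
    ("F1", "00-06", "by evening"), ("F1", "00-06", "by evening"),
    ("F1", "00-06", "by evening"), ("F1", "00-06", "later"),
    ("F2", "00-06", "by late evening"), ("F2", "00-06", "by late evening"),
    ("F2", "00-06", "later"), ("F2", "00-06", "by late evening"),
]

# Phrase distribution per forecaster, for the same underlying time slot.
by_forecaster = defaultdict(Counter)
for forecaster, slot, phrase in records:
    by_forecaster[forecaster][phrase] += 1

# Different forecasters prefer different phrases for the same data:
for forecaster, counts in sorted(by_forecaster.items()):
    preferred, _ = counts.most_common(1)[0]
    print(forecaster, "prefers", repr(preferred), dict(counts))
```

In the original work this tabulation was done by training decision trees and inspecting them, but the outcome was the same kind of per-author usage pattern.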
So based on this insight we developed a technique for choosing time phrases (and other words). Essentially we combined probability that a time phrase would be used for the target time (extracted from a corpus) with the ambiguity of a time phrase, which basically meant the probability that the time phrase would be used for a *different* time. We ran this offline and came up with preferred time phrases. We checked these with domain experts, made a few tweaks based on their advice, and then deployed a system based on these time phrases.
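The trade-off described above can be sketched as follows. The counts and the simple linear scoring rule are assumptions for illustration (the actual combination used in the deployed system may differ); the idea is just to reward phrases that are frequent for the target time and penalise phrases that are also used for other times:

```python
from collections import Counter

# Hypothetical corpus counts: how often each time phrase was used for each
# data time (hour of day). Real counts came from the aligned forecast corpus.
usage = {
    "by evening":      Counter({18: 8, 21: 4}),
    "by late evening": Counter({21: 6, 18: 1}),
    "later":           Counter({18: 3, 21: 3, 12: 2}),
}

def score(phrase, target_time):
    """Trade off how likely the phrase is for the target time against its
    ambiguity (how often it is used for *other* times)."""
    counts = usage[phrase]
    total = sum(counts.values())
    p_target = counts[target_time] / total  # P(phrase used for target time)
    ambiguity = 1.0 - p_target              # P(phrase used for a different time)
    return p_target - ambiguity             # simple linear trade-off (assumed)

# Run offline over all candidate phrases to pick a preferred phrase per time.
best = max(usage, key=lambda ph: score(ph, 21))
print(best)  # "by late evening": frequent for 21:00 and rarely used elsewhere
```

Running this offline over the whole corpus yields a fixed table of preferred phrases, which is what was checked with the domain experts before deployment.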
Users loved this; in fact, they preferred our computer-generated forecasts over human-written ones in many contexts, because of the consistency and clarity of our time phrases and other word choices.
In short, we did a lot of analysis, including building decision trees, in order to understand how words were used. The most valuable outcome of this analysis was not the classifiers, but rather the insights on word usage. These insights helped us build a better-than-human NLG system.
Example: frame of reference in weather forecasts
Another example is Ross Turner’s work on generating geographical descriptors of regions in weather forecasts, based on frames of reference (Turner et al 2009). For example, Ross wanted to generate descriptions such as “rain expected above 100m” (altitude frame of reference) or “rain expected in coastal areas” (coastal proximity frame of reference).
Anyways, Ross discovered, first from corpora and then from human experiments, that some accurate descriptions were rare in corpora and disliked by users. For example:
- Rain expected above 100m (common)
- Rain expected in urban areas (rare)
- Employment rose above 100m (rare)
- Employment rose in urban areas (common)
In other words, even if “urban areas” was an excellent description of where rain was expected, this phrase was rarely used in corpus texts and was disliked by readers.
Ross investigated this, and concluded that people expected frames of reference to make sense causally. In other words, we can see a causal link between altitude and probability of rain, so we accept descriptions such as “Rain expected above 100m”. But we struggle to see a causal link between altitude and employment, so we don’t like descriptions such as “Employment rose above 100m”, even if “above 100m” is an accurate description of the regions where employment rose.
So again, perhaps the most useful result of Ross’s corpus analysis (which also involved building simple classifiers) was not the classifiers themselves, but rather the above insight into how people use referring expressions.
Example: choice of trend verb
A final (and recent!) paper I’d also like to mention is Chen and Yao 2019, who worked on choosing trend verbs such as “climb” vs “soar” based on the size of the underlying change (eg, in sentences such as “Microsoft’s profit climbed 28%”). They concluded that, if we look at this task holistically from an appropriateness perspective instead of focusing on a single metric, it is very difficult to do well regardless of the ML technology used (neural, Bayesian, decision trees, etc). This is partially because there is a lot of overlap in the usage of different trend verbs; eg, at least 10 different verbs are appropriate if the underlying change is 20%.
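A toy sketch of why this task is hard: if we record the range of change magnitudes over which each verb is observed in a corpus, many verbs turn out to be plausible for any given change. The verbs and ranges below are invented for illustration (the real distributions in Chen and Yao 2019 overlap even more heavily, with 10+ verbs plausible at 20%):

```python
# Hypothetical ranges of % change over which each trend verb is observed.
# Invented numbers; not from the Chen and Yao 2019 corpus.
verb_ranges = {
    "edge up": (0, 5),
    "rise":    (0, 40),
    "climb":   (5, 50),
    "jump":    (10, 80),
    "surge":   (15, 100),
    "soar":    (25, 200),
}

def plausible_verbs(change):
    """Return the verbs whose observed magnitude range covers this change."""
    return [v for v, (lo, hi) in verb_ranges.items() if lo <= change <= hi]

print(plausible_verbs(20.0))  # several verbs fit a 20% change
```

When several verbs are all acceptable for the same input, no classifier can reliably reproduce the single verb a particular author happened to choose, which is exactly the insight about the task that single-metric leaderboard comparisons obscure.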
In other words, instead of focusing on showing how model XXX delivered a 1% improvement over baseline YYY on dataset ZZZ, Chen and Yao tried to give insights into the underlying issues and difficulties of the task. I personally found their contribution to be much more useful than other papers I have read on this topic, and I wish more papers took this perspective.
I strongly encourage researchers to use ML to provide insights about NLG problems and related linguistic issues. I think this is really useful both for scientific progress and for developing real-world solutions. Indeed, my personal opinion is that such insights are a lot more useful than papers that show how tweaking a neural ML model gives a 1% increase in the state of the art on some very specific task, metric, and data set.