At our CLAN (research group) meeting on 14 March, we started discussing Gatt and Krahmer’s review paper on NLG, starting with Section 2 on “NLG Tasks”. It was a nice session, where most of the PhD students and postdocs pitched in when discussing areas they were working on. Several MSc students also came along and gave their perspective.
Anyways, Gatt & Krahmer talk a fair bit about the usage of data-based and ML techniques in most NLG tasks, with the exception of lexical choice (lexicalisation), where they don’t refer to any ML work. I made a throw-away comment during the discussion that this was a real shame, because I think that lexical choice is the NLG task that most needs data-based and machine-learning techniques; but it’s also the case that lexical choice is a difficult area from an ML perspective, because it’s not easy to get data, which may be the reason why people have not pushed ML and data-based techniques. I think there is some substance here, and thought I would expand on this.
Why lexical choice needs data-based ML techniques
Lexical choice is the problem of choosing contextually-appropriate words to express non-linguistic data; it is related to language grounding. Sometimes it’s straightforward, such as expressing the KB concept DOG as the lexeme “dog”. But often it’s much more complex and contextual.
For example, the usage of vague terms such as “heavy” clearly depends on context; eg, a heavy book weighs much less than a lightweight car. And “white” skin has a pink RGB value, while “red” hair has a very different RGB value from a “red” crayon. During the Babytalk project, I remember being surprised when I showed a doctor sensor data from a baby, and he described the baby as “blue” even though he couldn’t see it; to him, “blue” was more a description of blood oxygen levels (which he could see in the data) than of the actual colour of the baby.
Geographic descriptors are also less straightforward than people think. For example, an acquaintance from Trois Rivieres insisted that this town was in “Northern” Quebec, despite the fact that a quick look at a map of Quebec shows that the city is well south of the geographic centre of Quebec. Perhaps “Northern” is relative to the population-weighted centre rather than the geographic centre?
Another issue is that people map data to words in different ways. Many years ago, while working on the SumTime weather-forecast generator, I looked at how clock times such as 1800 were lexicalised as phrases such as “midnight” or “by evening”. I observed that
- People associated different core meanings with words. Eg, some people thought “by evening” usually meant 1800, others thought it meant 2100, and still others thought it meant 0000.
- A minority of people thought the meaning of “by evening” depended on sunset time and hence on geography and time of year.
- A minority of people thought the meaning of “by evening” depended on when the end-of-day main meal was eaten, and hence on local culture.
I also saw differences in the usage of near-synonyms in weather forecasts. For example, diminishing wind speeds can be described as “decreasing”, “falling”, or “easing”. One forecaster said he would use “easing” when wind speed was low, whilst another said he would use “easing” when wind speed was high.
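These individual differences can be made concrete with a toy model. In the sketch below, each forecaster gets their own rule for lexicalising a diminishing wind speed; the forecaster labels and the speed thresholds are entirely invented for illustration, not taken from the SumTime data:

```python
# Toy illustration of individual variation in near-synonym choice.
# Forecaster "A" uses "easing" only for low wind speeds, while
# forecaster "B" uses it only for high speeds -- mirroring the
# disagreement described in the text. Thresholds are invented.

def lexicalise_diminishing_wind(speed_knots, forecaster):
    """Choose a verb for a falling wind speed, per forecaster."""
    if forecaster == "A":
        # "A" reserves "easing" for light winds
        return "easing" if speed_knots < 15 else "decreasing"
    elif forecaster == "B":
        # "B" reserves "easing" for strong winds
        return "easing" if speed_knots >= 25 else "falling"
    else:
        return "decreasing"  # neutral default

print(lexicalise_diminishing_wind(10, "A"))  # easing
print(lexicalise_diminishing_wind(30, "A"))  # decreasing
print(lexicalise_diminishing_wind(30, "B"))  # easing
```

A real model would of course need to learn such per-speaker rules from data rather than hard-code them, which is exactly why aligned corpora matter.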
Anyways, the huge variation in word usage means that data-based methods are probably the best way to address lexical choice. Ie, let’s gather tons of data on how people use words when writing or speaking, and how people understand words when reading or listening, and use our favourite modelling and learning techniques to empirically build a model of lexical choice, including a component for individual variation! This would be really useful, and might also lead to interesting insights about language.
Why applying ML to lexical choice is hard
Despite the above need, there has been disappointingly little published on using data-based techniques for lexical choice. I worked on this 15 years ago in SumTime, mentioned above, when we parsed a corpus of weather forecasts into descriptive phrases, aligned the phrases with numeric weather data, and then used decision trees to build lexical-choice models. We supplemented the quantitative corpus work with qualitative discussions with both forecast readers and writers. The end result was not only theoretical insights, but also a weather-forecast generator whose texts were sometimes rated (by readers) as better than human-written texts, in part because of superior (compared to human writers) lexical choice.
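The align-then-learn pipeline described above can be sketched in miniature. Below, a handful of hypothetical (verb, wind-speed) pairs stand in for the aligned corpus, and a one-split decision stump (the simplest possible decision tree) stands in for the decision-tree learner; all data and the resulting threshold are invented for illustration:

```python
# Toy version of a SumTime-style pipeline: (phrase, wind_speed) pairs,
# as if extracted from an aligned data-text corpus, and a learner that
# finds the speed threshold best separating the two verbs.

aligned = [
    ("easing", 8), ("easing", 10), ("easing", 12),
    ("decreasing", 22), ("decreasing", 28), ("decreasing", 35),
]

def learn_stump(examples):
    """Return the midpoint threshold with fewest misclassifications."""
    examples = sorted(examples, key=lambda e: e[1])
    best = None  # (threshold, error_count)
    for i in range(1, len(examples)):
        t = (examples[i - 1][1] + examples[i][1]) / 2
        errors = sum(1 for word, speed in examples
                     if (word == "easing") != (speed < t))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

threshold = learn_stump(aligned)

def choose_verb(speed):
    """Lexicalise a diminishing wind speed using the learned rule."""
    return "easing" if speed < threshold else "decreasing"

print(threshold)        # 17.0
print(choose_verb(9))   # easing
print(choose_verb(30))  # decreasing
```

The real work used proper decision trees over multiple contextual features, but the shape of the problem is the same: the learning step is easy once the aligned examples exist.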
After a long hiatus, people seem to be interested in this area again. For example, I was very happy to see two papers in this area in INLG 2016, on choosing colour words and trend verbs. But still, this area seems to be a minority interest amongst people working on data-based NLG.
Maybe I am being cynical, but I sometimes wonder if this is because obtaining and aligning data and corpora for lexical choice is hard work. It’s not easy to find large corpora which can be aligned to underlying data, and it’s not easy to get the contextual information (features) which we know is really important in lexical choice. And even if you do get an aligned data-text corpus with contextual features, the result is messy, in part because of the above-mentioned individual differences. So there is a lot of “blood, sweat, and tears” on the resource-building side before you can start experimenting with the latest cool machine-learning algorithms. And it sometimes seems that NLP researchers (and indeed NLP conferences and journals) are much more interested in playing with and enhancing cool algorithms than in the gruntwork of building high-quality resources; as is perhaps shown by the popularity of the completely inappropriate Weathergov corpus amongst researchers applying ML to NLG.