One of the topics which most fascinates me is language grounding; in other words, how does language relate to the real world? This is perhaps most easily seen in an NLG context in the problem of choosing words to express data. For example, how do we map visual data onto colour terms such as “pink”, how do we map clock times onto time expressions such as “late evening”, and how do we map geographic regions onto descriptors such as “central Scotland”?
I’ve been interested in this since doing my PhD in the 1980s, but for a long time it seemed that few other people in NLP cared about this topic. In 2003 I organised an NAACL workshop on Learning Word Meaning from Non-Linguistic Data, and although we got a number of computer vision researchers, almost no NLP people turned up. Which is pretty disheartening for an event at a major NLP conference… Several people basically told me that while this was a scientifically important topic, they would not attend because other topics/workshops were more trendy and/or relevant to getting funding. I think this marked the moment when I started becoming disillusioned with ACL conferences. But anyways, the workshop did lead to a special issue of Artificial Intelligence journal on Connecting Language to the World, which was a nice outcome.
Recently, though, language grounding research seems to be becoming more popular and indeed trendy, and I am seeing many papers on this topic, which is nice. I think a lot of this comes from work on generating image descriptions (which was also a driver back in 2003), and I was very happy to see one of my former PhD students, Meg Mitchell, take a leading role in this area.
But anyways, enough ancient history. What are the challenges in language grounding?
Context, Context, and Context
I strongly believe that the right word to express data depends on context. There are a few cases where the choice of word depends purely on the data being communicated. For example, the temperature 0 K (−273.15 °C) can always be communicated as “absolute zero”. But this is unusual.
In the vast majority of cases, the data-to-word mapping depends on context. For example, sticking to temperatures, let’s look at “hot”. Its meaning depends on (amongst other things):
- other data (in addition to temperature). For example, 30C may be hot if humidity is high, but not if humidity is low
- expectations and interpretation. For example, 30C may be hot in Antarctica, but not in the Sahara desert.
- individual speakers. Even in the same location, a Scottish person may call 30C hot, while a Vietnamese person may not.
- discourse context. If a previous speaker has used hot to mean 30C, other speakers may align to this usage and do likewise
This is pretty typical; the mapping of data to words usually depends in some way on other data, expectations and interpretation, individual speakers, and discourse context.
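To make this concrete, here is a toy sketch of what a context-sensitive data-to-word mapping might look like. Everything here is invented for illustration (the thresholds, the humidity adjustment, the function itself); it is just meant to show the four contextual factors above entering the decision, not to be a real model:

```python
# Toy sketch: choosing "hot" depends on context, not just the number.
# All thresholds and adjustments below are invented for illustration.

def describe_temperature(temp_c, humidity=0.5, expected_c=20.0,
                         speaker_threshold_c=28.0, prior_usage=None):
    """Return a word for temp_c, given some contextual factors."""
    # Discourse context: align with a previous speaker's usage if it
    # covers this temperature.
    if prior_usage and prior_usage.get("hot", float("inf")) <= temp_c:
        return "hot"
    # Other data: high humidity makes the same temperature feel hotter.
    felt = temp_c + (5.0 if humidity > 0.7 else 0.0)
    # Expectations and individual speakers: "hot" is relative both to
    # the local norm and to this speaker's own threshold.
    if felt >= speaker_threshold_c and felt > expected_c + 5.0:
        return "hot"
    return "warm" if felt > expected_c else "mild"

print(describe_temperature(30, humidity=0.8))            # humid day
print(describe_temperature(30, expected_c=35.0))         # Sahara-like norm
print(describe_temperature(25, prior_usage={"hot": 25})) # aligning
```

The same 30C comes out differently depending on which contextual parameters we pass in, which is exactly the point.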
Words reflect how people perceive the world, and usually the human experience cannot be captured by a single sensor reading. Perceived temperature (“feels like” in some weather forecasts) depends on humidity and wind speed as well as actual temperature; perceived colour depends on visual context (lighting conditions, nearby objects) as well as the RGB value of a set of pixels; and perceived time in expressions such as “by evening” can depend on season and sunset time as well as the clock time being communicated.
So a key issue in choosing words to express data is understanding what words actually mean to people (eg, perceived “feels like” temperature), and building models which derive this meaning from sensor data such as thermometer readings. There is of course a lot of research in psychology and psychophysics about how people perceive the world; we should incorporate this into our data-to-word models.
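As a small example of a perceived-value model sitting between the sensors and the lexical choice, here is the 2001 NWS/Environment Canada wind chill formula, which computes a “feels like” temperature from actual temperature and wind speed. The formula is real; the verbal thresholds (`describe_cold`) are invented for illustration:

```python
# "Feels like" temperature from raw sensor data, then word choice on the
# perceived value. Wind chill formula: 2001 NWS/Environment Canada index.

def wind_chill_f(temp_f, wind_mph):
    """NWS wind chill index; defined for temp <= 50F and wind >= 3 mph."""
    if temp_f > 50 or wind_mph < 3:
        return temp_f  # outside the formula's range; use actual temperature
    return (35.74 + 0.6215 * temp_f
            - 35.75 * wind_mph ** 0.16
            + 0.4275 * temp_f * wind_mph ** 0.16)

def describe_cold(temp_f, wind_mph):
    # Thresholds invented for illustration only.
    feels_like = wind_chill_f(temp_f, wind_mph)
    if feels_like <= 0:
        return "bitterly cold"
    if feels_like <= 32:
        return "cold"
    return "chilly"

# 35F in calm air vs 35F in a 30 mph wind get different words,
# because the perceived temperature differs.
print(describe_cold(35, 2))
print(describe_cold(35, 30))
```

The key design point is that the word is chosen from the perceived value, not from the raw thermometer reading.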
Expectations and Interpretation
Many (most?) words communicate interpretations of data. The simplest example is comparing a number against an expected value. Eg, 2m is tall for a person but short for an elephant; and a resting heart rate of 100bpm is high for an adult but low for an infant.
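In code, this kind of interpretation-relative word choice is essentially a lookup against a reference class. The ranges below are purely illustrative (do not treat them as clinical guidance); the point is that the same number maps to different words for different reference classes:

```python
# Interpretation-relative word choice: the same number gets different
# words depending on the reference class. Ranges are illustrative only.

NORMAL_RESTING_HR = {      # (low, high) in bpm, illustrative
    "adult": (60, 100),
    "infant": (100, 160),
}

def describe_heart_rate(bpm, patient_type):
    low, high = NORMAL_RESTING_HR[patient_type]
    if bpm <= low:
        return "low"
    if bpm >= high:
        return "high"
    return "normal"

print(describe_heart_rate(100, "adult"))   # at the top of the adult range
print(describe_heart_rate(100, "infant"))  # at the bottom of the infant range
```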
There are more complex cases. For example, in the Babytalk project we had to detect and describe bradycardias. A bradycardia is essentially a period where the heart rate is worryingly low from a clinical perspective. So we can only use this word to describe heart rate data if we have a model which tells us which heart rates are clinically worrying for particular patients.
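A hypothetical sketch of the detection step (not Babytalk’s actual algorithm, and the threshold and minimum duration are invented): scan the heart-rate signal for sustained runs below a patient-specific “clinically worrying” threshold, since only those runs license the word “bradycardia”:

```python
# Hypothetical sketch: find sustained runs of heart rate below a
# patient-specific threshold. Not Babytalk's actual algorithm; the
# threshold and minimum duration are invented for illustration.

def find_bradycardias(heart_rates, threshold_bpm, min_samples=3):
    """Return (start, end) index pairs of runs below threshold_bpm."""
    episodes, start = [], None
    for i, hr in enumerate(heart_rates):
        if hr < threshold_bpm:
            if start is None:
                start = i          # run begins
        elif start is not None:
            if i - start >= min_samples:
                episodes.append((start, i))  # sustained run ends
            start = None           # brief dips are discarded
    if start is not None and len(heart_rates) - start >= min_samples:
        episodes.append((start, len(heart_rates)))
    return episodes

hr = [150, 148, 90, 85, 88, 145, 150, 80, 150]
print(find_bradycardias(hr, threshold_bpm=100))  # only the sustained dip
```

Note that `threshold_bpm` is exactly the piece of clinical knowledge the text above is talking about: without a model of what is worrying for this patient, the code cannot choose the word.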
I remember one time in Babytalk when I asked a doctor to describe sensor data (heart rate, blood oxygen, etc) and he said “the baby is blue”. I asked how we could say the baby is blue when he couldn’t see the baby, and he responded that “blue” in this context was not a colour, but rather an interpretation of worryingly low blood oxygen level.
We differ in how we map data onto words. For example, I once told my daughter to wear her pink dress, and she responded that she didn’t have a pink dress. When I pointed out the one I meant, she told me that it was purple, not pink.
This kind of thing is very common. In the SumTime project, we did a lot of investigation into this question, for example looking at how “by evening” was used differently by different people. It turned out that even ignoring sunset time issues, some people thought “by evening” meant 6PM, some thought it meant 9PM, and others thought it meant midnight. I wrote a paper about this (Human Variation and Lexical Choice), which I recommend to anyone who is interested in this issue.
There are also differences in how different types of people use words, and Alejandro Ramos Soto, who is doing a postdoc under me, is investigating this in weather forecasts.
Last but not least, usage of words depends on discourse context, ie what has been said previously in the dialogue. This is because human speakers align with each other in a dialogue. So if I am talking to you and describe 25C as “hot”, you may do likewise even if you normally do not consider 25C to be hot, because you are aligning with my usage.
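A toy model of alignment might track what the dialogue partner has called “hot” and let that override the speaker’s own threshold. Everything here (the class, the alignment rule) is invented for illustration:

```python
# Toy sketch of lexical alignment: prefer words the dialogue partner
# has already used for similar values, overriding one's own default.

class AlignedSpeaker:
    def __init__(self, own_hot_threshold_c=28.0):
        self.own_hot_threshold_c = own_hot_threshold_c
        self.partner_hot_uses = []  # temperatures the partner called "hot"

    def hear(self, word, temp_c):
        """Record the partner's usage of a word for a temperature."""
        if word == "hot":
            self.partner_hot_uses.append(temp_c)

    def choose_word(self, temp_c):
        # Alignment: if the partner called an equal-or-lower temperature
        # "hot", do likewise, even below our own threshold.
        if any(temp_c >= t for t in self.partner_hot_uses):
            return "hot"
        return "hot" if temp_c >= self.own_hot_threshold_c else "warm"

s = AlignedSpeaker(own_hot_threshold_c=28.0)
print(s.choose_word(25))   # below own threshold, no alignment yet
s.hear("hot", 25)          # partner describes 25C as "hot"
print(s.choose_word(25))   # now aligns with the partner's usage
```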
As I said above, I think the problem of choosing words to express data is highly dependent on context. One thing that worries me a bit is that a lot of the current research in language grounding focuses on using deep learning (and other machine learning) to learn data-to-word models from corpora, in a context-independent way. I am concerned that this will lead to systems which do well under some artificial evaluation metric but are useless in practice, because they ignore context. But perhaps I am wrong about this, let’s see what happens.