I had an academic visitor last week, whom I hadn’t seen in 15 years; he had worked in NLG 20 years ago and then moved to other fields. After some general catchup, my colleague asked me what NLG research I was working on and hoped to work on. In other words, what did I personally find really exciting and valuable in NLG research in 2017? A good question; I summarise my response below. Note that I am focusing on my academic interest and research at the University of Aberdeen. I cannot say much about my work at Arria in a public forum, because of commercial confidentiality. However, most of these research topics are also commercially relevant, as described below.
Please note that this is a personal research statement, about topics which I personally find really exciting and want to work on! Other people will undoubtedly find other topics more exciting, which is great; the NLG field will make more progress if its researchers pursue a variety of interests, instead of all jumping on the same bandwagon.
Text and Graphics
If we want to communicate information to people, when is it best to use words, and when is it better to use a visualisation? How can words and visualisations be combined to get the best of both worlds? How should users interact with a multimodal text+graphics system?
This is a really fascinating academic and intellectual question, which raises deep questions about when language is the best way to communicate, and when it isn’t. It’s also a very important question in applied/commercial NLG. As noted above, I cannot say much about my Arria work here, but it is a matter of public record that I have an Arria patent on “Method and apparatus for annotating a graphical output”.
NLG researchers have been working on this on-and-off for at least 30 years. I myself wrote my first paper on text/graphics back in 1990, and this blog includes an entry on “Text or Graphics?”. However, in retrospect the 1990s research on text/graphics lacked solid empirical/experimental foundations. There were lots of interesting theoretical insights (hopefully including my 1990 paper) and impressive-looking systems, but not enough careful evaluation and experimentation. Fortunately, recent work in this area is much more solid from this perspective, such as Gkatzia et al’s work on the effectiveness of textual forecasts vs weather graphics for communicating weather information. I think rigorous empirical/experimental work is absolutely the way to make progress on this topic, and I hope to start working on text/graphics in an academic (as well as commercial) context soon.
In a lot of cases, the best way to communicate with people linguistically is to tell them stories. Many people believe that language evolved as a storytelling mechanism, and certainly our brains seem “wired” to understand and respond to stories.
Narrative again is both a really interesting intellectual and academic challenge, and very important commercially. On the academic side, a huge amount is known about storytelling in psychology, linguistics, literary theory, and education, but little of this has been utilised by the NLG community. On the commercial side, Arria tells clients that “we help your data tell its story”, and other NLG companies similarly emphasise that they produce stories about data.
Despite the intellectual and commercial importance of storytelling, there has not been a huge amount of work on this in the academic NLG community. There has been work in the creativity community on producing fictional stories, but it’s not straightforward to translate this into the NLG task of producing stories about real-world data. But it’s exciting that we now have workshops that bring together NLG and creativity research; that’s definitely a step forward. I myself first got involved with narrative in the Babytalk project (Reiter et al 2008, McKinlay et al 2010), and am now looking for a PhD student to work on Advanced Data Storytelling.
I’ve always been fascinated by the relationship between language and the world. In particular, what do words “mean” in the context of real-world data? For example, what shapes can be described as spikes, what RGB values can be described as red, what times can be described as evening, and what motions can be described as drifting? This is surely one of the most fundamental questions about language.
The link between language and the real world is obviously hugely important from a philosophical and intellectual perspective. But although there has been a lot of philosophical discussion about this topic, it’s only recently that people in the NLP community have started exploring this issue from a more empirical and experimental perspective. This work is partially driven by increasing commercial interest, especially in automatic image captioning.
I worked on this area in the early 2000s, which culminated in an AIJ special issue on Connecting Language to the World, which included a good (for the time) overview paper and a paper on our work on time expressions in weather forecasts. More recently, work on grounding language has been dominated by machine learning approaches, especially for image captioning. This leaves me with mixed feelings: on the one hand I think this is a good application of ML from a pragmatic perspective, but on the other hand I find it difficult to derive intellectual insights on language grounding from ML work. In recent years I have been working with colleagues to explore this issue from a fuzzy logic perspective, in a way which still provides intellectual insights (eg, Ramos-Soto et al 2017).
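To make the fuzzy logic idea a bit more concrete, here is a minimal Python sketch of how a word such as “evening” could be given a graded rather than all-or-nothing meaning over clock times. This is only an illustration and is not the model from Ramos-Soto et al 2017: the function names (trapezoid, evening_membership) and the breakpoints (17:00, 19:00, 21:00, 23:00) are my own assumptions, and in real work such values would have to be estimated empirically from corpora or user studies.

```python
def trapezoid(x, a, b, c, d):
    """Degree (0.0 to 1.0) to which x belongs to a trapezoidal fuzzy set.

    Membership rises linearly from a to b, is 1.0 between b and c,
    and falls linearly from c to d. Assumes a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def evening_membership(hour):
    """How well 'evening' describes a time given in hours (24-hour clock)."""
    # Hypothetical breakpoints, chosen purely for illustration.
    return trapezoid(hour, 17.0, 19.0, 21.0, 23.0)


if __name__ == "__main__":
    for hour in (16, 18, 20, 22):
        print(hour, evening_membership(hour))
```

With these assumed breakpoints, 20:00 counts fully as evening, 18:00 and 22:00 only partially, and 16:00 not at all. The attraction of this kind of model is that the parameters are directly inspectable, so they can be compared against how human writers actually use the word.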
High quality testing of scientific hypotheses, including “evaluation” in NLP, is clearly of paramount importance in making scientific progress. In science as a whole, we have seen again and again that sloppy hypothesis testing may make us happier in the short term, but is likely to lead us into a dead-end in the longer term.
As any reader of this blog will quickly discover, I am unhappy with the lack of rigour in current academic NLP evaluations, especially compared to evaluations in clinical medicine. Of course one could argue that less-than-ideal evaluations in NLP are unlikely to kill people (unlike in medicine), but I still strongly feel that we need to “up our game” and improve the quality and rigour of our evaluations. It would also be good if academic evaluations could include issues which are very important in real-world commercial NLG (and NLP), such as worst-case performance.
I have written and spoken about evaluation issues many times, for example my 2009 CL paper and 2016 NAACL invited talk. But it remains an area where progress can seem very slow, and in particular there seems to be a lot of work on details (tweaking evaluation metrics and statistical tests) but not nearly enough discussion of fundamental issues (eg, what is the proper role of evaluation metrics in NLP). There are also serious change management issues in changing the behaviour of the research community. In any case, my current work in this area includes reviewing what we know about the validity of BLEU.