Commercial and Academic Perspectives on NLG (and AI?)

I am in the interesting situation of spending part of my time on commercial Natural Language Generation (NLG) work at Arria NLG, and part of my time on academic/research NLG work at Aberdeen University.  While I work on NLG in both places, there are major differences in emphasis and focus, which I suspect apply more generally to commercial and academic work in artificial intelligence.

Average vs Worst-Case Performance

Academic work, at least as of 2016, usually focuses on doing a good job “on average”. Academic evaluations usually run a system or algorithm on a wide range of input data sets, calculate a performance score on each of these data sets, and then present the average of these “individual data set” scores as the system’s overall score.  This means that the focus is on doing something reasonable in most cases, without perhaps worrying a great deal about what happens in unusual boundary (edge) cases.

In the 1980s, incidentally, AI researchers were notorious for focusing on “best-case” performance, ie showing that their systems did a good job on a handful of carefully selected examples.  Fortunately the field has moved on, and “best case” performance is no longer regarded as an acceptable evaluation.

In the commercial world, sometimes focusing on average performance makes sense.  For example, Google Translate tries to do an OK job on average, but it occasionally produces a really poor translation; this is regarded as acceptable.  However, in many commercial NLG applications, it is essential that the system produce reasonable texts in *all* cases.  For example, a medical decision-support system cannot produce texts that decrease the quality of patient care.  Such systems are evaluated by their “worst case” performance as well as their “average case” performance.

One result of this is that commercial companies such as Arria put much more effort into testing and quality assurance than academic research projects.  This is one of the biggest differences I find “on the ground” when working on NLG projects at Arria and at the university.  At Arria, nothing gets released without going through an extensive test/QA process, which (amongst other things) is explicitly designed to check behaviour on difficult edge/boundary cases, and ensure results are acceptable in these cases.  At Aberdeen University, we certainly spend time debugging software, but in a less organised fashion, and we usually do not explicitly check behaviour on unusual edge cases.
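To make the edge-case point concrete, here is a toy sketch of what such tests might look like; the summariser and its messages are invented for illustration, and this is not Arria’s actual QA process:

```python
# Hypothetical edge-case tests for a toy NLG summariser.
# summarise() is an illustrative stand-in, not a real Arria API.

def summarise(readings):
    """Toy summariser: describe a list of temperature readings."""
    if not readings:
        return "No data is available."
    lo, hi = min(readings), max(readings)
    if lo == hi:
        return f"The temperature was steady at {lo} C."
    return f"The temperature ranged from {lo} C to {hi} C."

# Average-case behaviour, which an academic evaluation would cover.
assert summarise([18, 21, 19]) == "The temperature ranged from 18 C to 21 C."

# Edge cases which a QA process must explicitly check:
assert summarise([]) == "No data is available."                      # no input at all
assert summarise([20]) == "The temperature was steady at 20 C."      # single reading
assert summarise([-5, -5]) == "The temperature was steady at -5 C."  # identical values
```

The point is not the toy logic but the last three assertions: an “average case” evaluation might never feed the system an empty or degenerate data set, whereas a commercial test suite must.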

Interesting vs Common

Academic work often tends to focus on things which are interesting because they lead to fundamental insights about computation, language, etc; it is less important how common such phenomena are.   A good example is the extensive research on donkey sentences (such as “Every farmer who owns a donkey beats it”), despite the fact that such sentences are very rare. This is because studying such sentences leads to deep insights about the semantics of language.

In contrast, commercial work often tends to focus on things which are common and important in an application domain, even if they do not reveal deep insights.  For example, predictive text is very useful and commercially important, but probably does not teach us a lot about language or computation.

A good example in NLG is generating referring expressions.  I have worked on this both academically and within Arria.  My academic work, like most academic research work on reference in NLG, has focused on choosing attributes in definite NPs.  For example, should I refer to a specific chair as the big chair, the black chair, or the big black chair?  Very sophisticated models and algorithms have been developed for this task (which are summarised in a recent survey article).  This work has led to many deep insights in NLG, on topics such as conversational implicature and appropriate evaluation techniques as well as reference per se.
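The flavour of these algorithms can be sketched with a simple incremental attribute-selection procedure, in the spirit of the Incremental Algorithm of Dale and Reiter; the chairs and the attribute preference order below are invented for illustration:

```python
# Rough sketch of incremental attribute selection for definite NPs
# (in the spirit of Dale & Reiter's Incremental Algorithm).
# The domain objects and preference order are invented for illustration.

def choose_attributes(target, distractors, preference_order):
    """Pick attributes that distinguish the target from the distractors."""
    chosen = {}
    remaining = list(distractors)
    for attr in preference_order:
        value = target[attr]
        # Keep an attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            chosen[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:  # target is now uniquely identified
            break
    return chosen

# Three chairs; we want to refer to chair1.
chair1 = {"type": "chair", "colour": "black", "size": "big"}
chair2 = {"type": "chair", "colour": "black", "size": "small"}
chair3 = {"type": "chair", "colour": "red", "size": "big"}

attrs = choose_attributes(chair1, [chair2, chair3], ["type", "colour", "size"])
print(attrs)  # {'colour': 'black', 'size': 'big'} -> "the big black chair"
```

Real versions of such algorithms deal with much more than this sketch does, for example always including the head noun, handling relational and vague attributes, and modelling what the hearer actually knows.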

However, from a commercial perspective, choosing attributes in definite NPs is not a high priority, because definite NPs which contain attributes (ie, are more than just specifier and head noun, such as the chair) are not very common.  Also, when they do occur, very simple attribute-choice algorithms are often adequate.  Thus in commercial contexts I don’t spend much time thinking about attribute choice in definite NPs; instead I think about the types of reference which are important in Arria’s systems, such as references to names (eg, Arria NLG vs Arria), dates (eg, yesterday vs 19-Dec-2016), and components of complex machines (which Arria has patented).
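As a rough illustration of the date case, a system might choose between a relative expression and an absolute one depending on how far the date is from “today”; the function and thresholds here are invented, and this is not Arria’s actual logic:

```python
# Illustrative sketch (not Arria's actual logic): choosing between a
# relative date expression ("yesterday") and an absolute one ("19-Dec-2016").
from datetime import date

def refer_to_date(target, today):
    """Use a relative expression for nearby dates, an absolute one otherwise."""
    delta = (today - target).days
    if delta == 0:
        return "today"
    if delta == 1:
        return "yesterday"
    if delta == -1:
        return "tomorrow"
    return target.strftime("%d-%b-%Y")  # e.g. 19-Dec-2016

today = date(2016, 12, 20)
print(refer_to_date(date(2016, 12, 19), today))  # yesterday
print(refer_to_date(date(2016, 11, 3), today))   # 03-Nov-2016
```

Simple as this looks, getting such choices right across genres (a clinical report may forbid relative dates entirely) matters far more in practice than sophisticated attribute selection.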

Flexibility

Most academic work focuses on solving a problem in a specific context.  For example, participants in a Shared Task challenge focus on getting their algorithm to work on the specific data and context specified by the shared task.  In an NLG context, my academic projects have also focused on specific data sets and contexts.  For example, Babytalk focused on the Neonatal Intensive Care Unit (NICU) at Edinburgh Royal Infirmary, and the specific electronic patient record system (a customised version of Badger) which was used in this unit.  In Babytalk we did not seriously look at deploying our algorithms and systems in other contexts, not even in other NICUs in Scotland.  Doing so would have required a very large amount of extra software engineering work which would not have had any academic benefits (ie, we couldn’t publish research papers about this work).

In the commercial world, in contrast, systems usually need to be flexible and deployable in different contexts.  There are some exceptions, but in most cases we want modules which can be used in many places.  For example, looking again at the Babytalk context, a NICU NLG report generator which could easily be integrated with NICUs in hundreds of hospitals is a far more appealing commercial prospect than a NICU NLG report generator which only works in one hospital.  Indeed, even within one hospital, a flexible NLG system (in this sense) is more useful because it will be easier to adapt and maintain as the hospital itself changes over time, including bringing in new IT systems.

Flexibility largely comes from good software engineering on things like data connectors.  But it also impacts NLG algorithms, because we need algorithms which work well in different contexts (domains, genres, applications), which amongst other things means we need algorithms which can easily and robustly be parametrised for different contexts.  This is not something the academic community has focused on.
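One simple way to think about this kind of parametrisation is to pull context-specific choices out of the algorithm and into a configuration object, so that the same module can serve different deployments; everything below (the config fields, units, hospital names) is invented for illustration:

```python
# Hypothetical sketch: parametrising one NLG module for different
# deployment contexts, rather than hard-coding one hospital's conventions.
from dataclasses import dataclass

@dataclass
class ReportConfig:
    unit_name: str          # which unit this deployment serves
    temperature_unit: str   # "C" or "F"
    use_relative_dates: bool

def render_temperature(value_c, config):
    """Render a Celsius value using this deployment's preferred unit."""
    if config.temperature_unit == "F":
        return f"{value_c * 9 / 5 + 32:.0f} F"
    return f"{value_c:.0f} C"

uk = ReportConfig("Edinburgh NICU", "C", True)
us = ReportConfig("Boston NICU", "F", False)
print(render_temperature(37.0, uk))  # 37 C
print(render_temperature(37.0, us))  # 99 F
```

The engineering challenge is doing this robustly for the NLG-specific choices (lexical conventions, reference strategies, document structure), not just for units and data connectors.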
