building NLG systems

We need to understand what users want!

A key goal of NLG research is to lay the foundation for building better NLG systems. In 2022, I see a lot of research on improving algorithms and models, and also increasing research on data sets and evaluation, all of which should lead to more useful systems. But one thing I don't see much of is research on understanding what users want NLG systems to do. Which is a pity, because a better understanding of user requirements can absolutely lead to more useful systems! This is true for databases and ecommerce sites, and it's also true for AI, NLP, and NLG.

Below I give some concrete examples from projects where I have personally seen that a deep understanding of user needs led to better systems. I could also give *many* examples of projects where a poor understanding of user needs led to NLG systems which were useless in practice, but I’ll stick to positive examples in this blog.

Weather forecasts: What do forecast readers want?

20 years ago I was working on the SumTime system to generate marine weather forecasts. I was very interested in lexical (word) choice, so I did a lot of analysis on how human forecasters used words, by analysing a parallel corpus of weather prediction data and (human) forecasts written from this data. What I discovered, much to my surprise, was that there was a lot of variation in how different forecasters used some words. For example, some forecasters primarily used “by evening” to refer to 1800, while others primarily used it to refer to 0000 (Reiter and Sripada 2002; Reiter et al 2005).

We then talked to some forecast users, and discovered that they really disliked this. Essentially each day a different “duty forecaster” wrote the marine forecasts for a region, and this variability made it harder for the forecast users to correctly interpret the forecasts. The forecast users also complained that the usage of some terms, such as “later”, was different in marine forecasts and in other types of forecasts they used, such as shipping forecasts; this was also confusing to them.

Based on this user feedback, we then identified a set of phrases (such as “by midnight”) which were used reasonably consistently by human forecast writers and understood by human forecast readers, and programmed SumTime to use these phrases. When we evaluated SumTime with forecast users, we found that in some cases they actually preferred SumTime forecasts to human-written forecasts, in part because of clearer word choice! Claims of super-human performance are not unusual now, but in 2005 this was amazing; very few systems could claim better-than-human performance supported by good quality human evaluations.

In short, we built a high-quality word choice module which led to high-quality texts. But the quality of the module came not from its algorithms (which were very simple, e.g. lookup tables) but rather from our fieldwork with forecast users, supported by extensive linguistic analysis of corpus texts.
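To give a flavour of how simple a lookup-table approach to word choice can be, here is a minimal Python sketch. This is not SumTime's actual code: the function name `time_phrase`, the table entries (including “by midday”), and the fallback to an explicit clock time are all assumptions made for illustration. The only idea taken from the text above is that each time maps to one consistently used phrase.

```python
# Hypothetical lookup table: forecast hour (24h clock) -> time phrase.
# The entries are illustrative assumptions, not SumTime's real vocabulary.
TIME_PHRASES = {
    0: "by midnight",
    12: "by midday",
}


def time_phrase(hour: int) -> str:
    """Return one fixed phrase per hour, so the same time is always worded the same way."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    # Assumption: when no agreed phrase exists, fall back to an explicit
    # clock time rather than a word readers interpret differently.
    return TIME_PHRASES.get(hour, f"by {hour:02d}00")


print(time_phrase(0))   # by midnight
print(time_phrase(18))  # by 1800
```

The hard part is not the code but deciding what goes into the table, which is where the fieldwork and corpus analysis come in.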

Consultation summaries: What do doctors want?

Jumping ahead to 2022, one of my PhD students, Francesco Moramarco, is working with a healthcare company on generating summaries of consultations between patients and doctors and other clinicians. In the UK (and many other countries), when a clinician talks to a patient, a summary of that consultation needs to be entered into the patient’s medical record. Currently this is done manually, either by the clinician or by a medical scribe. The goal of the project Francesco is working on is to use NLP technology to automatically generate a summary of the consultation, which is checked and edited by the clinician before it is committed to the patient’s medical record.

As part of this project, Francesco and his colleagues did extensive work to understand how clinicians worked and what they wanted the system to do (Knoll et al 2022, which won an award at NAACL for best paper on human-centred NLP), including:

  • Analysis of pre-automation behaviour: They interviewed seven clinicians to understand how they currently took notes and wrote summaries, and what problems and concerns they had with the current process. From this they created several personas which represented different behaviours in this space, including Touch Typer (writes summaries during consultation), Sketcher (writes outline/sketch during consultation and expands it afterwards), and Doodler (writes notes during consultation, and then writes summary afterwards, referring to notes).
  • Low-fidelity mockups: They asked clinicians to try three low-fidelity mockups: (A) summary shown at end; (B) transcript shown during consultation and summary shown at end; (C) summary shown during consultation, incrementally growing in real-time as consultation progressed. The doctors strongly preferred (C), real-time summarisation of the consultation so far. This was a key finding, especially since previous work on text summarisation emphasised producing a summary of a full text, not incrementally summarising a dialogue (consultation) as it proceeded.
  • Wizard-of-Oz study: Clinicians used a “prototype” system where a human “wizard” wrote the summary, instead of software. Among other things, this study emphasised the importance of summarising consultations in as close to real-time as possible.
  • Live test: When the summarisation system was ready, an initial cohort of five clinicians used it for three weeks. The results were encouraging; in particular, usage of the system increased over time as clinicians became more familiar with it and gained trust in the system. Perhaps not surprisingly, clinicians reverted to manual writing during difficult consultations.

All of the above techniques are relatively standard in HCI, but very few researchers have discussed how to use them when building NLP systems. I hope this paper encourages NLP researchers to explore how to best use such techniques, and indeed how to adapt them to an NLP context.

Research Agenda

It sounds trite, but it remains 100% true that we can build better NLP systems if we understand the language and system behaviour that users prefer, as illustrated in the above case studies. I suspect that in a lot of cases, understanding user requirements will provide more benefit than tweaking models or even data sets, and I would love to see more NLP researchers working in this area.

More specifically, I would love to see research on

  • How should HCI techniques for understanding user needs be used in NLG? Very little has been written about this; I’d love to see more papers on this topic!
  • What kind of language do users actually want to see from NLG systems? We tend to assume users mostly care about fluency and accuracy, but we should investigate this rather than make assumptions. For example, it may be that in some cases users also care deeply about consistency and conformance to corporate brand standards.
  • How important is real-time incremental generation? The NLG research community has pretty much assumed that its task is to generate a complete output from a complete input, but as mentioned above there are contexts where generation should be incremental as more data becomes available. We should work with users to properly understand this.
  • How do users want to interact with NLG systems? In the research community today, I see a lot of work on chatbots and multimodal interfaces, but I don't see a lot of work on letting users control and configure NLG systems. I suspect that control and configuration is important in lots of contexts; let's work with users to properly understand this.
  • How much do users care about average-case vs worst-case behaviour? The academic community is 99% fixated on evaluating “average” performance of systems, but my suspicion is that a lot of users are just as interested in “minimum performance” guarantees, which are basically about worst-case behaviour. As above, it would be great to properly investigate this!

Of course the above list can be expanded! But the key point is that we should properly investigate such issues, and the way to do this is to work with users to understand what they want NLG systems to do. We’re certainly not going to make progress on the above by chasing leaderboards!
