The topic of how end users feel about and react to NLG came up twice in the past few weeks in discussions with PhD students. It's a good question that is not easy to answer, so I'll write down some thoughts here. As always, I cannot discuss my experiences at Arria in this blog, which is a real pity in this case; there are also limits on what I can say publicly about my experiences at Aberdeen University. But I will make a few general points and also repeat what people I respect have said to me.
User Reaction is Not the Same as Evaluation!
The first point is that end user reaction is **not** the same as evaluation. In broad terms, evaluation is about testing hypotheses about how effective a system is. End user reaction of course depends on effectiveness, but it also depends on workflow, fears about job security, and impact on organisational structures. In other words, it is heavily influenced by change management issues. The form of evaluation which comes closest to understanding the reaction of end users is Human Ratings in Real-World Context, because we're evaluating with real users (instead of Turkers, students, etc) in a real-world context and workflow, and also soliciting opinions (ie, user reactions) rather than measuring task performance (and good task performance in some cases might lead to concerns about job security). But even here, such studies may be biased because they often work with end users who are enthusiastic about technology and don't feel their jobs are threatened; people who are not enthusiastic, or who do feel threatened, are often reluctant to participate in evaluation experiments.
Advice from Andersen Consulting
In the 1990s, I worked with some people from Andersen Consulting (now called Accenture) who had a lot of experience in medical informatics. They explained to me that, at least in medical contexts, developing effective techniques for improving patient outcomes was the easy bit; the hard part was getting the medical community to adopt these new techniques, especially if they changed workflow, threatened jobs, or changed the way organisations got paid. For example, it's relatively straightforward for hospitals to adopt a new and more effective drug, since this doesn't change workflow, etc. But it's much harder for a large number of hospitals to adopt new decision-support technology for doctors, because this does change workflow.
Daniel Kahneman pointed out in his excellent book Thinking, Fast and Slow that we have had medical diagnosis algorithms since the 1950s which outperform average doctors, but uptake of this technology has been very slow. This is probably because of the issues mentioned above, as well as an understandable reluctance on the part of doctors to accept responsibility (and indeed malpractice liability) for decisions made by an algorithm they did not understand.
Advice from Rob Milne
In the early 2000s I was fortunate to work with Rob Milne, before his untimely death climbing Mt Everest. Rob was a pioneer in building commercial AI systems for heavy engineering industries, and I still remember one piece of advice he gave me: focus on fully automating peripheral tasks. People react much more positively to automating peripheral tasks they don't care much about than to automating core tasks which are essential to their self-image. Going back to medicine, a doctor may welcome automation of reporting or documentation tasks but resist automation of diagnosis. This is because diagnosis is a core aspect of medicine, and he is proud of his skill and expertise in diagnosis, while writing reports is an unfortunate aspect of medical life which he would love to offload onto a piece of software.
Rob also stressed that fully automating tasks was much preferable to partial automation, where the computer produces a draft which the user needs to check and post-edit. If the user has to check and post-edit a document, he may decide it's less hassle to just write the whole thing himself, without bothering with the software system. However, I think it is OK (this comes from other people, not from Rob) for an AI system to automatically generate a document 90% of the time, but once in a while say "I can't do this" and ask the human for help (and the human can either check/post-edit or write from scratch). After all, cases the software struggles with are likely to be more interesting for a person to tackle. Of course, this approach only works if the NLG or AI system is able to reliably assess the quality of its output.
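The "fully automate, but occasionally escalate" workflow above can be sketched in code. This is a minimal, hypothetical illustration, not any real NLG system's API: `generate_draft` and the confidence threshold are stand-ins, and reliably estimating that confidence score is exactly the hard part noted above.

```python
# Hypothetical sketch of "automate fully, escalate when unsure".
# generate_draft() and the 0.9 threshold are illustrative assumptions,
# not part of any real NLG library.

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; tuning this is the hard part


def generate_draft(data):
    """Placeholder generator: return (text, confidence in [0, 1]).

    A real system would run its NLG pipeline here and self-assess
    output quality; we fake both with trivial logic.
    """
    text = f"Forecast based on {len(data)} observations."
    confidence = 0.95 if len(data) >= 3 else 0.4
    return text, confidence


def produce_report(data):
    """Issue the text automatically if confident; otherwise hand the
    whole task to a human, rather than asking them to post-edit a
    possibly flawed draft."""
    text, confidence = generate_draft(data)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"status": "automatic", "text": text}
    return {"status": "needs_human", "text": None}
```

For example, `produce_report([10, 12, 11])` would be issued automatically, while `produce_report([10])` would be routed to a person to write from scratch. The design choice, following Milne, is that the human either gets no work at all or the full (more interesting) task, never a half-finished draft to babysit.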
SumTime and Meteorologists
Some of the above points were illustrated by our experiences when a local weather company used our SumTime weather forecast generator for a few years. SumTime’s forecasts had been rated very highly in an evaluation with forecast readers (Reiter et al 2005).
The weather company asked human forecasters to post-edit the computer-generated forecasts before they were issued and sent to clients. We analysed the post-edits (Sripada et al 2005), and essentially discovered that while most of the forecasters only occasionally post-edited the computer texts (presumably when they saw problems), one of the forecasters edited almost everything, and indeed often just deleted the computer text and wrote a new forecast from scratch. This forecaster was one of the most experienced at the company, and took pride in his ability to produce well-crafted forecasts. In other words, while many forecasters view their core task as analysing the weather, and regard actually writing weather forecast texts as a peripheral task, this individual regarded writing forecast texts as a core task which he was very good at. Hence it's perhaps not surprising, following Milne's guidelines, that this forecaster was more reluctant to accept NLG texts than colleagues who regarded writing forecast texts as a distraction from their main job, which was analysing weather data.
This experience also shows the difference between evaluation and user reaction. As discussed in our paper, some of the post-edits that this individual made seemed to be based on individual stylistic preferences which probably did not impact the utility of the forecast to the forecast reader. Ie, these post-edits were made because the forecaster preferred a certain style, not because they enhanced utility as measured by reader evaluations.