Last week I attended a really interesting but also very worrying workshop on Safety for Conversational AI. Most of the workshop was devoted to topics such as stopping inappropriate language (profanity, hurtful stereotypes, etc.), which of course is very important! But what made the biggest impression on me was a talk by Timothy Bickmore (based on a paper), where he pointed out that conversational assistants could potentially severely injure or even kill people with inappropriate medical advice. Bickmore created a number of scenarios and asked subjects to go through them with Alexa, Siri, and Google Assistant. In some cases, following the advice given by the assistant could have led to death.
This was part of an experiment which used made-up scenarios deliberately designed to be difficult; no one was killed in real life! But still, Bickmore’s work shows that it is conceivable that users of an NLP AI could die because of what the system told them.
Which led me to wonder: is it conceivable that users of an NLG system could be injured or even killed because of what the system told them? Certainly some of my research projects have encountered situations where NLG systems could potentially harm users, and I can think of situations where (had the system been deployed) death was conceivable.
SkillSum: upsetting users
Around 15 years ago, Sandra Williams and I were working on SkillSum, which was an NLG system which gave feedback to adults with poor literacy and numeracy about their skills (based on an online assessment) and how it impacted their career goals. For example, if someone wanted to be a plumber but could not multiply, we might tell him that he had problems multiplying, point out that plumbers need to multiply in order to figure out pipe dimensions, and recommend that he sign up for a numeracy course.
We evaluated SkillSum with 230 students at a Further Education college who were considered to be at risk of having inadequate literacy or numeracy. On average, SkillSum helped the students understand the problems, and most students found the feedback useful. However, 2 of the 230 students got upset, and one started crying. The system was telling them unpleasant truths about their lack of skills and how this impacted their career hopes, and while some students preferred to hear bad news from a computer, others did not want to hear bad news from an uncaring machine.
Because this was a research experiment, we had tutors on hand who could reassure and comfort the students. But if SkillSum had been used online without any backup from human tutors, some of its users might have become very depressed. In the worst case, if a user was already suicidal, could SkillSum have pushed someone “over the line” into committing suicide?
RoadSafe: unsafe decisions
A few years after SkillSum, Ross Turner, who was then a PhD student at Aberdeen, worked on RoadSafe, which generated weather forecasts to support people who were putting grit on roads to prevent icing. There is no way that a few paragraphs of text can precisely identify everywhere in a region where icing may happen, so RoadSafe had to use approximate descriptions such as “far Southern regions”. This raised a danger: if the description was too vague, or was misunderstood because of lexical variability (e.g., “far Southern” means different things to different people), could this lead to a road engineer not putting grit on a place where the road would ice, in turn contributing to a serious car accident?
Ross was very aware of this danger and did his best to minimise it. Also, RoadSafe was a research project; it was never used operationally. And even if it had been deployed, road engineers look at several data sources and presentations and use their years of experience when deciding where to grit; they wouldn’t just look at RoadSafe-type weather forecasts. But still, whenever we deploy a system such as RoadSafe which supports decision-making in contexts where a wrong decision could lead to injury or death, there is a danger that misunderstood NLG texts could encourage poor decisions.
Babytalk: bad news triggers heart attack?
Around 10 years ago, Wendy Moncur was doing a PhD in the context of the BabyTalk project, where she essentially looked at producing reports about a sick baby (in intensive care) for friends and family (paper). Wendy did a lot of user studies, and discovered that some parents were worried that bad news about a baby could trigger a heart attack in an elderly grandparent or other relative. In other words, if someone who is already at risk of heart attack hears the very upsetting news that a baby may die, this could trigger a heart attack and potentially even kill the relative.
Kees van Deemter and I discuss the problem and context in a recent paper, which uses this example to argue that in some cases ethical NLG systems may need to lie. But in any case, Wendy’s work (again, the system was never fielded) suggests that an NLG system which communicates very upsetting information could be dangerous to its users.
Recent examples: diet advice, causality
The above examples are from work my students did many years ago. But some of my current students are also seeing cases where there is potential for harm. For example, Simone Balloccu is looking at building an NLG app which gives people information about their diet. User studies suggested that people want advice on changing their diet, so Simone added dietary suggestions. However, in a pilot experiment, one subject pointed out that the advice was dangerous for him, because he was diabetic and the system had not taken this into consideration. Of course allergies and other medical conditions can also influence diet, so Simone removed the dietary suggestions from his system.
Another student, Jaime Sevilla, is looking at adding causality information to explanations of AI models. Jaime recently pointed out to me an example (extracted from a new book) where an explanation of a model built from medical data, if causation is ignored, could suggest that having asthma reduces the risk of dying from pneumonia. This is wrong, and could be dangerous if it influences how such patients are cared for.
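To see how a purely correlational model can reach this dangerous conclusion, here is a minimal sketch with invented patient counts (the numbers are hypothetical, chosen only to illustrate the confounding): asthma patients are routinely given more aggressive care, so in the raw data they die less often, even though asthma itself makes pneumonia more dangerous.

```python
# Invented counts: (has_asthma, got_intensive_care) -> outcome tallies.
# Asthma patients all received intensive care, which lowered their
# observed mortality below that of non-asthma patients.
patients = [
    ((False, False), {"died": 30, "survived": 270}),  # 10% mortality
    ((True,  True),  {"died": 5,  "survived": 95}),   #  5% mortality
]

def mortality(records, asthma):
    """Observed death rate for patients with or without asthma."""
    died = survived = 0
    for (has_asthma, _care), counts in records:
        if has_asthma == asthma:
            died += counts["died"]
            survived += counts["survived"]
    return died / (died + survived)

print(mortality(patients, False))  # 0.1
print(mortality(patients, True))   # 0.05
# A model fitted to this data, and any explanation derived from it,
# would report that asthma is associated with *lower* risk of death --
# dangerously wrong if read causally, since the real reason is the
# extra treatment asthma patients received.
```

The association in the data is genuine; what is wrong is the causal reading of it, which is exactly what a naive model explanation invites.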
We need to think about the worst case in the messy real world!
The above examples are all from research projects; I am not aware of any cases where NLG systems have actually injured people! But if we want NLG technology to be successfully used in complicated and messy real-world contexts, where texts are read and used by real people, then we need to think about the “worst case” and do our best to eliminate, or at least reduce, the risk of harm. This is especially true in applications which are safety-critical (like decision support for road icing or medical treatment) or which involve communicating emotionally upsetting information (like assessment results or updates on a patient’s status).
From this perspective, I am disappointed that many researchers do not seem very interested in worst-case performance. Craig Thomson and I are doing some work on detecting accuracy problems in generated texts (some of this will be presented at INLG 2020), and one thing that has depressed me is that we saw cases where a system sometimes generated truly awful texts (nonsensical and incomprehensible), but the papers describing these systems said very little about this. Certainly concrete examples of worst-case incoherent texts are rare in papers about NLG systems.
Such attitudes need to change. It is essential to understand how bad our generated texts can be in the worst case, and what harm these texts can do. And this needs to be clearly explained to potential users and also to fellow researchers.