
Texts can be accurate but still not appropriate

I recently read “Large Language Models Encode Clinical Knowledge” (https://arxiv.org/abs/2212.13138), in which a team from Google showed that a modified version of PaLM (MedPaLM) seemed to do very well at answering medical questions. I was impressed by the thorough evaluation, which included asking a panel of clinicians to rate texts on 12 different criteria, and comparing MedPaLM texts to texts written by clinicians.

Then I showed some of the example MedPaLM outputs to a doctor (a GP) I know, and she said the texts were inappropriate because they could scare and worry patients, and that she would never use them. For example, Table 9 in the paper gives the following MedPaLM text:

Ear pain can be a sign of several underlying conditions, including: Middle ear infection (otitis media), which is commonly caused by a viral infection and can cause pain, redness, and swelling of the ear. External ear infection (otitis externa), which is commonly caused by a bacterial infection and can cause pain, redness, and swelling of the ear. Earwax impaction, which can cause pain, a feeling of fullness in the ear, and decreased hearing. Ear pain can also be a sign of a more serious condition, such as a brain tumor or a stroke. If you have severe ear pain that lasts longer than a few days, or if you have other symptoms along with the ear pain, such as dizziness, fever, or facial weakness or numbness, you should see a doctor for evaluation. The doctor can determine the cause of the pain and provide appropriate treatment.

MedPaLM response to “How do you know if ear pain is serious?”

The doctor said she would never mention a brain tumour or a stroke in this context, because it would cause high levels of anxiety in her patients, which would not be warranted in the vast majority of cases (it is fine to simply tell patients to see a doctor if symptoms persist or get worse).

I told her that these texts had been evaluated under many criteria by a panel of clinicians, so she asked what the criteria were. The 12 criteria are listed in Table 2 of the paper, and the only one relevant to the above issue is criterion #10, inappropriate/incorrect content. Mentioning brain tumours as a potential cause is not incorrect, but it is inappropriate. Interestingly enough, this is also the criterion on which MedPaLM does worst, according to the paper.

I guess one thing that bothers me is that the evaluation does not highlight or even mention the above concern about the text. If a domain expert has a serious and well-justified concern about a generated text, I would expect this concern to be mentioned and acknowledged in the course of a careful evaluation with (other) domain experts. But this did not happen here.

Accurate but not appropriate

I guess this is a minor incident, but it does highlight that content can be inappropriate even if it is accurate. Accuracy (which is related to hallucination) is a huge problem for neural NLG, and over the years I’ve published a number of blogs and papers about evaluating accuracy; much of this work is summarised in our 2023 journal paper Evaluating factual accuracy in complex data-to-text. Some of my students have also looked at omission, where key information is not included in a generated text. I’ve usually assumed that texts are acceptable from a content perspective if they do not contain inaccurate (hallucinated) content and do not omit key insights.

However, the above example shows that this assumption is wrong; content (such as mentioning brain tumours in the above context) can be inappropriate and harmful even if it is factually accurate! In this example the inappropriate content causes unnecessary stress and anxiety. I’ve seen other examples where texts are inappropriate because of their psychological impact. For example, in Babytalk, we acknowledged that it might be a bad idea to tell an elderly grandmother with a weak heart that her baby grandchild was doing poorly and might die (this is discussed in van Deemter and Reiter 2018). In behaviour change, psychology tells us that positive messages are usually more effective than negative ones. Hence, if we want to encourage people to stop smoking, eat healthier food, or drive more safely (I’ve worked in all of these areas), we don’t want to give too much negative information which criticises the user’s actions, because this may lead to the user “switching off” and ignoring what we tell them.

Of course, another situation where accurate content can be inappropriate is when it is unacceptable because it is offensive, racist, profane, etc.; this is often called safety (paper). Again, in many (not all) cases, safety issues are partially about the negative psychological impact of texts.

Last but not least, it is possible for texts to be accurate but misleading. In financial reporting (which is a major use case in commercial NLG), for instance, changing the time period can have a huge impact on content. For example, if a system is reporting on Tesla’s stock performance, it could say (at the time of writing) “Tesla stock has quadrupled in value over the past three years”; it could also say “Tesla stock has lost two-thirds of its value over the past year”. Both of these statements are accurate, but the first (and perhaps the second) on its own would be misleading and hence inappropriate.
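To make the arithmetic concrete, here is a minimal Python sketch (the prices are made-up round numbers for illustration, not real market data) showing how the same price history supports both headlines, depending only on the lookback window:

def percent_change(old_price, new_price):
    # Percentage change from old_price to new_price.
    return (new_price - old_price) / old_price * 100

# Hypothetical closing prices in USD (illustrative assumptions, not real quotes).
price_today = 120.0
price_one_year_ago = 360.0    # assumed for illustration
price_three_years_ago = 30.0  # assumed for illustration

# Same data, two very different headlines:
print(f"3-year change: {percent_change(price_three_years_ago, price_today):+.0f}%")  # +300%, i.e. "quadrupled"
print(f"1-year change: {percent_change(price_one_year_ago, price_today):+.0f}%")     # -67%, i.e. "lost two-thirds"

Which window a report generator selects (or whether it reports both) is a content-selection decision; both statements would pass an accuracy check, so appropriateness has to be assessed separately.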

Final thoughts

We need to acknowledge that content problems are not limited to accuracy and omission; content can be inappropriate even if it is accurate! We need a better understanding of these issues, and we also need to ensure that our evaluation criteria capture such problems.

I also wonder whether we’ll see more of this kind of issue with texts generated by large language models such as PaLM and GPT. The above is less of a problem in rule-based NLG (where we can easily adapt content rules), but it’s not clear (at least to me) how we could stop a large language model from adding inappropriate content such as the above.