Can LLMs make medicine safer?

There is huge interest in AI safety in 2024, with major efforts from both companies and governments and extensive media coverage. This concern is absolutely warranted, not least in medicine and healthcare, where bad advice has the potential to injure or even kill people.

But LLMs also have the potential to make medicine safer, which very few people seem to talk about. People (including doctors) make mistakes, and (human) medical error kills large numbers of people. In the UK, thousands of people each year are killed by medical errors; in the US, it's tens or even hundreds of thousands.

As pointed out in the excellent book Black Box Thinking, a lot of the problem is cultural: one reason that airplanes are safer than hospitals is that the aviation industry has a much better safety culture than healthcare. Indeed, the book notes that many doctors show little interest in safety and in reducing medical errors. Technology, of course, is not going to fix cultural problems! But that doesn't mean it is useless.

Human+LLMs make fewer mistakes than humans on their own

Coming back to NLP, some of my students have found that doctors who work with LLMs can produce more accurate reports.

  • Francesco Moramarco evaluated the impact of an NLG system which created draft summaries of doctor-patient consultations, in real-world production usage; doctors edited the summaries before they were recorded in the patient record. Amongst other things, he selected 20 summaries produced using the system (which had been post-edited by the doctors) and 20 summaries (from the same clinicians) which were written without the system, and analysed them for errors, including omissions. He found that none of the summaries contained critical errors, but the summaries produced by post-editing the NLG system outputs had half as many non-critical errors as the manually-written summaries.
  • As described in a previous blog, Mengxuan Sun is experimenting with using ChatGPT to explain complex medical records, and has found several cases where the ChatGPT explanation highlighted errors or confusing language in the medical record (see the sketch below).
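To make this concrete, here is a minimal sketch of what this kind of usage could look like. It is purely illustrative, not Mengxuan's actual setup; the model name, prompt wording, and the explain_record helper are my own assumptions.

```python
# Illustrative sketch only: ask an LLM to explain a medical record in plain
# language and to flag anything that looks erroneous or confusing.
# Model name, prompt, and helper function are assumptions, not the actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Explain the following medical record in plain English for the clinician. "
    "Separately list anything that appears to be an error, an omission, or "
    "confusing/ambiguous wording. Do not give diagnoses or treatment advice."
)

def explain_record(record_text: str) -> str:
    """Return a plain-language explanation plus a list of possible issues."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any capable chat model would do
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": record_text},
        ],
        temperature=0,  # favour consistency over creativity
    )
    return response.choices[0].message.content

# The output is shown to a clinician, who decides what (if anything) to act on;
# nothing is written back to the patient record automatically.
```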

Of course, the problems that Francesco and Mengxuan are seeing are not medical errors that could kill patients! They are mistakes in notes written by doctors which may have little impact on actual patient care. But still, their work shows that doctors can potentially use LLMs to reduce errors.

It makes sense that humans and LLMs working together make fewer mistakes than humans (or LLMs) on their own, because the two make different kinds of errors. Humans (especially when stressed, overworked, and sleep-deprived, as many clinicians are) overlook things and make simple mistakes; for example, many of the human errors we observed in sportswriting were due to writers copying the wrong numbers into a story. LLMs, on the other hand, are poor at some types of reasoning and can be confused by statistically unlikely situations. So when a human post-edits an LLM output, there are fewer “human-type” errors, because the LLM is less likely to make these, and fewer “LLM-type” errors, because the human editor will detect and fix many of these.

Discussion

I am not suggesting stand-alone use of LLMs in healthcare; I agree completely that this has huge risks! What I am saying is that in some cases doctors can use LLMs to reduce errors, and this is absolutely worth doing if it leads to even a small reduction in medical errors.

Although the media (and big tech) play up AI systems that do impressive diagnostic tasks, most medical errors are not due to doctors making mistakes on complex diagnoses. Instead, they are due to simple human errors made by (often stressed and over-worked) healthcare staff.

To give a personal example, I was once in the emergency unit after a bike accident, and nurse A gave me a tetanus shot, which made sense. But then nurse B came along and also wanted to give me a tetanus shot. When I protested (two tetanus shots are not a good idea), she told me that there was no record of me getting a tetanus shot. In other words, nurse A did not record that she had given me a tetanus shot (maybe she was called away by an emergency), which meant that nurse B almost committed a medical error (not a life-threatening one!) by giving me a second tetanus shot.

So imagine an LLM “health co-pilot” which in no sense tries to be a doctor or make clinical decisions, but rather helps clinicians do routine tasks more quickly and accurately (like the summarisation system Francesco worked with), highlights the sort of human errors that Mengxuan observed, and reduces the admin/reporting workload on doctors.
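The sketch below shows the workflow such a co-pilot could follow. It is hypothetical; the function names and data structures are my own assumptions. The key design point, consistent with the argument above, is that the LLM only drafts and flags, and a clinician must review and approve everything before it reaches the patient record.

```python
# Hypothetical "health co-pilot" workflow sketch: the LLM drafts summaries and
# flags possible issues; a clinician edits and approves before anything is saved.
# All names here are assumptions for illustration, not a real system.
from dataclasses import dataclass, field

@dataclass
class CoPilotOutput:
    draft_summary: str                      # draft consultation summary for the clinician to edit
    flagged_issues: list = field(default_factory=list)  # possible omissions/inconsistencies

def draft_with_llm(consultation_transcript: str, patient_record: str) -> CoPilotOutput:
    """Placeholder for the LLM step: summarise and flag, never decide."""
    # In a real system this would call an LLM (as in the earlier sketch).
    return CoPilotOutput(
        draft_summary="<LLM draft summary of the consultation>",
        flagged_issues=["Tetanus shot mentioned in consultation but not recorded?"],
    )

def save_to_record(summary: str, approved_by: str) -> None:
    """Only clinician-approved text is ever written to the record."""
    print(f"Saved summary approved by {approved_by}:\n{summary}")

# Workflow: LLM drafts -> clinician reviews flagged issues and edits -> record is updated.
output = draft_with_llm("...consultation transcript...", "...existing record...")
edited_summary = output.draft_summary  # clinician edits the draft as needed
save_to_record(edited_summary, approved_by="Dr A")
```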

I don’t think such a system would get media headlines, but it might just make medicine safer.
