AI in Healthcare

Real-world safety and harms from patient-facing LLMs

Last week I asked some people via social media and direct email whether they were aware of papers and data about real-world harms caused by AI chatbots in health contexts. I got some very interesting responses, which highlight three sources of data about real-world harms: incident reporting, clinical trials, and data from health providers.

By real-world harms, I mean studies and data about harms caused to people who use chatbots to get information about their own health. I exclude studies which ask participants to pretend to be patients in a scenario, such as Bickmore et al 2018 and Bean et al 2026. These are really interesting, but it is not clear that interactions with pretend patients resemble interactions with real patients (who care much more about the information, probably have existing knowledge and preconceptions, etc); certainly in other contexts I have seen that real patients behave differently from participants who are pretending to be patients.

Incident Reporting

Harmful incidents involving medication or medical devices (and indeed consumer products more generally) must be reported to the government in the UK and many other countries. At the time of writing, there is no comparable requirement for harmful incidents with chatbots, but several websites list incidents which are reported to them. These include

  • HumanLine (https://www.thehumanlineproject.org/): Shows stories about chatbot harm which were submitted by viewers and/or publicised by the media. I don't know if the stories are representative, but they are certainly moving!
  • IncidentDatabase (https://incidentdatabase.ai/): A database of incidents involving AI harm. Much larger than HumanLine, but less carefully edited.
  • Case reports of AI harm: Reports of individual cases which appear in the peer-reviewed academic literature; these are probably more accurate than the reports on the above websites. For example, Moore et al 2026.

Incident reports are very interesting and show what can go wrong with AI chatbots. However, since reporting of chatbot incidents is not mandated by governments, coverage is very uneven. For example, most people are not aware of HumanLine or IncidentDatabase, and hence will not submit reports to these databases.

Clinical Trials

Clinical trials provide the strongest evidence of medical impact, including harms and safety concerns. Unfortunately, most trials I have seen which look at chatbot harms in health dialogues use pretend patients (as described above), not real patients.

One exception is Brodeur et al 2026, which describes a trial of Google’s AIME system for patients. 100 patients used AIME a few days before a clinical appointment, in order to prepare for it. The AIME sessions were monitored by physicians, who could stop a session if they had concerns about harm or emotional distress. No sessions were stopped, but in 3 cases the monitoring physician gave the patient additional information about safety issues or corrected (minor) AIME hallucinations.

I hope to see more such studies in the future!

Data from health providers

Probably the best way to get representative data about chatbot harm across patients is for healthcare providers to give data about harms they have encountered in their patients. One such paper is Olsen et al 2026, which analyses descriptions of chatbot use in psychiatric clinical notes from 1.5 million people in Denmark, between Sept 2022 and June 2025. They found 38 cases where chatbot use caused harm, for example by consolidating delusions or reinforcing mania; such cases involved less than 0.1% of the patients who used the service over the period examined.

I would love to see more such data! It is statistical in nature rather than detailed, but provides good coverage across patient demographics; in a sense it is the opposite of incident reports. I appreciate that it is challenging to provide such data because of confidentiality and data protection issues, and furthermore some health providers may be reluctant to share such data because of concerns about legal or reputational risk. But if it can be provided, it is very useful!

Aside: Harms from Chatbots vs Social Media

I recently read Cohen’s excellent book Bad Influence: How the Internet Hijacked Our Health, which describes harms seen in people who get health information from social media. It struck me that some of these harms are similar to the chatbot-induced harms described above. At a high level, in both cases the key problems are misinformation and overuse (addiction in some cases) of the tool. It is worth noting that the tech companies that dominate social media (Google/YouTube, Facebook/Instagram, ByteDance/TikTok) are also prominent in AI chatbots (ByteDance’s Doubao is not well known outside China, but it is the most popular chatbot in China).

On a more positive note, misinformation is inherent in social media, since anyone can post just about anything, but we should be able to engineer bots to reduce their misinformation and harmful behaviour. So if people who currently use TikTok for health information switch to ChatGPT or Gemini, this is probably a “win” from a safety perspective.

Final thoughts

Hundreds of millions of people are using AI chatbots in health contexts, so it is very important that such chatbots are safe and do not harm patients. Unfortunately, we do not have much data about the safety of AI chatbots when used by real patients, but there is some, and the evidence base is growing. The limited evidence we have suggests that while AI chatbots are usually safe, in a few cases they can harm people, especially from a mental health perspective.

We badly need better data about the real-world safety risks of health chatbots, in order to design safer bots. One thing that would help is if governments legally required vendors to formally report safety issues with chatbots, and also made it easy for users to report such issues. Indeed, perhaps existing mechanisms for product safety could be adapted for and extended to chatbots. At the moment, usage data and feedback from chatbot users just goes to the vendors, who treat this as sensitive commercial data which cannot be shared (blog); this is not ideal from a safety monitoring perspective. Another possibility is more monitoring of health records, as done by Olsen et al 2026.

However, despite the above concerns, we should also keep in mind that AI bots can really help people with health concerns, especially if the alternative is getting health information from TikTok. So let's make them safer; abandoning them would be a mistake!