Can LLMs make medicine safer?
There is a lot of justified concern about the risks of using LLMs in healthcare. But LLMs can also make medicine safer, if they are used to support doctors and help them make fewer medical errors.
One very positive aspect of 2023 for me was that I saw lots of really interesting research papers, many more than in previous years. Perhaps the emergence of LLMs has encouraged some people to move away from scientifically dubious leaderboard chasing and towards more interesting research on scientific fundamentals? I describe a few of these papers here.
I summarise a few papers I have recently read on what LLMs can and cannot do. One (not surprising) finding is that LLMs’ skill profile is very different from humans’. This is good: it means that a human and an LLM working together can do things that neither can do alone. It also means that it makes little sense to evaluate LLMs using tests and techniques designed to evaluate people.
There is a personal rant (no connection to AI or NLP) about the fact that many decision-makers don’t seem to care about climate change. How else to explain the UK’s ban on new onshore wind in England, or the refusal to shift Bitcoin to more energy-efficient algorithms?
It is very rare to see evaluations in the NLP research literature which are based on measuring the impact of systems on real-world users. I’d love to see more such evaluations, and describe some ways of doing this, along with a few examples.
The easiest way to measure hallucination is to ask Turkers to count incorrect statements in a text. However, reproduction papers published at the Human Evaluation workshop suggest that this is not a reliable way to measure hallucination. Hopefully researchers will switch to better ways of measuring hallucination!
One of my students is investigating mistakes made by ChatGPT when explaining medical notes. Some of these turned out to be due to problems in the notes being explained (i.e. human error by the doctors who wrote the notes) rather than to LLM deficiencies.
At its best, peer review can significantly improve the quality of papers; it’s not just an accept/reject gate. I describe a few examples where peer review has led to major improvements in the quality of my papers.
Generated texts can (unnecessarily and unintentionally) upset people. I describe several examples I’ve seen of this. This post is based on a talk I gave at a SigDial workshop.
Many people assume that “human evaluation” means asking people to rate or rank outputs. However, there are many other types of human evaluation, most of which give more meaningful results than rating or ranking! I discuss some of these, including task-based evaluation, annotation-based evaluation, and real-world evaluation.