I’ve always been very interested in healthcare applications of NLG; communicating complicated information is a core activity in healthcare, and NLG can help with this! Especially in patient-focused contexts, where I believe NLG’s ability to communicate and humanise data can help people take better care of themselves. I’m currently working with 4 PhD students (and an MSc student) on health-related applications, and we have funding for a fifth PhD student, which hopefully we’ll be able to advertise soon. I’m also helping a spinout company (MIME Technologies) which is selling tools to help airline cabin crews deal with in-flight medical emergencies.
Anyways, one advantage of having several projects in the same area is I can start to understand generic problems and “pain points”. In health NLG, three recurrent “pain points” are data, evaluation, and safety. My students have lots of great ideas for models and algorithms which will help people understand and act on medical data. However, in order to properly explore these, they need to get access to relevant health datasets, run evaluations which measure real-world impact of their ideas, and ensure that their systems never say anything which is dangerous and could lead to injury. None of this is easy, but they cannot test their ideas and hypotheses without addressing these issues.
Nothing is as frustrating as getting access to good data sets. I had a real “success” moment last week, when we managed to get access to an excellent high-quality data set for an MSc student after two months of discussion and negotiation. But I’ve also had failures, where we’ve had to drop research directions (sometimes after months of work) because we couldn’t get access to the data that we needed. In most such cases, by the way, the data sets existed, but we couldn’t use them because of data protection or commercial issues. We’ve also built our own data sets on a few occasions, but this can be a huge amount of work.
Incidentally, I still remember the frustration I felt when we had to delete the datasets we used and developed for the Babytalk project. This was an awesome data resource which could have supported lots of valuable research after the project ended, but the hospital we were working with asked that we delete it when the project finished, so we did so.
Anyways, there are some high-quality health data sets which are available to the research community, such as MIMIC. But I’ve only once been able to benefit from such a dataset (and in this case, the dataset was produced by colleagues at Aberdeen’s Medical School), perhaps because of the research topics I am interested in. Of course, we could focus our research on areas where data is available, and I suspect this is what a lot of researchers do. But if everyone does this, it will really limit the research areas and questions we can investigate, which I think would be a real shame.
I have written dozens of blogs about evaluation, so I’ll just say here that if we want the medical community to seriously consider using our NLG health systems, we need to demonstrate that they have an impact on some real-world outcome which is important to the health community. The most obvious (and best) outcome is better health, but there are other useful outcomes as well, such as improved emotional state (eg less stress), higher patient satisfaction with the healthcare system, and saving time and money. But regardless, we do need to show some kind of real-world outcome; BLEU scores or “Do you like it” Likert ratings are not sufficient.
Just to be clear, in an NLG research project, especially one done by a student, we don’t normally need to do a full clinical trial! But we should provide enough evidence of effectiveness that a clinical trial can be justified as a followup project.
The above means that my students and I spend a lot of time and effort on evaluation. We can easily spend several months on designing an evaluation, getting ethical approval, carrying out the evaluation, and analysing results. Which can be a pain (especially when things go wrong and you need to throw out data or restart the evaluation, which has happened to me).
Last but not least, in health contexts we need to ensure that our systems never generate texts which could negatively impact health, because they are confusing or wrong. It is not acceptable for an NLG system to hurt people, even if this only happens once in a blue moon!
Safety is a tough issue, and I’m glad to see it being discussed more by the research community (and I’m looking forward to the upcoming Sigdial Safety in Conversational AI session). At the moment, our approach to this is pretty crude. Basically we try to avoid applications with inherent safety concerns and narrative content which could potentially be unsafe, we avoid using machine learning where we think it raises safety issues, and we do extensive testing. I’d love to have a more sophisticated approach to safety, and I look forward to learning more about how others deal with this issue (in all contexts, not just health).
Last week I talked to someone from EPSRC (main UK funder of CS research) about the non-academic impact of my work. I told her that most of my impact to date was economic, but I really hoped that I would be able to point to health impact in 5-10 years’ time.
I think this can happen, because NLG has tremendous potential to improve health, especially in contexts where patients are expected to look after themselves but are confused by complex medical data. But coming up with clever NLG models and algorithms in some sense is the “easy bit”; the real pain points are data, evaluation, and safety. If we want to make progress in health NLG (and indeed in many other areas of applied NLG), we need to resolve or at least “lessen the pain” of data, evaluation, and safety.