Someone recently asked me why the Babytalk system is not being used operationally. It was a successful research project, so why didn't it enter everyday usage? I suspect the answer to this question is obvious to my commercial readers, but perhaps less obvious to my academic readers, so I will address this blog to them.
To give some background, the Babytalk project developed several NLG systems that summarised clinical data about babies in neonatal intensive care units.
- BT45: short summaries intended to support real-time decision making (“The baby is having some real problems now, what should I do”)
- BT-Nurse: multipage summaries to support nursing shift handover (“I have just come on shift, what do I need to know about the babies I am looking after”)
- BT-Family: page-long summaries for parents (“How is my baby doing?”)
This work was done in conjunction with the neonatal ICU in the Edinburgh Royal Infirmary. In terms of real operational usage:
- BT45 was never operationally used.
- BT-Nurse was operationally used for about a month as part of a formal evaluation, but was not used outwith this period.
- BT-Family was deployed for a few years operationally, but stopped being used after a major IT change which would have required significant updates to the software.
So why wasn't Babytalk used more? There are a number of reasons for this, some of which I can discuss publicly, some of which I can’t discuss in a published blog but can discuss orally “off the record” (which I do with our MSc in AI students), and some of which remain confidential. I’ll discuss what I can here. If you want to know more, feel free to enroll in Aberdeen’s MSc in AI!
When Babytalk was coming to a close, we discussed commercialising it with a medical informatics company (not Arria; Babytalk predates Arria). The company quite naturally asked how much it would cost to get Babytalk running in other hospitals, and we responded “a lot”. The company also asked how much effort would be needed to keep Babytalk running in Edinburgh if IT systems or clinical practices changed, and we said “quite a bit”. They then decided, not surprisingly, not to continue this discussion.
Babytalk, like many research projects, was designed to work in one environment. Our research goal was to develop data-to-text technology and evaluate its utility in a complex medical setting, not to develop maintainable software which could easily be deployed in many hospitals, and easily adapted to changing IT environments and workflows. This makes sense in a research project (developing maintainable software is not cheap), but it does make it difficult to use the research software without major effort (probably a complete rewrite of the code) to enhance maintainability.
Much the same happened with our SumTime weather forecast generator, incidentally. It was used operationally for a period by a company, but stopped being used because of the time and effort required to adapt the software to changing environments and contexts.
Legal and Regulatory Issues
Because BT45 and BT-Nurse impacted clinical care, they cannot be deployed operationally without getting regulatory approval. BT-Nurse was operationally used during its evaluation, but during this period a senior research nurse checked every BT-Nurse text (before it was shown to the duty nurse) to ensure that it would not hurt clinical care. In fact the senior research nurse never rejected a BT-Nurse text on this basis, but regardless we did not have permission to use the system operationally without this oversight, which clearly was not feasible outwith a research evaluation.
The situation was slightly different with BT-Family, since it did not impact clinical care. The biggest problem it faced was that parents wanted to read BT-Family reports at home over the internet, as this would really help parents who could not be at the hospital (a common situation was that one parent stayed at the hospital as much as possible with the baby, but the other parent was at home looking after other children). However, the hospital would not allow this because of its data protection policy (clinical data about patients could not be put on the internet). We tried to argue that the data should be under the control of the parents, who wanted it to be available online, but it was not possible to change the hospital’s policy.
Real-world AI systems need to conform to legal and regulatory rules, as well as corporate policies. And this can be a significant barrier to deployment.
While most staff at the hospital reacted favourably to Babytalk, a few did not. I cannot give details here (this is something I discuss in more detail “off the record” with our MSc students), but anyways the fact that some people had concerns about the system made it more difficult to operationally deploy. We had similar problems (to a lesser degree) with the SumTime weather forecast generator, incidentally.
In general, there are often major “change management” issues in getting existing staff to accept new technology, and I have seen this with Arria projects as well as with university research projects. Medicine can be especially challenging from this perspective, in part because it can be difficult for senior management to order doctors to do things which they do not want to do.
The goal of a research project is to test research hypotheses. Testing these hypotheses may require developing software which is usable in a specific context (like BT-Family and indeed SumTime), so we can measure its impact in real usage. But testing research hypotheses generally does not require addressing maintainability, legal and regulatory issues, and change management, at least to the level required of commercially deployable systems.
So we shouldn't expect research projects to produce deployable software. They may produce prototypes which can be converted into operational systems, but this conversion is a major task, which probably will cost significantly more than developing the research system.
One thought on “Why isn't Research Software such as BabyTalk Used?”
I agree that research projects don’t have to address all aspects of a production project, but I think they should discuss them; otherwise they risk exploring solutions for nothing real. For example, we should explore whether one solution (e.g. template-based) is more maintainable, auditable, etc. than another (e.g. a neural net), and not only whether it gets better BLEU, METEOR, etc.