Over the past few weeks, on several occasions I’ve struggled to understand papers because the authors made mistakes in references, tables, figures, or formulas. I know that it’s boring for authors to check such things, but it makes life much easier for your readers!
I would love to be able to define objective criteria for evaluating NLG texts. In principle, I think we can use task-based evaluation to measure utility, and some kind of mistake counting to measure accuracy. However, it’s harder to think of a way to measure fluency without relying on human judgements.
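As a concrete illustration of the mistake-counting idea, here is a minimal sketch of what such an accuracy measure could look like. The error categories and the per-100-words normalisation are my own assumptions for illustration, not an agreed protocol.

```python
# Minimal sketch of a mistake-counting accuracy measure for NLG output.
# The error categories and per-100-words normalisation are illustrative
# assumptions, not a standard protocol.

from dataclasses import dataclass

@dataclass
class ErrorCounts:
    incorrect_number: int = 0   # e.g. text says "5%" when the data says 3%
    incorrect_word: int = 0     # e.g. text says "rose" when the data shows a fall
    other: int = 0              # anything else a careful reader would flag as wrong

def error_rate(errors: ErrorCounts, word_count: int) -> float:
    """Return annotated mistakes per 100 words (lower is better)."""
    total = errors.incorrect_number + errors.incorrect_word + errors.other
    return 100.0 * total / max(word_count, 1)

# Example: a 120-word generated summary with two annotated mistakes.
print(error_rate(ErrorCounts(incorrect_number=1, incorrect_word=1), 120))  # ~1.67
```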
Reviewing for big NLP conferences has changed drastically since 1990, when 11 senior researchers reviewed all ACL submissions. Perhaps our expectations about conference papers also need to change, and become more similar to expectations in other scientific fields.
Many people have asked me if OpenAI’s GPT-3 will have a big impact on NLG. I suspect its overall impact will be limited (outside of a few niches), but of course time will tell.
I was very impressed by a paper we recently read in our reading group, which showed that small differences in BLEU scores for MT usually don’t mean anything. Since lots of academic papers justify a new model on the basis of such small differences, this is a real problem for NLP.
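To illustrate why a small corpus-level BLEU difference may just be noise, here is a rough sketch of a paired bootstrap resampling check. This is a generic significance-testing technique, not the specific analysis in the paper we read, and the toy sentences and resample count are assumptions.

```python
# Rough sketch of paired bootstrap resampling for a BLEU difference.
# Generic technique for illustration; the toy data and 1000 resamples
# are assumptions, not taken from the paper discussed above.

import random
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(refs, hyps):
    # refs: one list of reference token-lists per sentence; hyps: token lists
    return corpus_bleu(refs, hyps, smoothing_function=smooth)

def paired_bootstrap(refs, sys_a, sys_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A's BLEU beats system B's."""
    rng = random.Random(seed)
    indices = list(range(len(refs)))
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(indices) for _ in indices]
        r = [refs[i] for i in sample]
        a = [sys_a[i] for i in sample]
        b = [sys_b[i] for i in sample]
        if bleu(r, a) > bleu(r, b):
            wins += 1
    return wins / n_resamples

# Toy example: two MT systems, tokenised output, one reference per sentence.
refs  = [[["the", "cat", "sat", "on", "the", "mat"]],
         [["there", "is", "a", "dog", "in", "the", "garden"]]]
sys_a = [["the", "cat", "sat", "on", "the", "mat"],
         ["a", "dog", "is", "in", "the", "garden"]]
sys_b = [["the", "cat", "is", "on", "the", "mat"],
         ["there", "is", "a", "dog", "in", "the", "garden"]]

# A win fraction near 0.5 means the BLEU difference is not meaningful.
print(paired_bootstrap(refs, sys_a, sys_b))
```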
NLG texts need to communicate good content as well as be accurate. Rule-based NLG systems are very good at accuracy, but sometimes struggle to reliably choose appropriate content in a wide variety of circumstances.
Most reviewing is a chore, but reviewing for TACL is fun. I learn things and feel I “add value”, which is much rarer in conference reviewing. Plus I can focus on one paper at a time, since TACL reviewing is spread out across the year.
If an NLG system produces inferior texts once in a while, should we ask a human writer to “post-edit” those texts? I review some of the literature and give some advice.
The Tibco Covid dashboard is a nice example of how NLG narratives can “add value” to complex visualisations. Hopefully we’ll see more dashboards like this!
A colleague asked me if it was true that building neural NLG systems was faster than building rule-based NLG systems. The answer is that we don’t know, because we don’t have good data on this question. However, the weak evidence we do have suggests that building rule-based NLG is no slower, and may be faster, than building neural NLG, at least for data-to-text systems.