I was very impressed by a recent paper from a team at Facebook about a production-ready end-to-end neural NLG system. Especially interesting to me was the “engineering” approach to key issues such as accuracy, data collection, and latency.
I was shocked when a PhD student recently told me that he thought he had to focus on end-to-end neural approaches, because they dominate the conferences he wants to publish in. I'm all for research in end-to-end neural NLG, but fixating on it to the exclusion of everything else is a mistake, especially since end-to-end neural approaches do not currently work very well.
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.
I would like to see more PhD students and postdocs “getting their hands dirty” by collecting real-world data, working with real-world users and experts, and conducting real-world evaluations with users. It’s not easy, but engaging with the real world does help scientific and technological progress.
I recently attended a workshop on Safety for Conversational AI, which discussed how such systems could potentially harm people. Is it possible that NLG systems could harm their users, maybe even contributing to death in the worst case?
This is a personal blog. My son visited home after spending 6 months in a residential special-needs school because of lockdown. It was wonderful to see him again, and I took a few photos of his time at home.
Over the past few weeks, on several occasions I’ve struggled to understand papers because authors made mistakes in references, tables, figures, or formulas. I know that it’s boring for authors to check such things, but it makes life much easier for your readers!
I would love to be able to define objective criteria for evaluating NLG texts. In principle, I think we can use task-based evaluation to measure utility, and some kind of mistake counting to measure accuracy. However, it’s harder to think of a way to measure fluency without relying on human judgements.
Reviewing for big NLP conferences has changed drastically since 1990, when 11 senior researchers reviewed all ACL submissions. Perhaps our expectations about conference papers also need to change, and become more similar to expectations in other scientific fields.
Many people have asked me if OpenAI’s GPT-3 will have a big impact on NLG. I suspect its overall impact will be limited (outside of a few niches), but of course time will tell.