Challenging NLG datasets and tasks
I would like neural NLG researchers to focus on more challenging datasets, and I make some suggestions.
Users want to be able to modify and customise NLG systems on their own, without needing to ask developers to make changes. Academic researchers mostly ignore this, which is a shame, since there are a lot of interesting and important challenges.
This is a personal blog, about how Covid lockdown has affected me. In practical terms I’m much better off than many people I know, but I still find that lockdown life has lost a lot of its “fizz” and become “flat”.
A few observations (not recommendations!) about what it is like to work as a researcher in university and corporate contexts.
I was impressed by a recent paper by Läubli et al., which experimentally compared the results of different human evaluations in MT (e.g., how results differ between expert and non-expert human raters), in the context of understanding when MT systems are “better” than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!
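As a rough illustration of the kind of comparison involved (the scores and system names below are hypothetical, not taken from Läubli et al.), here is a minimal sketch of checking how well system rankings from expert and non-expert raters agree:

```python
# Minimal sketch (hypothetical data): compare how expert and non-expert
# raters rank the same set of MT/NLG systems.
from scipy.stats import spearmanr

# Mean adequacy score per system, as judged by each rater group
# (illustrative numbers only -- not from any actual study).
expert_scores     = {"system_A": 4.1, "system_B": 3.6, "system_C": 4.4, "human": 4.5}
non_expert_scores = {"system_A": 4.3, "system_B": 4.0, "system_C": 4.2, "human": 4.1}

systems = sorted(expert_scores)
rho, p = spearmanr([expert_scores[s] for s in systems],
                   [non_expert_scores[s] for s in systems])
print(f"Spearman correlation between rater groups: {rho:.2f} (p={p:.2f})")

# A low correlation would suggest the two rater groups rank systems
# differently, i.e. the choice of raters changes the conclusions.
```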
Seven papers which I blogged or tweeted about in 2020, covering evaluation, safety, engineering and system building, and a long-term perspective on NLP. I recommend them to everyone; they made an impact on me, and perhaps they will make an impact on you as well!
I was very impressed by a recent paper from a team at Facebook about a production-ready end-to-end neural NLG system. Especially interesting to me was the “engineering” approach to key issues such as accuracy, data collection, and latency.
I was shocked when a PhD student recently told me that he thought he had to focus on end-to-end neural approaches, because these dominate the conferences he wants to publish in. I’m all for research on end-to-end neural NLG, but fixating on it to the exclusion of everything else is a mistake, especially since end-to-end neural approaches do not currently work very well.
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.
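To give a concrete (hypothetical) flavour of what such annotation data might look like once collected, here is a minimal sketch that aggregates per-annotator error markups into error counts per category; the data structure, category names, and majority-vote rule are my own illustrative assumptions, not the protocol from the paper:

```python
# Minimal sketch (assumed data format): aggregate human error annotations
# into per-category error counts for one generated text.
from collections import Counter

# Each annotator marks spans of the generated text and labels the error type.
# Span offsets and category names here are illustrative assumptions.
annotations = {
    "annotator_1": [("incorrect_number", (10, 14)), ("incorrect_name", (52, 60))],
    "annotator_2": [("incorrect_number", (10, 14))],
    "annotator_3": [("incorrect_number", (10, 14)), ("not_checkable", (80, 95))],
}

# Count an error only if a majority of annotators marked the same span
# with the same label.
votes = Counter()
for marks in annotations.values():
    votes.update(set(marks))

majority = len(annotations) / 2
agreed_errors = [err for err, n in votes.items() if n > majority]

counts = Counter(category for category, _span in agreed_errors)
print("Agreed errors per category:", dict(counts))
# e.g. {'incorrect_number': 1}
```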
I would like to see more PhD students and postdocs “getting their hands dirty” by collecting real-world data, working with real-world users and experts, and conducting real-world evaluations with users. It’s not easy, but engaging with the real world does help scientific and technological progress.