For the end of 2020, I thought I’d list the papers that made such an impact on me that I wrote a blog about them (or a blog that was heavily influenced by them), plus a few papers I tweeted about, and the paper of my own that I blogged most about.
Of course “best paper” lists are very subjective; everyone has their own list! Below is mine, and I can wholeheartedly recommend reading all of these papers!
Papers I wrote blogs about
R Grishman (2019). Twenty-five years of information extraction. Natural Language Engineering, 25(6), 677-692. (my blog)
Grishman summarises progress in information extraction over the past 25 years. When I read a paper, I often add notes about important points. This paper was so full of my notes that I could barely read it. A must read for anyone interested in a long-term perspective on NLP! A few of the points which I really liked:
- Progress in information extraction has been significant but not amazing, not nearly as impressive as machine translation (for example). A great reminder that NLP (and indeed AI) is very diverse, and techniques which have a great impact on one part of NLP/AI may have less impact on other areas.
- Researchers dislike performing complex, time-consuming, and expensive evaluations, even if they are more meaningful than simple, quick, and cheap evaluations. Something I’ve often complained about! This unfortunately seems part of the ACL/NLP “culture”, so is very hard to change (but see Final Thoughts below).
- Building systems using ML means less time/effort writing code and rules, but more time/effort in creating high-quality corpora to train models. ML is not a “free lunch”.
N Mathur, T Baldwin, T Cohn (2020). Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. Proceedings of ACL-2020, pages 4984–4997. (my blog)
Mathur et al made a hugely important point about BLEU and indeed other evaluation metrics, which is that we don’t just want to know how well an evaluation metric correlates with human evaluations overall; we also want to understand the distribution of human evaluation scores for different metric values. In particular, the authors point out that if MT system A is a lot better than MT system B, BLEU will almost certainly pick this up. However, if MT system A is only a little bit better than MT system B, then BLEU may not detect this.
In other words, the Pearson correlation which metric validation studies usually report is mostly driven by success at differentiating between very different systems. So a high Pearson correlation does not mean that a metric is good at determining whether a proposed system is slightly better than the state of the art, which is how the academic community mostly uses metrics.
One key lesson is that people who propose new evaluation metrics should report the distribution of human evaluations for each metric value, not just an overall correlation. I look forward to this becoming “standard practice” for people who are proposing new evaluation metrics.
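To make the point about Pearson correlation concrete, here is a small sketch in Python. The scores are invented for illustration (they are not data from Mathur et al), and the names `human`, `metric`, and `pearson` are my own assumptions. It compares the correlation over all systems with the correlation over only the four systems whose human scores are close together.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical system-level scores (made up for this sketch).
# Systems 2..5 are very close in human quality; the others are far apart.
human = [10.0, 30.0, 50.0, 51.0, 52.0, 53.0, 70.0, 90.0]
metric = [12.0, 28.0, 53.0, 49.0, 52.0, 50.0, 69.0, 91.0]

# Correlation over every system, including the very bad and very good ones.
overall = pearson(human, metric)

# Correlation restricted to the four systems of similar quality.
close = pearson(human[2:6], metric[2:6])

print(f"all systems:   r = {overall:.2f}")
print(f"close systems: r = {close:.2f}")
```

With these made-up numbers the overall correlation is very high, while the correlation among the similar systems is actually negative. That is exactly the situation where a metric looks well validated but cannot tell you whether a new system is slightly better than an existing one.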
T Bickmore, H Trinh, S Olafsson, T O’Leary, R Asadi, N Rickles, R Cruz (2018). Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. J Med Internet Res 2018;20(9):e11510 (my blog)
Bickmore et al showed that medical advice given by popular conversational assistants could injure or even kill people! The scenarios were made up (and designed to be difficult), so no one was killed in real life, but the fact that this is possible really floored me.
It’s great to see that the NLP community is starting to take safety and “worst case” issues seriously; indeed I discovered this paper because of a presentation by Bickmore at a workshop on Safety for Conversational AI. But I think we need to go a lot further; it’s not acceptable to release NLP systems that potentially can kill people! Safety and “worst-case” behaviour need to become part of our culture; a system with great average-case performance but atrocious worst-case behaviour is not acceptable to society. Especially as NLP technology expands into more safety/life critical contexts, such as medicine.
Incidentally, I see that when Bickmore himself builds health chatbots, he usually asks users to select from a menu rather than type in a free text question/response, because of safety issues. There is an important lesson here for voice/chat technologies.
A Arun, S Batra, V Bhardwaj, A Challa, P Donmez, P Heidari, H Inan, S Jain, A Kumar, S Mei, K Mohan, M White (2020). Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data. Proceedings of COLING-2020 Industry Track, pages 64–77. (my blog)
Arun et al present an engineering perspective on building a real-world production NLG system. After reading seemingly endless numbers of papers on end-to-end neural NLG which were scientifically bogus, tackled trivial NLG problems, and/or generated garbage texts, it was incredibly refreshing to read about an end-to-end neural NLG system which actually worked and was useful! Also really interesting to read about the problems and challenges faced in building such a system, including focusing on a targeted domain and use case, taking accuracy seriously but also accepting that sometimes inaccurate texts would still be produced (and choosing use cases where this was acceptable), creating high-quality training corpora, and seriously addressing pragmatic engineering issues such as model size and latency. This paper shows “what it takes” to get end-to-end neural NLG technology to work in the real world.
And one thing which really struck me is that from an engineering perspective, “real-world” neural NLG in some ways is not that different from rules/knowledge-based NLG. In particular, I’ve seen a lot of papers on neural NLG which criticise rule-based NLG because of the need to build application-specific rule sets; but while Arun et al did not need to build application-specific rule sets, they did need to put a lot of effort into creating application-specific corpora and data sets. Which goes back to the comment made in the Grishman paper (first one in this blog), which I’ve also heard from people who work in commercial AI: ML approaches reduce the amount of time spent writing code but increase the amount of time spent on creating datasets. Whether this is a sensible tradeoff depends on the context and circumstances.
Other papers I mentioned in tweets or blogs
M Ribeiro, T Wu, C Guestrin, S Singh (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of ACL-2020, pages 4902–4912.
I tweeted about Ribeiro et al’s paper, which uses a “software testing” approach to understand the behaviour of NLP models. Great idea!
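For readers who have not seen the “software testing” idea before, here is a rough sketch in Python of two of the test types the paper describes: a minimum functionality test (simple cases the model must get right) and an invariance test (a change that should not affect the prediction). Everything here is an assumption for illustration. The keyword-based `toy_sentiment` model and the helper functions are hypothetical, and this does not use the authors’ CheckList tool or its API.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def toy_sentiment(text):
    """Hypothetical keyword-counting 'model' standing in for a real NLP system."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"

def minimum_functionality_test(model, cases):
    """Return (text, expected, predicted) for every simple case the model gets wrong."""
    return [(text, expected, model(text))
            for text, expected in cases
            if model(text) != expected]

def invariance_test(model, template, fillers):
    """Check that swapping the filler (here a name) never changes the prediction."""
    preds = {f: model(template.format(name=f)) for f in fillers}
    return len(set(preds.values())) == 1, preds

mft_failures = minimum_functionality_test(toy_sentiment, [
    ("This is a great flight.", "pos"),
    ("This is not a good flight.", "neg"),  # simple negation
])
inv_ok, inv_preds = invariance_test(
    toy_sentiment, "{name} said the food was terrible.", ["Mary", "Ahmed", "Li"]
)

print("MFT failures:", mft_failures)
print("Invariance holds:", inv_ok)
```

The toy model passes the invariance test but fails the negation case, which is the kind of specific behavioural failure that a single accuracy number on a test set would hide.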
C van der Lee, A Gatt, E van Miltenburg, E Krahmer (2021). Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech & Language, vol 67.
I also tweeted about this excellent paper by van der Lee et al, which gives advice on how to do human evaluations in NLG. Everyone doing a human evaluation should read it!
C Thomson and E Reiter (2020). A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems. Proceedings of INLG 2020, pages 158–168. (my blog)
With regard to my own work, I wrote a number of blogs throughout the year about the work that Craig Thomson and I are doing on evaluating the accuracy of generated text. This is unlikely to have as much impact as the other papers mentioned here, but I think a better understanding of accuracy and how to evaluate it is essential to progress in NLG! And I encourage people to join our shared task on evaluating accuracy!
Final thoughts: Are we doing better science?
One of the things which most frustrates me about the ACL/NLP community is its acceptance of weak evaluations and dubious data sets, and its reluctance to explore non-neural approaches. This is not the way to do good science! But anyways, I think I am seeing some movement with regard to evaluation (less so with the other issues). There is more awareness that we need better evaluations, and that techniques which are quick, cheap, and easy should nonetheless not be used if they are meaningless. As always when changing scientific culture in a community, it’s 2 steps forward and 1.5 steps back, but I think there is progress.
In 2016 I gave an invited talk at NAACL about NLG evaluation where I criticised the use of BLEU, and many people, including senior members of the community, responded that they saw nothing wrong with using BLEU (in fairness many other people told me they agreed with me). I don’t think I would get this reaction if I gave such a talk in 2021, and this is progress.
Best wishes to my readers for 2021!