Our 2022 Publications: NLG Evaluation, Requirements, Resources

I thought I would end 2022 by giving a summary of the papers published by my students and me in 2022. I’ve characterised each paper as being about requirements for NLG, evaluation of NLG, or resources for NLG; apologies in advance, because of course some papers are on the borderline between these areas!

No one in my group published a “leaderboard” paper in 2022 claiming to do 1% better on some existing dataset. It’s not part of our “research culture”.

NLG Requirements

One of the most important challenges in NLG is understanding what users want NLG systems to do. I.e., where does NLG “add value”, what do users want and expect, and how should NLG developers work with users to understand their needs? This topic is incredibly important, and I’m happy that my students are shedding light on it!

Requirements gathering methodology for NLG: Francesco Moramarco and his collaborators published a great paper at NAACL (User-Driven Research of Medical Note Generation Software) on how they worked with users to understand their needs and constraints, in the context of building an NLG system which summarised doctor-patient consultations. They used a wide range of techniques including personas, mockups, wizard-of-oz studies, and a live beta test. Amongst other findings, they discovered that doctors wanted summaries to be generated incrementally as the consultation progressed, which is not something that current systems do. This paper won the Best Paper prize for the Human-Centered NLP special theme at NAACL.

Text and Graphics: Simone Balloccu published a paper at INLG on Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain which tested the “value” to users of adding NLG explanations to visual presentations of data in a health chatbot. Overall Simone found that adding NLG explanations increased users’ understanding of the data; see the paper for more details.

Useful content: Jawwad Baig published a paper at the NLG for Health workshop on DrivingBeacon: Driving Behaviour Change Support System Considering Mobile Use and Geo-information. Jawwad is working on providing feedback to drivers to help them drive more safely. After talking to users, he assessed the impact of adding information on mobile phone usage and also on unsafe driving near sensitive areas such as schools. Results were not conclusive; further work is needed.

NLG Resources

Of course resources such as datasets and software are essential for NLG research, and we are trying to contribute here as well.

Corpora: Simone and his collaborators released a dataset of annotated therapy dialogues (Anno-MI: A Dataset of Expert-Annotated Counselling Dialogues) (Github). They created this to support their own research on the emotional and engagement aspects of dialogues with health chatbots, and also released it as a resource to the community (there are very few publicly available therapy corpora).

Large projects: We also contributed to large group projects to create resources. In particular, Craig Thomson contributed to the GEM benchmarking tool, and I made a small contribution to BLOOM.

NLG Evaluation

Most of our papers were about NLG evaluation, especially human evaluations. Note that some of these papers are not yet available in the ACL Anthology.

Human-metric correlation: Francesco published a paper at ACL (Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation) where he presented a careful human evaluation of summaries of doctor-patient consultations, and checked whether this agreed with metric evaluations. He found that metrics overall were not great predictors of human evaluation, and also that the metric which overall had the highest correlation with human evaluation was character edit distance (Levenshtein distance).

Better human evaluations: Francesco also had a paper at EMNLP (Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation) which showed that using structured “checklists” in a human evaluation increased inter-annotator agreement.

Evaluating emotional impact: As mentioned above, Simone is interested in the emotional impact of health chatbots. He presented a paper at the ACL Human Evaluation workshop (Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching) on how emotional impact could be evaluated. He also had a paper at the EMNLP GEM workshop (Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations) which discussed some of the difficulties and challenges in evaluating whether responses in a therapy dialogue were appropriate.

Annotating errors: We continued our work on error annotations for texts produced by neural NLG systems (we should have a journal paper about this in 2023). Craig Thomson had a paper at INLG Generation Challenges (The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study) which examined whether annotation exercises were replicable, i.e. whether similar results were found when an exercise was repeated. Barkavi Sundararajan presented a paper at the EMNLP GEM workshop (Error Analysis of ToTTo Table-to-Text Neural NLG Models) where she analysed errors in texts generated by ToTTo models, using a modified version of the protocol that Craig developed in his PhD thesis.

Comings and Goings

I was very happy that two of my students obtained their PhDs. Stephanie Inglis was awarded a PhD on Summarising Unreliable Data, and Craig Thomson was awarded a PhD on Complex Data-to-Text. Both Steph and Craig stayed in Aberdeen; Steph took a job at Arria, and Craig is now working as a postdoc on the ReproHum project.

I was also pleased to welcome three new PhD students (who have not yet published any papers about their PhD work).

  • Iniakpokeikiye Thompson started a PhD on driving feedback apps in Nigeria (unsafe driving is a huge problem in Nigeria). I’ve been astonished to discover how little data, corpora, and resources are available in Nigeria (or indeed elsewhere in Sub-Saharan Africa); I suspect much of the PhD will be devoted to resource creation.
  • Adarsa Sivaprasad has just started a PhD (within NL4XAI) which will probably look at generating NL explanations for non-neural ML models, such as decision trees or Bayesian networks. I suspect Adarsa will also work on evaluation of explanations.
  • Mengxuan Sun has just started a PhD which will look at using NLG and health chatbots to support patients who are living at home and managing cancer.

2022 in Retrospect

I think NLG requirements, resources, and evaluation are extremely important topics, even if they are not very trendy, and I was happy to see my students progress our knowledge about these topics, and indeed that one paper won an award at NAACL (I was also very pleased to get an INLG Test of Time award for my 2007 paper on data-to-text). I think research on these topics is more important (and will have more long term impact) than papers showing how tweaking a model produces a small gain in a leaderboard task. And hopefully we will have many more papers on these topics in 2023!
