If we want to build better NLG systems, we need to address
- Task: What sorts of texts do users want, in what contexts?
- Data: What data do we use to create models?
- Model/algorithm: How do we actually generate texts?
- Evaluation: How do we determine how good texts are, and what problems they have?
To me, all of the above are essential to progress in NLG! So I find it disappointing that the NLG and NLP communities largely fixate on models and ignore everything else. Models and algorithms are important, but only if they are based on understanding what people want, build on high quality data sets, and are properly evaluated!
Below I say a bit more about these, and give some examples of recent work from my students and collaborators.
The most important question when building a practical NLG system is “What do users want?” In other words, what are the requirements for the NLG system? This includes issues such as
- What insights do users want to see? What mixture of text and graphics should be used to communicate these?
- What language should be used to express the insights?
- How will users interact with the NLG system?
There is very little published on this topic in the NLP research community, which is a shame. I’m happy to say that one of my students, Francesco Moramarco, will have a paper at NAACL about how he and his colleagues worked closely with clinicians to understand what they wanted and needed from a system which summarised doctor-patient consultations. For example, they discovered that clinicians wanted summaries to be produced incrementally in real-time during a consultation, which is not the way summarisation systems usually work.
I really hope to see more such papers published; I think it's crazy that the community seems to have so little interest in understanding what people actually want NLG (and NLP) systems to do!
The second question in building an NLG system is data. If we use a machine learning approach, we generally need a parallel data-text corpus, or at least an annotated corpus of human-written text. Producing high-quality corpora is a lot of work, and one of the things that most angers me about the research community is that it seems to push researchers towards using inferior data sets even when better ones exist! This is insane if we care about science.
One positive development over the past few years is that more effort seems to be going into creating high-quality datasets for interesting tasks, and this work is being recognised. I was very happy that the paper Inducing Positive Perspectives with Text Reframing, which is largely about an (impressive) dataset, won an outstanding paper award at ACL this year.
Anyway, looking at my students, Francesco and his colleagues have just released a corpus of mock patient-doctor dialogues, and Simone Balloccu (another student) and his colleagues have just released a corpus of (mock) patient-therapist dialogues. Both of these corpora use actors instead of real patients, since data privacy issues make it very hard (impossible?) to publicly release medical dialogues with real patients, but they should still be very useful to researchers who want to build NLG systems that interact with patients.
Incidentally, another aspect of “data” is extracting high quality input data from available data sources; eg high-quality sports data to feed into a sports narrative generator. This is a major issue in building real-world NLG systems, but something which the academic community continues to ignore.
There is of course a huge amount of research on models and algorithms for NLG, so I won't say much about this here. But I do want to say that I would love to see more models/algorithms which explicitly address a user need/requirement (as described above), as opposed to models whose main goal is to demonstrate small improvements on dubious leaderboards.
One example of this is a recent INLG paper on explaining decision tree predictions in a way which takes expectations (from background information) into consideration, by Sameen Maruf and colleagues at Monash University, who I am collaborating with. The focus is not on fancy models (no mention of transformers!), but rather on carefully analysing different scenarios and checking with users whether the system's behaviour is appropriate.
I have written a lot of blogs about evaluation, so I won't go into detail here. But I will say that some of the recent work of my students on evaluation is described in the blogs Why is ROUGE so popular?, Humans make mistakes too, and Pragmatic correctness is a challenge for NLG. Simone Balloccu also has a paper coming out soon, Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching, which looks at an important but rarely researched type of evaluation: emotional/stress impact.
I must say that one thing I am really happy about is the increasing attention the research community is paying to research on evaluation; this is exciting to see and long overdue! I recommend Gehrmann's recent survey (Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text) to anyone who wants to learn more about recent research on NLG evaluation.
On the other hand, one thing that really annoys me is that evaluation *practice* (ie, the evaluation sections of papers which present models) remains dire in many cases. For example, the continued reliance on ROUGE for summarisation evaluation is insane if we are serious about doing real science!
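To make the ROUGE problem concrete, here is a minimal toy sketch (my own illustration, not the official ROUGE package, and the clinical sentences are invented examples): a summary that drops a negation, and so says the clinical *opposite* of the reference, still gets a very high ROUGE-1 F1 score because the metric only counts word overlap.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Toy ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each word counted at most as often as it appears in both
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the candidate drops "not", reversing the meaning
reference = "the patient does not have diabetes"
candidate = "the patient does have diabetes"
print(round(rouge1_f1(reference, candidate), 2))  # → 0.91
```

Five of the reference's six words reappear in the candidate, so the score is about 0.91 even though the summary would be dangerously wrong in a medical setting. This is exactly why overlap metrics alone cannot validate summarisation systems.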
If we want to learn how to build useful NLG systems, we need to look at task/requirements, datasets, and evaluation as well as models and algorithms. The fixation of the NLP research community on models is not healthy, especially when models focus on beating the state of the art on artificial leaderboards instead of addressing real user needs.
The increasing interest in evaluation research and novel datasets is encouraging, but this needs to be matched by the use of state-of-the-art evaluation and datasets in research projects. And we need to acknowledge the importance of understanding requirements and user needs!