Five years ago I heard my first paper on neural NLG, at NAACL 2016. I went into the presentation interested and curious; I came out of it shocked by the paper's terrible scientific quality: inappropriate dataset, meaningless evaluation, no knowledge of previous work, and no understanding of the domain or what users actually wanted. That was five years ago; have things gotten better?
On the positive side:
- Real progress on evaluation; we know how to do this better
- Better datasets, and increasing use of datasheets
On the negative side:
- Lots of papers continue to use poor datasets and evaluations, in part because of the leaderboard mentality
- Too much focus on simple NLG tasks, and too little attention on challenging NLG problems and issues
So definitely progress, but not as much as I had hoped for.
Five years ago, evaluation “research” seemed to consist mostly of proposals for new automatic metrics, many of which were poorly validated and few of which were actually used. But over the past few years, we’ve seen a real blossoming of research on evaluation, with excellent papers on human evaluation (blog and blog) and on exploring fundamental issues with metrics (blog). I’m also very happy to see people starting to think seriously about safety issues and worst-case performance (blog). As a science, evaluation in NLP has really advanced since 2016, which is great.
Unfortunately, while we have made good progress in our scientific understanding of how to evaluate NLG systems, evaluation practice (i.e., evaluations conducted by people who are not evaluation researchers) remains of mixed quality. Some people do excellent evaluations, but I continue to see many papers with poor evaluations.
What is worse is that some people who probably know better continue to use poor evaluations because of the leaderboard mentality. In other words, they want to show that their system improves the state of the art on an established “leaderboard”, which specifies task, dataset, and evaluation. So they use the evaluation specified in the leaderboard, even if it is deeply flawed (which is often the case, in part because many leaderboards are several years old).
Five years ago, little attention was paid to datasets. It seemed like a lot of researchers just grabbed data off the internet without really understanding what was in the data they were downloading, which led to big problems with data appropriateness and quality, as well as ethics. Despite the importance of data, creating high-quality datasets was regarded as a low-prestige activity. The most frustrating thing was that few people seemed to care; indeed, it sometimes seemed that researchers were encouraged to use bad datasets, in part because of the above-mentioned leaderboard issues.
In 2021, I think things have gotten a bit better. We’re seeing more datasets accompanied by datasheets, which definitely helps; in NLG, I think GEM in particular has done a great job from this perspective. It’s also encouraging to see more papers about datasets, although these are still often regarded as less prestigious. However, many researchers continue to use poor datasets, again perhaps because of leaderboard issues.
Craig Thomson (one of my PhD students) was working with the “Rotowire” dataset last year and discovered that it had some problems, so he created an improved dataset in this domain which fixed many of them. It will be interesting to see whether other researchers in this space start using Craig’s dataset (or, better yet, improve it), or whether they continue to use the original dataset with all of its quality issues.
In my experience, some of the biggest challenges in building useful NLG systems include choosing content and insights, handling edge cases robustly, and integrating generated texts into a multimodal information presentation system. Requirements analysis, software testing, and maintainability are also major challenges when building real-world NLG systems.
From this perspective, it’s a real shame that in the past there was little work on these problems in the neural NLG community. However, I think this is starting to change, especially as neural NLG systems enter real-world use. I would also like to see more work on using neural NLG in complex NLG tasks; there isn’t a lot of value in developing complex neural models for tasks that can easily be done using rules and templates.
Note that the above challenges are not pure NLP ones. Content determination is probably more of an AI reasoning problem than a pure NLP task; effective multimodal presentation is as much about HCI as NLP; and handling edge cases (plus requirements and testing) is classic software engineering. Unfortunately, the NLP community (in 2021 as well as 2016) seems somewhat insular, with little interest in or interaction with other fields of CS (except for ML).
So overall there is definitely progress, but there are still lots of bad papers published. What is really frustrating is that the “culture” of the NLP community often seems to discourage good science. As mentioned above, this is largely due to a fixation on showing improved performance on existing “leaderboard” tasks. This is fine for leaderboards based on a solid evaluation, dataset, and task; but many are not, and such leaderboards incentivise researchers to do bad science.
I also suspect that the leaderboard culture discourages people from working on important problems that do not easily fit into a leaderboard, such as multimodality and requirements analysis.
Of course, lots of other people have complained that leaderboards are not the way to do good science! Hopefully the culture of the NLP community will change, but this will take time.
There are lots of excellent scientists working on neural NLG (I refer to many of them in the above-cited blogs), and it is thanks to their hard work that scientific quality has improved. I really appreciate their efforts! However, our scientific culture also needs to promote and encourage good science, and it is discouraging that this is not always the case.