The ACL 2023 conference has a special “reality check” track, which amongst other things asks for papers that look at whether “reported performance improvements on NLP benchmarks are meaningful”. I may submit a focused paper to this track with some colleagues, but I’m also wondering whether to submit a more general opinion piece or position paper (not sure, still thinking about this). Because I think the leaderboard approach to NLP really hurts the field! I’ve seen zillions of papers which present small improvements on leaderboard tasks, and very few of them have been worth reading. Basically leaderboard papers are (A) often scientifically dubious and/or (B) of little relevance to real-world NLP (or the science of language) because of inappropriate datasets and/or evaluation. Also (C), leaderboard fixation discourages people from working on very important issues that do not fit into leaderboard contexts.
I should say that by “leaderboard task”, I mean a dataset and evaluation metric where a public record is kept of the systems which have the highest score on the metric for this dataset.
The first point is that if we look at leaderboard papers which are stand-alone, i.e. not done as part of a shared task, then I suspect a lot of these are scientifically dubious, because of the following factors.
- Poorly done experiments: I’ve been involved in several reproducibility projects and events over the past few years, and one thing I’ve learned is that a lot of experiments are flawed because of sloppy techniques and/or buggy code. It doesn’t help that the academic community shows so little interest in quality assurance (blog).
- Knowledge of test data: A lot of authors in this space are very familiar with the test data in the leaderboard data set. Even if they don’t explicitly train on test data, being familiar with the test data may guide them into approaches which are essentially tuned to the test data.
- Multiple hypotheses: It’s easy to run hundreds or even thousands of experiments, trying small tweaks or even just a different random seed. The laws of probability tell us that if you run a thousand experiments, you’re probably going to get good results on a few occasions just by being lucky, even if your system is no better than the state of the art (blog).
I should say that the above problems should *not* occur in a well-designed and executed shared task, where participants do not see the test data before submission and the actual evaluation is done by someone else. But many (most?) leaderboard papers are done stand-alone, not as part of a shared task.
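The multiple-hypotheses problem is easy to see with a small simulation. The sketch below (my own illustration, with made-up numbers: a 500-item test set and a system whose true accuracy is exactly the baseline’s 80%) shows how the best score out of a thousand runs drifts well above the honest score from a single run, purely by luck.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

def simulate_best_of(num_experiments, num_test_items=500, true_accuracy=0.80):
    """Run num_experiments simulated evaluations of a system whose true
    accuracy equals the baseline, and return the best observed score."""
    best = 0.0
    for _ in range(num_experiments):
        # Each test item is scored correct/incorrect independently.
        correct = sum(random.random() < true_accuracy
                      for _ in range(num_test_items))
        best = max(best, correct / num_test_items)
    return best

# One honest evaluation vs the best of a thousand tweaks/seeds:
# the second number is reliably higher, with no real improvement at all.
print(simulate_best_of(1))
print(simulate_best_of(1000))
```

Nothing about the system improved between the two lines; only the number of chances to get lucky did. This is exactly why a small gain over the state of the art, cherry-picked from many runs, means very little.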
If we expect leaderboard results to be meaningful predictors of real-world utility, then the datasets used in leaderboard tasks need to be high-quality and representative of real-world usage. Unfortunately, a lot of leaderboard datasets are nothing like real-world usage. For example, the most common datasets for news summarisation, CNN/DailyMail and XSum, do not contain actual news summaries (blog). In general the academic community seems to have little interest in encouraging the usage of high-quality datasets, and indeed sometimes seems to encourage the use of *low-quality* datasets (blog).
A related point is datasets which contain synthetic data. For example, the WeatherGov weather-forecast corpus, which was the subject of many leaderboard-type papers, consisted of texts produced by a rule-based NLG system, not actual weather forecasts written by people. Which meant that the goal of the leaderboard task was to build a neural or ML system which could reproduce a rule-based system. I don’t understand why this is interesting or useful (blog). Certainly no one in the real world is going to use a complex neural black-box model which approximates a white-box rule-based model, they’ll just use the original rule-based model.
Lastly, in real-world ML and NLP, data cleaning and preprocessing, and dealing with data quality issues in general, are huge pain points. It’s possible to set up leaderboards which include this, but most don’t, probably because researchers are more interested in playing with models than in spending time on data cleaning, despite the fact that data cleaning is often more important than modelling in real applications.
Similarly, if we expect leaderboard results to be meaningful predictors of real-world utility, then the evaluation techniques used in leaderboard tasks need to predict real-world utility. Unfortunately, most leaderboard evaluations use metrics which have little correlation with real-world utility; for example I am amazed at the continuing use of ROUGE despite its well-documented flaws (blog). A related problem is the use of metrics such as BLEU which only have a weak correlation with utility, which means that small differences in metrics scores (which is what leaderboards focus on) may not translate into differences in real-world utility (blog).
In this space, I am especially unhappy that most leaderboards in neural NLG refuse to properly measure accuracy, probably because doing so is a lot of work (blog). Accuracy is of paramount importance in most NLG use cases, so any leaderboard in such cases which ignores accuracy (or measures it using flawed metrics which give inaccurate and incomplete assessments of accuracy) is going to have very little relevance to real-world utility.
Last but not least, almost all leaderboards that I have seen focus on average-case performance, but in many real-world use cases we care about worst-case performance as well. Especially if there are safety issues (blog)! That is, we often want to guarantee a minimum level of quality, as well as promise a good average level of quality.
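To make the average-case vs worst-case distinction concrete, here is a toy illustration with invented per-document quality scores for two hypothetical summarisation systems. System B wins on the average-case leaderboard, but System A is the one you would want to deploy anywhere a catastrophic output matters.

```python
# Hypothetical per-document quality scores (0-1 scale, made up for illustration).
system_a = [0.85, 0.90, 0.88, 0.92, 0.87]  # consistent, no bad outputs
system_b = [0.99, 0.99, 0.98, 0.99, 0.50]  # higher average, one serious failure

def average_case(scores):
    """Mean score: what a typical leaderboard ranks on."""
    return sum(scores) / len(scores)

def worst_case(scores):
    """Minimum score: the guarantee a deployed system can actually offer."""
    return min(scores)

# System B tops the leaderboard (0.890 vs 0.884 on average),
# yet its worst output (0.50) is far below System A's worst (0.85).
```

A leaderboard that only reports `average_case` makes B look strictly better; one line of `worst_case` reporting would reverse that judgement for safety-sensitive uses.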
Ignoring what really matters
One of my biggest complaints about leaderboards is that they ignore some of the most important challenges in NLP, including:
- Requirements: What do users actually want NLP systems to do? This is hugely important for real-world success (blog), but mostly ignored by NLP researchers, and completely ignored by leaderboard papers. In *some* cases the people who create leaderboard tasks do try to consider real-world requirements, but the people who build systems for leaderboards ignore requirements. I suspect we would see more papers on requirements, novel use cases, and related issues such as text-vs-graphics if we moved away from leaderboards…
- Maintenance, including domain shift: The real world changes, which means that models and systems which work well in 2022 may not work well in 2025, or even 2023. We know from software engineering that most of the lifecycle cost of a successful software product is maintenance, so I am very disappointed to see so little about maintenance, robustness, domain shift, etc in the academic NLP literature. Again I wonder if this is due to leaderboard fixation, since leaderboard papers don’t need to worry about maintenance issues.
- Testing and quality assurance: It is very hard to test NLG systems, especially neural systems which are stochastic in the sense that they produce different outputs on different runs. This is a huge problem in real-world NLG, and I am very disappointed that so few academics are interested in it. Again, I suspect this is due in part to leaderboard fixation, since leaderboards ignore quality assurance issues.
Finally, the goal of research of course is scientific insights about language, communication, etc as well as useful technology. While some leaderboard papers do make claims about this, I think it’s very hard to make robust scientific claims based on a system doing slightly better than the state of the art on a leaderboard, and I don’t think I have ever been convinced by such claims in a leaderboard paper. The leaderboard mentality encourages incremental tweaking and applying model X with tweak Y to problem Z, which is not the way to make fundamental progress.
My personal opinion is that the fixation of the academic NLP community on leaderboards leads to a large number of low-quality papers proposing small questionable improvements on leaderboard tasks of little relevance to real-world NLP (or scientific insights about language, etc). This fixation also reduces the amount of research on really important questions that do not fit the leaderboard model, such as understanding what users want NLP systems to do. In short, leaderboards are hurting the field!
Anyways, I now need to decide whether to expand the above into a proper opinion piece paper, not sure…
5 thoughts on “I dont like leaderboards”
I would love to read that position paper you plan to write. On that matter, do you know if we can search on aclanthology for past position papers?
Hi, I don’t think the ACL Anthology explicitly records whether a paper is a position paper in its meta-data. Of course you can always search for papers that have “Position Paper” in their title.
Hello, it is a really interesting blog post. I am a first-year PhD student and I have just started working on an NLG task. The first thing I noticed is that it is really hard to evaluate an NLG model. This blog post has given me so much food for thought and I may need to read it multiple times to carefully collect my thoughts.
I will ask one question at this moment though. You mentioned that it is really hard to test NLG models due to their stochastic nature. What is a good way to deal with this? The previous papers on the task on which I am currently working sidestep this issue by using greedy decoding, which I did not like.
Sorry for the long comment. Thanks a lot for the post.
Hi, by “testing” I meant software testing as done in software engineering. Very tough challenge in commercial NLG, but essential for user acceptance. Current solutions tend to be ad-hoc, we need something better.
Academic “evaluation” is related but different. I can’t give specific advice without more information (feel free to email me), but I’ve written a number of blogs about NLG evaluation which might be useful. Most generic is https://ehudreiter.com/2017/01/19/types-of-nlg-evaluation/ (this is pretty old but fundamentals have not changed).