
Does chatGPT make leaderboards less meaningful?

One of the things I most dislike about academic NLP research is the focus on leaderboards. A leaderboard is essentially a scorecard of how well different systems do on an NLP task, ie how well they perform on a fixed dataset as measured by specified evaluation metrics. The Allen Institute for AI maintains a public set of leaderboards for many NLP tasks; the XSum leaderboard, for example, tells me that the BabelTar system (at the time I am writing this blog) has the best evaluation scores of the systems entered into the leaderboard.

I explained my dislike of leaderboards in a previous blog. Among other things,

  • Leaderboards lead people to focus on small improvements and tweaks on existing datasets. I would prefer to see work on truly novel ideas and applications, new datasets, thorough evaluations, and deep analyses.
  • Leaderboards lead to over-optimisation on the target datasets and metrics. If a system does better on the XSum leaderboard, is it because it's better summarisation technology or because it's better optimised to the idiosyncrasies of the XSum task?
  • Leaderboards entrench old datasets. For example, summarisation research has been dominated for years by CNN/Daily Mail and XSum, both of which are seriously flawed, because these are the most prominent leaderboards.
  • Leaderboards also entrench old evaluation techniques. Eg, I suspect that one of the main reasons BLEU and ROUGE are still heavily used, despite their faults and the existence of better alternatives, is that they are embedded in many leaderboards.

Please note that I very much *support* shared tasks, such as WMT, where participants are invited to create solutions for a novel dataset, which are submitted to a contest and evaluated by the organisers. None of the above critiques apply to well-run shared tasks!

Anyways, I suspect (and hope…) that leaderboards are going to be harder to justify in an era where the best NLP systems are usually based (at least in part) on large language models (LLMs), such as chatGPT or GPT4, which are trained on substantial chunks of the Internet. 

Problem: Leaderboard datasets are on the Internet

The first problem is that leaderboard datasets are usually published on the Internet, which means that they are probably part of the training data for GPT-like LLMs. Since a fundamental principle of machine learning is that ML systems cannot be evaluated on their training data, this means that systems based on chatGPT (etc) cannot be entered into leaderboards whose data is publicly available on the Internet.
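
To make the contamination worry concrete, here is a minimal sketch of how one might check whether a test example overlaps with a document from a training corpus, using shared word n-grams. This is only an illustration: the function names (`ngrams`, `overlap_fraction`) and the choice of n are my own assumptions, and of course we generally cannot run such a check against the training data of a closed commercial LLM.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-tokenised)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_example: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test example's n-grams that also occur in corpus_doc.

    Hypothetical helper for illustration; n=8 is an arbitrary default.
    """
    test_grams = ngrams(test_example, n)
    if not test_grams:
        # Example is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(test_grams & ngrams(corpus_doc, n)) / len(test_grams)


if __name__ == "__main__":
    example = "the cat sat on the mat"
    crawled_page = "yesterday the cat sat on the mat again"
    print(overlap_fraction(example, crawled_page, n=3))
```

A high overlap fraction suggests the example (or its source text) appears in the corpus; a low one proves little, since paraphrased or reformatted copies would be missed.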

I realise that efforts are sometimes made to exclude leaderboard datasets from LLMs, for example by requiring accounts and logins to access the dataset. But it's very hard to stop datasets from “leaking” (eg, a researcher who uses a dataset might include it in a public GitHub repo). The leaderboard datasets which I am most familiar with are all publicly available on Hugging Face or GitHub.

A related issue is that even if it's somehow possible to exclude a leaderboard dataset from LLM training data, a lot of leaderboard datasets are derived from public Internet resources. For example, the XSum task is to generate the first sentence in a BBC news article from the remainder of the article (ie, if there are 20 sentences in an article, the XSum task is to generate sentence 1 from sentences 2-20). Since the BBC articles are on the web, they will be part of LLM training data and hence inappropriate for leaderboards even if the explicit XSum dataset could somehow be hidden.
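
The XSum setup described above can be sketched in a few lines. This is not the actual XSum preprocessing code; the function name `make_xsum_style_pair` and the naive regex sentence splitter are assumptions for illustration. The point is simply that anyone with the public article can reconstruct the (input, target) pair.

```python
import re


def make_xsum_style_pair(article: str) -> tuple[str, str]:
    """Split an article into (source, target) as described above:
    target is the first sentence, source is everything after it.

    Hypothetical helper; uses a naive splitter on ., ! and ? followed by whitespace.
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", article.strip())
        if s.strip()
    ]
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to build a pair")
    target = sentences[0]
    source = " ".join(sentences[1:])
    return source, target


if __name__ == "__main__":
    article = "Storm hits coast. Winds reached 90mph. Thousands lost power."
    source, target = make_xsum_style_pair(article)
    print("SOURCE:", source)
    print("TARGET:", target)
```

So an LLM that has seen the full BBC article during training has, in effect, seen both the input and the reference output.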

Problem: LLMs are constantly evolving

One of the central assumptions of leaderboards is that system scores are static. Ie, once Allen AI has calculated a leaderboard score for BabelTar on XSum, this score will continue to be valid indefinitely (at least for this version of BabelTar). This means that other participants in the leaderboard know the score they need to beat if they want to do better than BabelTar.

However, recent commercial LLMs such as chatGPT and GPT4 are not static; they are constantly being upgraded. They are commercial products, and as such it makes sense for them to be constantly updated and improved. But this means that an NLP system which is based (even partially) on one of these systems is also going to evolve. Hence, if it is entered into a leaderboard, its score will change frequently; it won't be static.

From a scientific perspective, if we see leaderboards as simply presenting current scores for different systems, we could have leaderboards with dynamically changing scores. But leaderboard culture revolves around beating the top system and claiming the top slot for your own system, and it's not clear what this means in contexts where system scores are constantly changing.

A related point is that a lot of leaderboard papers are based on hyper-tweaking models to the specifics of the leaderboard task. But this hyper-tweaking may not work well for systems based on evolving LLMs, since today's tweak may not work tomorrow. Eg, if I create a super-clever and hyper-tweaked prompt for chatGPT which leads to great performance on a task today, I may discover that this super-clever prompt doesn't work nearly as well next week, because of changes in chatGPT.

Problem: LLM improvements are more significant than leaderboard differences

Finally, one thing we see with LLMs is that they are constantly getting better as they get bigger. This makes small improvements (and usually the difference between top-scoring leaderboard systems is small) of even more dubious value. In other words, if we see steady significant improvements in performance over time due to LLMs getting bigger, why should we care about the much smaller improvements which dominate leaderboards, especially given the above concerns?

Final Thoughts

In summary, LLMs make an enterprise which is already scientifically questionable (leaderboards) even more scientifically dubious. Of course, this doesn't mean that academics will stop writing leaderboard papers right away, since leaderboards have become embedded in the research culture of many organisations. But hopefully the usage and prominence of leaderboards will decrease over time, as will the “culture” of focusing research on leaderboard performance.

I want to emphasise that I am not against comparing systems! But this should be done either via a shared task, or by measuring real-world utility.

LLMs are changing academic NLP research in many ways, some of which are depressing; for example replicability is much harder for systems based on constantly-updated LLMs. But LLMs have positive as well as negative impacts, and I for one will applaud chatGPT if it encourages researchers to move away from leaderboards!
