
Make your next experiment a bit better

Over the past 6 months, I’ve given a number of seminars which were partially about problems I see in the rigour and reproducibility of NLP evaluations, and how I’d like this to be improved. I won’t go into detail here since I’ve discussed this topic in previous blogs (Challenges in Evaluating LLMs, Common Flaws in NLP Evaluation Experiments, Unresponsive Authors and Experimental Flaws, A bad way to measure hallucination, etc); you can look at the PDF of my most recent seminar (at Peking Uni) if interested.

After most of these talks, attendees have told me that they appreciated and agreed with what I was saying in principle, but that it simply was not realistic to address all of the issues I raised. Ie, doing very rigorous and replicable experiments is a lot of work which brings limited benefits (most reviewers don’t care about these issues).

Speaking of reviewing, on a number of occasions recently I have, as a reviewer, pointed out serious concerns about experiments, including data contamination and useless metrics. Most of the time, authors basically respond that there are plenty of papers at “high-prestige” venues such as ACL which have similar problems, so why am I picking on them? Ie, they see that lots of xACL papers have data contamination or other serious flaws in their evaluations. So they assume that it’s fine to have garbage evaluations in xACL papers, and think it’s unfair when a reviewer (like me) complains about their evaluation.

I find this very depressing. I work a lot with medical researchers, and in medicine I do see a commitment to high-quality experiments. Of course some medical experiments are flawed, but the researchers (or at least the ones I talk to) want to do high-quality, replicable experiments; they see this as essential to scientific progress. So it’s depressing to compare this perspective to the one I see in NLP…

Warning: Rant

I feel strongly about this, so I am going to rant – skip to the next section if not interested!

Reliable, repeatable experiments are at the core of science and indeed of the scientific method. Science rests on careful hypothesis-testing, and people who are not interested in doing this are not scientists. Science is also an incremental activity, where we build on the work of previous scientists; but this only works if the work of previous scientists can be trusted! From this perspective it is a shame that NLP researchers are reluctant to correct (or retract) flawed papers.

It sometimes feels like NLP researchers mostly want to get top scores on leaderboards, with little interest in whether this means anything scientifically. Some leaderboards are based on good experiments, some are based on rubbish experiments, and some are based on experiments which initially were good but now are rubbish (eg, because of data contamination). But many researchers don’t seem to care about whether the experiments mean anything, they just want to “win” the leaderboard. Which may make sense when playing a computer game, but not if the goal is scientific progress.

End rant

Can we encourage more rigour?

At the end of the discussion after my latest seminar (Peking Uni), people felt that while doing everything I recommended might not be realistic, it should be possible to do some of it. Ie, to incrementally increase the rigour and replicability of their experiments. I think this makes sense; incremental change is more realistic than radical change.

So I’d like to conclude by asking my readers to try to do a bit better with regard to experimental rigour and reliability. For example, largely following my Ten tips on doing a good evaluation, consider doing a few of the things below (you don’t need to do everything!):

  • If you’re working on an applied project, ask users what they care about, and try to nudge your evaluation in this direction.
  • Don’t use BLEU or ROUGE; there are plenty of better alternatives. If doing a human evaluation, consider annotation instead of ratings or rankings.
  • Take a look at your test data (don’t just treat it as a black box) and check whether it is sensible and appropriate.
  • Check whether data contamination could be a problem (a rough sketch of one such check, which also looks at the test data, is given after this list).
  • Use open-source models (much easier to replicate).
  • Carefully execute and report experiments; check for bugs.
  • Run each experiment once; don’t run it many times and report just the best result.
  • If other researchers have questions about your published papers, respond to them; don’t ignore them. If you discover that a published paper has mistakes, correct or retract it.
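
To make the test-data and contamination items above a bit more concrete, here is a minimal sketch of the kind of quick sanity check I have in mind. The file names (test.jsonl, train.txt), the “input” field, and the 8-gram threshold are just illustrative placeholders, and verbatim n-gram overlap only catches the crudest form of contamination, so treat a clean result as “not obviously contaminated” rather than “clean”.

```python
# Minimal sketch of two quick sanity checks on an evaluation set.
# File names, the "input" field, and the 8-gram threshold are
# illustrative placeholders, not a prescription.
import json
from collections import Counter


def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Load a hypothetical JSONL test set with one "input" field per example.
with open("test.jsonl") as f:
    test_inputs = [json.loads(line)["input"] for line in f]

# Check 1: look at the data -- duplicates and suspiciously short items.
counts = Counter(test_inputs)
duplicates = [text for text, c in counts.items() if c > 1]
very_short = [text for text in test_inputs if len(text.split()) < 3]
print(f"{len(duplicates)} duplicated test inputs, {len(very_short)} very short ones")

# Check 2: crude contamination check -- long verbatim n-gram overlap between
# the test inputs and whatever training/fine-tuning corpus you control.
train_ngrams = set()
with open("train.txt") as f:
    for line in f:
        train_ngrams |= ngrams(line)

flagged = [text for text in test_inputs if ngrams(text) & train_ngrams]
print(f"{len(flagged)} of {len(test_inputs)} test inputs share an 8-gram "
      "with the training corpus; inspect these by hand")
```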

The above steps are not rocket science, and should be doable by most NLP researchers. Again, don’t feel obliged to do all of the above! Just pick a few (even just one) which make sense in the context of your work.

If everyone does this, then evaluation quality in NLP will start to improve. The problem won’t be solved overnight, but at least things will be moving in the right direction.

Final comment

I feel strongly about this topic, so feel free to contact me if you have questions or comments! I’m also happy to continue giving seminars about it.

4 thoughts on “Make your next experiment a bit better”

  1. I totally agree with your rant. I know scientists who would totally agree with you, yet when they sit on a recruitment panel they don’t realise that their behaviour is part of the problem. When applicants are shortlisted for interview, the numbers are what get looked at: how many publications at top venues? The PhD student who does 9 quick and dirty papers will be selected over the one who did two very thorough pieces of work. Members of the panel do not thoroughly review a paper (as far as I have seen), and usually they are not in the same research specialism. It becomes a numbers game. Papers accepted at top conferences “look” good. I have seen senior professors on recruitment panels being very impressed by young applicants with an exponentially rising trajectory on Google Scholar.

    Universities would be reluctant to employ someone who is not strong at playing the numbers game, because the researcher who is “careful and thorough” will likely only obtain a small number of small grants in a career. Those researchers who start out “careful and thorough” are likely to be forced to play the numbers game, at least a little.

    The same type of selection (based on numbers in track record) is at play for funding applications, promotion applications, admission of PhD students…

    When the system selects for researchers who do “quick and dirty” work, can we be surprised if we see a proliferation of such work?

    The “careful and thorough” people are being weeded out, or forced to adapt.

    Perhaps the whole sector is suffering from a “scaling problem”. As the population of researchers increases, this is likely to get worse. The selective pressure increases. Recruitment panels have more applicants to sort through; conferences and funding bodies drown under thousands of submissions and have to increase the load on reviewers, who have to make decisions faster, and to expand the pool of reviewers, who might be inexperienced. Selective pressure leads to increased adaptation and reduced diversity. Everybody has to play the same game.

    Clearly good research is still getting done, because we see a lot of undeniable evidence of advancement in our fields. I feel that the “numbers game pressure” is just a drag on everything. Everybody is needlessly losing some of their productive time because of it. That comes through more reviewing, writing more grant applications (to be competitive) or being pressurised to do more quick and dirty work (to be competitive), in addition to the real work.

    AI is likely to impact both sides of this. It will help by automatically reviewing papers with criteria as strict as you like. It will harm by automatically generating papers that play the metrics better than any human can.


    1. I understand your frustration! Unfortunately, a lack of interest in rigorous experimentation, and an emphasis on quantity over quality, are part of the research culture of AI, and research culture is hard to change. This is why I focused on incremental rather than radical improvements in this blog post.

      I think one thing that would help is having truly selective venues that take experimental rigour seriously. In NLP, the “prestige” venues are the xACL conferences, which publish over 3K papers annually. Not very selective, and based on a reviewing process which is not very rigorous (probably inevitable given the numbers). But suppose we had a venue which published only 100 superb papers annually after a very rigorous reviewing process, with post-publication monitoring (ie, if concerns were raised about a paper and the authors did not address them, the paper would be retracted). If a small number of papers published there were considered better than a larger number of xACL papers (by hiring boards, etc), this would reward quality over quantity. This is basically what happens in most fields of science, including medicine; it is a shame it doesn’t happen in AI.

