Systematic Reviews in NLP
Systematic literature reviews are a powerful and useful methodology for investigating many research questions. I give a high-level overview for NLP researchers who are not familiar with this technique.
Our latest paper from the ReproHum project discusses experimental flaws we have encountered while reproducing earlier experiments, including code bugs, UI problems, inappropriate exclusion of data, reporting errors, and ethical lapses. Pretty depressing. These types of errors are not detected by usual NLP reviewing practices, so I suspect they may be pretty common…
One very positive aspect of 2023 for me was that I saw lots of really interesting research papers, many more than in previous years. Perhaps the emergence of LLMs has encouraged some people to move away from scientifically dubious leaderboard chasing and towards more interesting research on scientific fundamentals? I describe a few of these papers here.
At its best, peer review can significantly improve the quality of papers; it's not just an accept/reject gate. I describe a few examples where peer review has led to major improvements in the quality of my papers.
This year I was both a TACL Action Editor and an ACL Senior Area Chair. This experience has reinforced my belief that the journal review process is better than conference reviewing!
Many problems in NLP papers can *not* be detected by reviewers who are checking submissions to conferences and journals. In medicine and many other fields of science, people can raise concerns about papers *after* they are published, and authors are expected to take these concerns seriously. This is not the practice in NLP, which is a shame.
In our ReproHum project, we have found that many NLP experiments are flawed, and many authors do not respond to requests for more information about their work. This is depressing and hinders scientific progress in NLP.
I don't like leaderboards, which encourage academics to write papers about small improvements on established tasks and datasets. I suspect (and hope) that ChatGPT and similar systems will encourage people to move away from leaderboards. If so, this would be great!
Is fraud (e.g. fabricating or falsifying data) a problem in NLP? It certainly is a problem in other scientific areas, and it wouldn't surprise me if it affected NLP as well.
Since commercial researchers dominate the “hot” area of large language models, I’ve seen a number of people ask “what should academic researchers focus on?”. There are of course huge numbers of exciting and valuable scientific research questions which are not of much commercial interest, including long-term work which won't pay off commercially for 10+ years, high-quality evaluation, socially useful but low-profit applications, and using NLP to research fundamental cognitive science questions.