In a recent blog, I discussed our findings that a lot of NLP authors refuse to answer questions about their papers, and a lot of NLP papers rely on flawed (poorly executed) experiments. A colleague commented that these types of problems cannot be picked up by reviewers looking at conference and journal submissions. Unfortunately, she is correct. A reviewer cannot assess whether an author is willing to respond to questions. A reviewer might be able to comment on experimental *design* flaws, but is not going to be able to comment on experimental *execution* flaws.
For example, we’ve seen errors where bugs in Python code mean that the wrong texts are evaluated, and cases where the numbers reported in the paper do not match the actual experimental results. We cannot realistically ask our reviewers to check Python code for bugs, or to cross-check all of the numbers in a paper against the experimental data files!
So what do we do? We have serious problems which affect the scientific validity and utility of NLP papers, which cannot be detected by our reviewing processes.
The approach that fields such as medicine take is to link discussion forums to published papers, where readers can ask questions and otherwise raise concerns *after* a paper is published. In medicine these forums are monitored by editorial staff, who may withdraw papers if authors fail to respond to questions and/or readers raise serious concerns about validity which authors cannot resolve. Having a paper withdrawn for these reasons can have an impact on an author’s career.
I think there is a useful analogy here with commercial software development. Of course software is tested before it is released, but software houses nonetheless expect bugs to be found *after* a product is released. They have formal processes for this, which allow users to raise bug reports, and patches to be issued which fix these bugs. Good medical journals likewise have a formal process for allowing readers to raise concerns and then asking authors to issue corrections (i.e., patches) to papers.
In NLP, none of the publication venues I am aware of provide such discussion forums, or in any way support post-publication monitoring of papers. Perhaps this is because discussion forums would need to be monitored, both to filter out spam and inappropriate comments and to detect cases where a paper may need to be withdrawn. Doing this for a large xACL conference is unthinkable. I guess in theory journals such as TACL and CL could do this, but it would not be easy for them, and I suspect they do not see much demand in the community for this.
Discussions are possible using OpenReview. However, although OpenReview is used by ACL Rolling Review (ARR), I don’t think ARR enables post-publication discussions of papers on OpenReview. Similarly, many NLP papers have associated GitHub sites, which in principle could support discussion by readers raising GitHub issues about the research paper. However, in practice GitHub issues are rarely used for general comments about research validity; they mostly focus on software issues.
Improve Status Quo?
I guess the other option is to make the current “informal” system work better. I.e., expect authors to respond to questions and concerns from other researchers, and expect papers to be formally updated and corrected if problems are found (ACL Anthology, TACL, CL, etc, allow papers to be corrected after publication).
The problem is that this isn’t working; most authors don’t respond to questions, and formal corrections to published papers for research flaws (as opposed to correcting authorship, for example) are rare. Furthermore, I suspect the situation is getting worse; certainly in my personal experience, authors were much more likely to respond to questions in 2010 than in 2023.
I suspect that a lot of research published in “prestige” NLP venues is flawed, and that most authors resist rather than support attempts to check published work for experimental errors. What most disappoints me is that there is so little awareness and discussion of this issue in the community. If we value high-quality scientific research, this needs to change!