Along with Craig Thomson (research fellow), Anya Belz (PI) and many collaborators and partners, I am working on a project, ReproHum, which is looking at the reproducibility of human evaluations in NLP. Two depressing findings of this project are that (A) most authors are unable or unwilling to provide detailed information about their experiments and (B) all of the experiments we looked at closely had execution flaws. This is described in our paper “Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP”, which will be presented at the EACL Negative Results workshop on Friday 5 May.
As explained in the paper, we identified 116 ACL and TACL papers which we could potentially try to reproduce. In every case the paper itself contained insufficient information to allow other researchers to recreate the experiment under similar conditions, so we contacted the corresponding authors and requested the additional information we needed. Authors responded in some fashion for only 45 (39%) of these papers, and we were able to get the information we needed for only 15 (13%) of them. For the remaining 71 (61%) papers, we got no response whatsoever to our request.
I find this to be shocking. A fundamental principle of science is that scientists make progress by “standing on the shoulders of giants” (in Newton’s words), ie by building on work done by previous scientists. A successful scientific contribution is not a stand-alone finding, but a building block which other scientists can build on. But it’s much easier for scientist A to build on scientist B’s work if scientist B is willing to answer questions about their work! If scientist B ignores such requests, then their work is much less useful in the collective scientific endeavour. High-quality medical journals, for example, provide discussion forums for papers they publish, and expect authors to respond in a timely fashion to questions and comments.
In short, if we are serious about doing science, then we need to answer questions about papers even after they are published. Unfortunately, most NLP authors refuse to do this; I’ve seen this in other contexts as well as in ReproHum, and other researchers have told me that they have had similar experiences (eg, see Section 5.2 of Arvan et al 2022). I suspect many NLP authors primarily think of papers as CV enhancers instead of scientific contributions, which says something pretty depressing about our scientific culture.
Anyways, going back to our paper, we eventually chose 6 experiments to replicate. Unfortunately, *every one* of these experiments turned out to have flaws (some of them serious) in experimental execution, analysis, or reporting (the paper says 1 of the 6 experiments was OK, but one of our partners recently discovered that it is flawed as well). The flaws are described in Appendix A.6 of the paper; they include code bugs, results in papers that disagree with the underlying experimental data, and randomisation errors.
This again is depressing. Remember that these 6 experiments came from the small number of authors who were willing to discuss their work and give us detailed information; it seems plausible that such authors care more about their research than the authors who gave us no information about their experiments. These papers were also all published in ACL or TACL, which are among the best NLP venues. But still, *all* of these experiments had flaws. From the “building block” perspective mentioned above, this means that a lot of NLP research leads to flawed building blocks.
A related point is that it is very rare for NLP authors to correct published papers if flaws are found in them. I asked both TACL and ACL Anthology for data on corrected papers (errata). I have not yet gotten a response from ACL Anthology, but TACL told me that in 2013-2022, they had *one* correction (erratum) about experimental results in a published paper (Warstadt et al 2020). If our experiences are at all representative, there are a lot more published TACL papers that should be corrected! I am grateful to Warstadt et al for correcting their paper, and disappointed that no other TACL authors have done so.
Again this relates to science as an accumulation of building blocks; if we discover a mistake in our work, it is essential to acknowledge and fix the error if we expect other scientists to build on our results. In medicine this is taken very seriously, but apparently not in NLP.
We need to do better science!
A fundamental principle of experimental science is that experiments must be done carefully: well designed, meticulously executed, and rigorously reported. Experimenters also need to answer questions and issue corrections after a paper is published. This is true of medicine, biology, and physics, and it is also true of NLP. Unfortunately, many NLP researchers do not do careful experiments (and ignore their papers after publication), and to me this suggests that there is a problem with our scientific culture.
I expect that “engineering-oriented” NLP is going to mostly be done in companies in the future, which means that academic researchers will “add value” by doing scientific research on NLP. But we aren’t going to add much value if we keep on doing sloppy experiments!
In short, if we claim to be scientists, then we need to care about doing rigorous science! Rigorous experiments take longer to do, and acknowledging and correcting errors in published papers is not pleasant, but these are essential to being good scientists.