Last week I gave a lecture on quality assurance (QA) to my MSc class, mostly focusing on QA in commercial contexts. I explained that if a company spends 6 person-months developing a new AI product, model, or feature, it may well spend an additional 3 person-months on testing and quality assurance before releasing it. I also emphasised that QA/testing is a well-structured and planned process in the commercial world.
I also made a few comments on quality problems in academic research, eg inappropriate data sets. So one of the students asked me what the quality assurance process was for academic papers/research, and I said it was primarily based on peer review from academic reviewers. Which of course is a much lighter-weight process than commercial QA! If I spend 6 months developing a commercial AI product, someone else will spend 3 months testing it in a very structured way. Whereas if I spend 6 months doing academic research and writing a paper, peer reviewers will spend perhaps 3 *hours* (in total) checking my work, in a somewhat ad-hoc manner.
This bothers me. Of course real-world products must be held to higher quality standards than academic papers, but still… three *hours* (for checking academic work) vs three *months* (for checking commercial work)?? Are there other quality assurance techniques we can use to check academic work, beyond peer review by unpaid volunteers?
A shocking example of a research paper with serious quality defects having real-world impact was described in a recent Guardian article. The paper was in economics (not AI) and claimed that countries with a debt-to-GDP ratio above 90% suffered from very poor growth. This influenced and encouraged austerity policies in the UK and elsewhere, which cut public services in order to reduce debt. However, the paper relied on a flawed data analysis in which 5 rows of data were accidentally omitted from a spreadsheet; when this flaw is fixed, countries with high debt-to-GDP ratios no longer show very poor growth.
In other words, an error in data analysis which was not detected by reviewers (who presumably read the paper but didn't try to reproduce the analyses) encouraged controversial policies in the UK, which many people think increased poverty.
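To make this failure mode concrete, here is a minimal Python sketch of how silently dropping rows before averaging can flip a conclusion. The numbers are invented purely for illustration; they are not taken from the actual paper or its dataset.

```python
# Hypothetical growth rates (%) for ten high-debt countries.
# The numbers below are invented for illustration only.
growth_rates = [2.6, 2.1, 3.0, 2.4, 2.8,   # five rows a bad slice silently drops
                -0.3, 0.1, 0.5, -0.1, 0.2]

# Flawed analysis: a slicing mistake excludes the first five rows,
# analogous to a spreadsheet range that stops short of the full data.
flawed_rows = growth_rates[5:]
flawed_mean = sum(flawed_rows) / len(flawed_rows)

# Correct analysis: average over all rows.
correct_mean = sum(growth_rates) / len(growth_rates)

print(f"flawed mean growth:  {flawed_mean:.2f}%")   # near-zero: looks like stagnation
print(f"correct mean growth: {correct_mean:.2f}%")  # clearly positive growth
```

Nothing in the published mean itself signals that rows were dropped, which is why such errors are essentially invisible to a reviewer who only reads the paper; re-running the analysis on the full data is what exposes them.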
Idea: More thorough review?
There has been a lot of discussion about peer review, but almost all of it has assumed that peer review must be done cheaply, by unpaid academic volunteers. Could peer review become a better quality assurance process if more time and effort was put into it?
Many years ago I published a paper in the British Medical Journal, and I was impressed that they asked a paid staff member to review the statistical analysis in the paper, rather than relying purely on unpaid academic reviewers. I don't know how thoroughly the reviewer checked the stats, but in principle this seems to me to be the best way to detect problems such as the missing spreadsheet rows mentioned above. Similarly we could use paid reviewers with relevant skills to check experimental designs, whether enough information is provided for replicability, and clarity/readability. In this model, unpaid academic reviewers would mainly assess significance, knowledge of related work, and other attributes which require deep knowledge of a specialist field.
Of course paying reviewers requires money (see below). But I think this model would lead to a peer review process which did a substantially better job of assessing the quality of a paper.
Another way of improving quality is to get someone else (not the authors) to reproduce the work described in a paper. This is considerably more work than even a beefed-up review process (as above), but it is definitely a good way to detect quality flaws in research. I’ve seen a lot of replications because of my involvement in ReproGen and ReproHum, and these replications have identified all sorts of quality problems, ranging from buggy code to flawed experimental design/execution to incorrect data analyses. Most of these problems were in the details of the code, experimental design, experimental execution, or data analysis, and as such could not be detected simply by reading the relevant papers (which did not provide such details because of space limitations).
Of course replication is not always possible: some experiments are too expensive (eg, millions in hardware costs) or not feasible for generic NLP researchers. An example of the latter is Knoll et al (2022), which evaluated a note-summarisation system in clinical usage. Replicating this is very hard since it requires access to proprietary software and IT systems, and also access to doctors who are willing to use the software with real patients.
But if replication is possible (which it usually is), then the above-mentioned experiences have convinced me that it is an excellent way to detect quality problems in academic research!
The above ideas all require substantially more time and effort than the current peer review process. When I discuss this with colleagues, the first question I am usually asked is who will do the work and whether they will get paid. Currently, peer review in NLP is done by unpaid volunteers, and the system is already creaking because of the explosion of conferences. Over 10,000 papers are now submitted to xACL conferences each year; even if we assume only 2 hours of peer review per paper, this still requires over 20,000 hours of unpaid peer review. The unpaid volunteer system is struggling to cope with this; it certainly could not cope if we switched to a more substantial QA process which required 20 hours (instead of 2 hours) per paper.
Of course, the ACL community is unusual in not having very selective publication venues. As I discussed in an earlier blog, in 2019 25% of the papers in the ACL Anthology came from the big xACL conferences (I suspect the proportion is higher in 2022). In contrast, in medicine, only 0.25% of published papers appear in top-rank venues. This makes it possible for the top venues (which are all journals) to have better quality assurance processes, some of which involve paid staff instead of academic volunteers. If the NLP community likewise had a highly selective journal which published only 0.25% of the papers in the field, it would be more feasible to introduce high-quality QA procedures for this journal.
Alternatively, we could have an independent organisation (not restricted to one venue) which assessed selected papers from a quality perspective. Papers could be selected by authors (who want to show that they are doing high-quality work) or by readers (who want assurance that they can trust a paper).
Even if the scope of high-quality QA is limited to a small number of papers, we will still need resources for this. In many areas of science, researchers routinely pay thousands of pounds or dollars to get their work published in high-quality open-access venues. Institutions and funders (at least in rich countries) seem happy with this model, and perhaps it could be used to pay for quality assurance as well. Of course we do need to ensure that researchers from less well-off countries are not locked out of the system because of financial issues. I see that open-access publishers give discounts or waive fees for such researchers; perhaps the same model could be used here.
I think the NLP and AI fields would really benefit from better quality assurance procedures, both to identify flawed research and also to motivate authors to take quality issues more seriously when they do the original research. There are many ways of achieving this; I’ve just listed a few above. All of them require a lot more resources and effort than the current system, which is certainly a challenge, but I don't think it's an insurmountable one.