The process of selecting reviewers for ACL conferences has changed beyond recognition over the past 30 years. When I published my first ACL paper in 1990, all reviewing was done by 11 very senior people. Of course ACL was a much smaller event in 1990, with 39 papers accepted, but even so my guess (I don’t have exact data) is that each of these senior researchers was expected to review dozens of papers. In any case, although no reviewing process is perfect, people who submitted to ACL 1990 knew that their papers would be reviewed by top researchers in the field.
Moving forward to 2006: in that year I both published a paper at EACL and served as an EACL area chair. Reviewing had moved to a two-stage model, in which 10 area chairs were each in charge of a topic (I was area chair for generation), and each area chair recruited reviewers for their topic (e.g., I recruited reviewers for NLG papers). There were a total of 240 reviewers, and the area chairs and programme chairs got together for a physical meeting in Brighton, where we discussed borderline papers as a group. Reviewing was not up to the standard of ACL 1990, but I was still satisfied (both as an author and as an area chair) that it was of a good standard. By this stage, however, I was also becoming convinced that EACL/ACL was not the best venue for really important papers; such papers should be published in journals, because of the better reviewing process.
Moving forward to 2020, my involvement in ACL conferences is much reduced, both as an author and as a reviewer (I am even more convinced that good work should be published in journals). But I must say that I don’t like what I see. ACL reviewing is now done with a three-stage model, in which dozens of senior area chairs supervise hundreds of area chairs, who in turn supervise thousands of reviewers. Reviewers are no longer selected by (senior) area chairs, as they were in 2006; instead, anyone who submits a paper to ACL can be asked to serve as a reviewer. I’ve not looked at the papers accepted by ACL 2020, but I have certainly been unimpressed by papers presented at recent ACL conferences which used a similar model. A lot of papers are scientifically dubious (I wrote a blog post about one of these; there are many more), and even more are scientifically uninteresting. I’ve heard many other people make similar complaints, so I don’t think this is just me being an “old fogey” harking back to the “good old days”.
Perhaps we need properly selective publication venues
If you compare NLP to other scientific fields, one of the really odd things is the lack of truly selective publication venues. In medicine, almost one million citations were added to Medline in 2019. However, the top rank of medical journals (New England Journal of Medicine, Lancet, British Medical Journal, Annals of Internal Medicine, Journal of the American Medical Association, Nature) publish just a few thousand research articles a year (I’m counting only full research articles here, not opinion pieces, letters, etc.). In other words, in medicine something like 0.25% of published papers appear as full research articles in top-ranked venues.
In NLP, I would guess that most people consider the ACL, EACL, NAACL, and EMNLP conferences, along with the TACL and CL journals, to be the most prestigious venues. In 2019, approximately 1250 full papers were published in these venues (not counting short papers and demos). The ACL Anthology as a whole contains a little under 5000 papers for 2019 (I believe a similar number were submitted to the Computation and Language section of arXiv). So in the ACL world, roughly 25% of papers are published in our “prestige” venues. Quite a contrast with the 0.25% figure in medicine!
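The back-of-envelope arithmetic behind these two percentages can be made explicit. A minimal sketch, using the rough counts quoted above (these are the estimates from the text, not exact data):

```python
# Back-of-envelope check of the selectivity figures discussed above.
# All counts are rough estimates quoted in the text, not exact data.

medline_citations_2019 = 1_000_000   # "almost one million" Medline citations
top_medical_articles = 2_500         # "a few thousand" full research articles
medical_selectivity = top_medical_articles / medline_citations_2019

acl_prestige_papers_2019 = 1_250     # ACL/EACL/NAACL/EMNLP + TACL/CL full papers
acl_anthology_papers_2019 = 5_000    # "a little under 5000" Anthology papers
nlp_selectivity = acl_prestige_papers_2019 / acl_anthology_papers_2019

print(f"Medicine: {medical_selectivity:.2%} of papers in top venues")      # 0.25%
print(f"NLP:      {nlp_selectivity:.0%} of papers in prestige venues")     # 25%
```

So the NLP “prestige” tier is roughly 100 times less selective than medicine’s, on these estimates.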
From a reviewing perspective, if we had truly selective prestige venues, we could focus reviewing resources on them, so that their papers went through a rock-solid reviewing process with top experts in the field. The big conferences could then become more like conferences in other areas of science, where a light-touch reviewing process is used to filter out papers that are off-topic, incomprehensible, or completely bogus, but no attempt is made to distinguish “high quality” from “unimpressive” research. I think this model is much more sensible given the number of NLP papers published in recent years. The field is **much** bigger in 2020 than it was in 1990, and practices that worked in 1990 do not scale up to the much bigger field of 2020.
In other words, it simply is not possible to do consistent, high-quality reviewing of 25% of all papers published in our field; maybe we could do this in 1990, when NLP was much smaller, but we cannot do it in 2020. Of course many people are working very hard and doing their best as reviewers, area chairs, and program chairs to make the current model work, and I appreciate their efforts! But in all honesty I think it would be better if we switched to a different publication and reviewing model, taking inspiration from what other scientific fields do.
We already have venues with excellent reviewing
What is especially frustrating to me is that the NLP field already has venues which publish relatively small numbers of papers after a high-quality reviewing process, namely the journals Computational Linguistics and Transactions of the ACL. In a rational world, these would be high-prestige venues for our best papers. Certainly at a personal level, I still assume that papers published in CL and TACL are probably good research contributions. I make no such assumption about ACL (etc) papers; some are excellent, but there is also a *lot* of dross.
I remember a discouraging conversation I had a few years ago with a PhD student (not my student), who told me that he was going to submit a paper to the ACL conference. I suggested he consider TACL instead, telling him that TACL’s better reviewing would really improve his paper. The student replied that he would submit to the ACL conference because (A) it was easier to get into, (B) it was less work (no need to waste time revising a paper once it’s accepted and on his CV), and (C) ACL conference publications were more prestigious than TACL journal papers in his home country. As is often the case, students are simply responding in a logical way to the incentives they face; the problem is the incentive structure, not the student.
There isn’t much I can do about reviewing practices in NLP, but at a personal level I’m pretty close to giving up on reviewing for ACL, EMNLP, etc. To be completely honest, I suspect that people who submit to these conferences are not looking for high-quality reviewing (if they were, they would submit to CL or TACL), so why should I spend lots of my time reviewing their papers?