Reviewing has changed over the years; conferences need to change as well

The process of selecting reviewers for ACL conferences has changed beyond recognition over the past 30 years. When I published my first ACL paper in 1990, all reviewing was done by 11 very senior people. Of course ACL was a much smaller event in 1990, with 39 papers accepted, but even so my guess (I don’t have exact data) is that each of these senior researchers was expected to review dozens of papers. Anyway, although no reviewing process is perfect, people who submitted to ACL 1990 knew that their papers would get reviewed by top researchers in the field.

Moving forward to 2006, in that year I both published a paper at EACL and served as an EACL area chair. Reviewing had moved to a two-stage model, where 10 area chairs were each in charge of a topic (I was area chair for generation), and each area chair recruited reviewers for his topic (e.g., I recruited reviewers for NLG papers). There were a total of 240 reviewers, and the area chairs and programme chairs got together for a physical meeting in Brighton, where we discussed borderline papers as a group. Reviewing was not up to the standard of ACL 1990, but I was still satisfied (both as author and as area chair) that it was of a good standard. By this stage, though, I was also becoming convinced that EACL/ACL was not the best venue for really important papers; such papers should be published in journals because of the better reviewing process.

Moving forward to 2020, my involvement in ACL conferences is much reduced, both as an author and as a reviewer (I am even more convinced that good work should be published in journals). But I must say that I don’t like what I see. ACL reviewing is now done with a three-stage model, where dozens of senior area chairs supervise hundreds of area chairs, who in turn supervise thousands of reviewers. Reviewers are not selected by (senior) area chairs (as they were in 2006); instead, everyone who submits a paper to ACL can be asked to be a reviewer. I’ve not looked at the papers accepted by ACL 2020, but I’ve certainly been unimpressed by papers presented at recent ACL conferences which used a similar model. A lot of papers are scientifically dubious (I wrote a blog post about one of these, and there are many more), and even more are scientifically uninteresting. I’ve heard many other people make similar complaints, so I don’t think this is just me being an “old fogey” who is harking back to the “good old days”.

Perhaps we need properly selective publication venues

If you compare NLP to other scientific fields, one of the really odd things is the lack of truly selective publication venues. In medicine, almost one million citations were added to Medline in 2019. However, the top rank of medical journals (New England Journal of Medicine, Lancet, British Medical Journal, Annals of Internal Medicine, Journal of the American Medical Association, Nature) publish only a few thousand research articles a year (I’m counting only full research articles here, not opinion pieces, letters, etc.). In other words, in medicine something like 0.25% of published papers appear as full research articles in top-ranked venues.

In NLP, I guess most people consider the ACL, EACL, NAACL, and EMNLP conferences, along with the TACL and CL journals, to be the most prestigious venues. In 2019, approximately 1250 full papers were published in these venues (not counting short papers and demos). The ACL Anthology as a whole contains a bit less than 5000 papers for 2019 (I believe a similar number were submitted to the Computation and Language section of arXiv). So in the ACL world, about 25% of papers are published in our “prestige” venues. Quite a contrast from the 0.25% figure in medicine!
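For anyone who wants to check the arithmetic behind these two percentages, here is a minimal sketch. The NLP numbers are the rounded figures quoted above; the medical numerator of 2,500 is an assumption, picked only because it is consistent with "a few thousand" articles and the 0.25% figure, and the function name is made up for illustration.

```python
def share_in_top_venues(top_venue_papers: int, all_papers: int) -> float:
    """Return the percentage of all papers that appear in top venues."""
    if all_papers <= 0:
        raise ValueError("all_papers must be positive")
    return 100.0 * top_venue_papers / all_papers


# Medicine: "a few thousand" full articles out of almost one million
# Medline citations. 2_500 is an ASSUMED value consistent with the
# quoted 0.25%, not an exact count.
medicine_share = share_in_top_venues(2_500, 1_000_000)

# NLP: roughly 1250 full papers in the prestige venues out of a bit
# less than 5000 ACL Anthology papers for 2019 (rounded figures).
nlp_share = share_in_top_venues(1_250, 5_000)

print(f"medicine: {medicine_share:.2f}%")
print(f"NLP: {nlp_share:.0f}%")
print(f"ratio: {nlp_share / medicine_share:.0f}x")
```

On these rounded inputs, the share of NLP papers appearing in prestige venues comes out about a hundred times the medical share.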

From a reviewing perspective, if we had truly selective prestige venues, we could focus reviewing resources on these venues, so that their papers went through a rock-solid reviewing process with top experts in the field.   The big conferences could then become more like conferences in other areas of science, where a light-touch reviewing process is used to filter out papers that are off-topic, incomprehensible, or completely bogus, but no attempt is made to distinguish  “high quality” from “unimpressive” research.   I think this model is much more sensible considering the number of NLP papers published in recent years.  The field is **much** bigger in 2020 than it was in 1990, and practices that worked in 1990 do not scale up to the much bigger field in 2020.

In other words, it simply is not possible to do consistent high-quality reviewing on 25% of all papers published in our field; maybe we could do this in 1990 when NLP was much smaller, but we cannot do this in 2020.  Of course many people are working very hard and doing their best as reviewers, area chairs, and program chairs to make the current model work, and I appreciate their efforts!  But in all honesty I think it would be better if we switched to a different publication and reviewing model, taking inspiration from what other scientific fields do.

We already have venues with excellent reviewing

What is especially frustrating to me is that the NLP field already has venues which publish relatively small numbers of papers after a high-quality reviewing process, namely the journals Computational Linguistics and Transactions of the ACL.  In a rational world, these would be high-prestige venues for our best papers.   Certainly at a personal level, I still assume that papers published in CL and TACL are probably good research contributions.  I make no such assumption about ACL (etc) papers; some are excellent, but there is also a *lot* of dross.

I remember a discouraging conversation I had a few years ago with a PhD student (not my student), who told me that he was going to submit a paper to the ACL conference. I suggested he consider TACL instead, telling him that TACL’s better reviewing would really improve his paper. The student replied that he would submit to the ACL conference because (A) it was easier to get into, (B) it was less work (no need to waste time revising a paper once it’s accepted and on his CV), and (C) ACL conference publications were more prestigious than TACL journal papers in his home country. As in other cases, students are simply responding in a logical way to the incentives they face; the problem is the incentive structure, not the student.

Final Thoughts

There isn’t much I can do about reviewing practices in NLP, but at a personal level I’m pretty close to giving up on reviewing for ACL, EMNLP, etc. To be completely honest, I suspect that people who submit to these conferences are not looking for high-quality reviewing (if they were, they would submit to CL or TACL), so why should I spend lots of my time reviewing their papers?

3 thoughts on “Reviewing has changed over the years; conferences need to change as well”

  1. I agree with the main points of your post (including the observations about CL and TACL and the need to empower them more). Just a few comments about the comparison to medicine/MedLine and having a really selective elite:

    – I don’t think MedLine is directly comparable to the ACL Anthology because although the latter is growing, it has much less coverage than MedLine and is still far from being a comprehensive resource. For example, according to a Google search there were more than 60 Spanish journals (mostly publishing articles in Spanish) indexed in MedLine. I don’t think the ACL Anthology has any content in Spanish at the moment, and there are various Spanish venues including conferences, workshops and at least one journal that I know of.

    – According to conference rankings (at least the ones we use in Spain), ACL > EMNLP > NAACL > EACL. So if we have to choose a “super elite” of NLP conferences analogous to Nature, The Lancet, etc., it would probably be ACL alone (the others could be analogous to the first quartile of JCR journal impact factors, I guess: venues considered very good, but not the absolute best). I know many people in the community claim that the four are equal, but this is because many people in the community don’t care much about rankings or conference prestige in the first place…

    – Even with these nuances, obviously the “elite” of NLP conferences probably publishes much more than 0.25% of the papers. But does having such a selective (0.25%) set of elite venues really have an impact on most people? I suppose not even 10% of medical researchers dare submit to BMJ, The Lancet, etc., let alone get accepted. A somewhat larger elite could be more inclusive.

    – This is compounded by the fact that elite journals are often criticised for allegedly sacrificing scientific rigor to impact, see for example this piece by Richard Sproat in CL: http://mitpdev.mit.edu/system/cogfiles/journalpdfs/coli_a_00011.pdf


  2. Many thanks for your comments, and you are absolutely right that MedLine has better coverage of medical research than ACL Anthology does of NLP research.

    About publishing in BMJ (etc), I co-authored a paper in BMJ many years ago (http://www.bmj.com/cgi/content/full/322/7299/1396), and this paper was the result of several years of work by a research team. I think this is pretty typical of papers in top-grade medical journals, and the medical researchers I worked with were very happy that their years of work resulted in a BMJ paper. I suspect expectations about the amount of work behind an ACL paper are very different; certainly many papers I see in ACL seem to be the result of person-months of work, not person-years.

    If expectations changed and papers in top NLP venues usually represented person-years of effort, with other venues being used to publish the results of smaller studies, then the top venues would publish many fewer papers, and I suspect would be more similar to top medical journals.

