I recently discovered that a paper at an INLG conference had also been presented at another ACL-affiliated event, and was listed twice in the ACL Anthology. This is inappropriate and goes against ACL rules, which essentially say that archival publications must contain novel/original research content. However, I also realised that the rules on “double publication” may not be clear to many people, especially if they are new to NLP and AI. Below I try to explain (my understanding of) the rules on this.
Principles of Double Publication
According to Wikipedia:
Multiple submission is not plagiarism, but it is today often viewed as academic misbehavior because it can skew meta-analyses and review articles and can distort citation indexes and citation impact by gaming the system to a degree. It was not always looked upon as harshly, as it began centuries ago and, besides the negative motive of vanity which has always been possible, it also had a legitimate motive in reaching readerships of various journals and books that were at real risk of not otherwise overlapping.
Another complaint is that duplicate publication wastes the time of editors and reviewers.
The point about reaching a wider audience may have some validity. Years ago I submitted a paper to CHI about some text-vs-graphics findings which I had already published in INLG, but which I thought would be of interest to the CHI community (which doesn't tend to read INLG papers). I was open about this, and CHI told me they couldn't accept my paper because it had been published elsewhere. Fair enough, but I still think my work would have been of interest to some CHI researchers!
A longer presentation of principles of “overlapping publications” is available from ICMJE (focusing on medical research, but the principles are universal).
What is an archival publication?
Double-publication rules apply to “archival” publications, but what exactly does this mean?
In most fields of science, journal publications are “archival”, and everything else (including conference papers) is “non-archival”. Ie, you can present results multiple times at conferences and workshops, but you can only present them once in a journal paper. This is a clear and simple rule, but it is not the one used in NLP, AI, and CS.
In the ACL world, an “archival” publication is basically one that appears in the ACL Anthology. This includes
- Journal papers, eg Computational Linguistics and Transactions of ACL
- Conference and workshop papers, unless these were submitted to a “non-archival” track (see below)
arXiv papers, by the way, are always “non-archival”, so a paper can appear both on arXiv and in a conference proceedings or journal. However, ACL does impose some constraints on submitting a paper already posted on arXiv to ACL events, in order to protect blind reviewing.
Many ACL workshops have “non-archival” tracks which explicitly allow submissions of papers that are being published elsewhere; such papers do not appear in the workshop’s ACL Anthology proceedings. To give an example, I am helping to organise a workshop on NLG and Healthcare. Last week, someone told me that she is working in this area and would love to come to the workshop and present and discuss her research. However, she is submitting her work to a larger conference, so she cannot submit a paper which will appear in our workshop’s official proceedings. We (workshop organisers) are keen to encourage such people to participate in our workshop, so we set up a “non-archival” track; researchers accepted to this track can present their research at the workshop, but their paper will not appear in our workshop’s ACL Anthology site.
I think non-archival tracks are a great idea since they broaden discussions and participation at workshops, and I’m going to suggest such tracks for all future workshops I am involved in! They can be confusing, though. In the example (mentioned at the beginning of this blog) of the double-presented INLG paper, the second presentation was at a workshop which allowed both archival and non-archival submissions. I suspect that the author (who is new to the field) got confused and didn’t fully understand what was expected for each type of submission. In fact he has now agreed that his workshop paper should be considered non-archival and removed from the workshop’s ACL Anthology proceedings (but remain in the INLG section of the ACL Anthology).
What is also confusing is that I am now seeing non-archival duplicate publications appear in proceedings of major conferences. In particular, IJCAI has an explicit track for “Sister Conferences Best Papers”, which have all been published elsewhere, and whose “extended abstracts” are listed in the IJCAI proceedings. Thus, for example, the paper Beyond Accuracy: Behavioral Testing of NLP Models with CheckList appears in its full form in the ACL Anthology and in a shortened form in the IJCAI proceedings. I understand why IJCAI does this, but adding this kind of thing as a special/exceptional case does “muddy the waters” and make it harder to explain the rules for duplicate publication.
What is duplication?
The other definitional question is when two papers are similar enough to be considered duplicates. The above-mentioned ICMJE guidelines essentially state that if an author submits a paper which includes material that has been published elsewhere, the author should explicitly indicate which parts have been published elsewhere and which parts are novel. Venues can then decide if the novel content is sufficient to warrant publication, with different venues applying different rules.
In NLP and AI, there is a tradition of publishing extended versions of conference papers in journals. For example, an expanded version of the INLG 2019 paper Best practices for the human evaluation of automatically generated text was published in the Computer Speech and Language journal as Human evaluation of automatically generated text: Current trends and best practice guidelines. I have done likewise; for example, Anya Belz and I published a Computational Linguistics (journal) paper in 2009 which was an expanded version of a paper we presented at EACL 2006. I have also seen cases where expanded versions of workshop papers are published in conferences. In all of these cases, there is an expectation that the expanded paper contains substantial new content which was not in the original paper.
I have also seen many cases where two conference papers or two journal papers (with overlapping authors) shared a lot of content. I’ve not seen any hard rules on how much sharing is acceptable; to me the guiding principle has always been that the novel content should be sufficient to warrant publication in the venue. Certainly in my own work, if you exclude journal papers which expand conference papers, I think my papers have generally had less than 25% overlap with previously published papers. There are a few exceptions, eg the journal paper Hunter et al 2012 is essentially an expanded version of an earlier journal paper Hunter et al 2011.
I can understand why newcomers to NLP get confused, especially if they are coming from an area of science which follows the simple rule that journal papers are archival and nothing else is! I hope the above makes sense (and is accurate), and I welcome questions and comments.