I have always thought that it was a “no-brainer” that NLP researchers should use appropriate and high-quality data sets for training and evaluation. But I am now beginning to think that the NLP field in fact *encourages* researchers to use poor-quality and inappropriate data sets, which is a depressing thought.
Junior Researcher: Easier to get papers and funding with poor data sets
About a year ago I was contacted by a junior researcher who asked me where he could get the Weathergov corpus. I explained to him that the Weathergov corpus contained the output of a rule-based NLG system, and hence using ML on Weathergov was mostly an exercise in reverse-engineering the rule-based system (i.e., stealing the IP of the people who wrote the rules), not an exercise in NLG as we usually think of it. I suggested that he instead use the SumTime corpus, which contains human-written weather forecasts.
However, this researcher then told me that it was much easier to publish papers in ACL-like venues if he used Weathergov instead of SumTime (and certainly a lot more ACL papers use Weathergov than use SumTime). He also pointed out that the first author of a NAACL 2018 paper based on Weathergov had been awarded a fellowship from Google. In other words, it was clear to him that the best way to progress his career, in terms of both publications and funding, was to use Weathergov. So why wouldn’t I help with this?
I can’t blame the researcher who contacted me; he is simply responding to the incentives he is presented with. But I think it is a very bad sign for the field that young researchers see that the way to “get ahead” is to use questionable data sets.
Reviewing: We cannot question a data set if it’s been used before
A recent interaction reinforced this impression. I was reviewing a paper, and was concerned that some of the data sets used by the paper were unrepresentative and otherwise inappropriate. When I raised this concern, though, one of the other reviewers said that since these data sets had been used by previous researchers, it was unfair to reject the paper on this basis. In other words, the other reviewer thought that once a data set had been used a few times in published papers, it was no longer appropriate to question its usage in papers.
I feel really uneasy about this, especially given the mixed quality of reviewing at conferences and (especially) workshops. In my mind, the fact that a data set has been used in a previously published paper does not mean that it is representative and appropriate, since I have seen many papers (even at prestige venues such as ACL) use very inappropriate data sets. I do appreciate that many researchers have a different perspective, and focus on showing that their techniques improve on the state of the art on existing data sets, without worrying about the relevance and appropriateness of these data sets. But in all honesty I think that if we want to make progress in NLP, both practically and theoretically, we need to work with sensible data sets.
Gresham’s Law: Do bad data sets drive out good ones?
The whole thing is very depressing, and I sometimes wonder if there is a sort of “Gresham’s Law” operating with NLP data sets. Creating a good data set is a **lot** of work; it’s so much easier to just grab some random stuff off the internet without worrying about representativeness, quality, diversity, reliability of annotations, etc. So if the NLP community doesn’t distinguish between “good” and “bad” data sets (after all, we can still pump out zillions of papers showing a 0.5% increase on the state of the art, regardless of the quality of the data set), then people are likely to continue creating and using poor-quality data sets. In other words, we can publish more papers if we ignore quality, and reviewers don’t seem to care…
Can we do anything about this?
Can the community do anything to encourage the use of good data sets? I don’t know; certainly what has happened with evaluation metrics is not encouraging. We’ve known about the problems with BLEU and other metrics for 15 years, but we still use them in contexts where they are inappropriate. It would help if reviewers, especially for journals and prestige conferences, insisted on proper data sets and evaluation techniques, but I don’t know how likely this is.
I once published a paper in the British Medical Journal (BMJ), and they had a special reviewer whose job was solely to check the quality of statistical analyses and other evaluation details. I don’t think this is feasible at NLP conferences (too large, too short a time scale for reviewing), but maybe this is something our journals could consider?
On a smaller scale, we should at least make researchers aware of problems with data sets. I’ve seen cases where people use poor data sets (and indeed evaluation techniques) because they don’t realise there are problems with them, since the people who know about these problems do not publish this information. There’s not much I can do about this in general, but in the specific case of SIGGEN’s list of Data Sets for NLG, I will update the list if I discover problems with data sets. For example, the SIGGEN list does tell you how to get WeatherGov, but also clearly states that it consists of computer-generated forecasts rather than human-written forecasts.