I got a lot of comments and feedback on my previous blog (Learning does not require evaluation metrics), most of which expressed concerns that there were serious methodological problems with many (not all!) papers on ML approaches to NLP, even in “good” venues. I’ve summarised some of these points below, bringing in my own perspective and experiences. Indeed, much of what people told me echoes my own disappointment and frustration when I first read papers on deep learning research in NLG.
Poor evaluation: A lot of ML research is poorly evaluated. I’ve written extensively about evaluation in my blog, and should have a paper appearing soon based on my structured survey of the validity of BLEU. BLEU is probably the best of the evaluation metrics, but it clearly works better in some contexts than in others; and even when it does correlate well with human judgements, it should be used with caution. But a lot of ML researchers ignore these subtleties and use BLEU inappropriately. And quite a few researchers seem to simply invent their own evaluation metric, present some not-very-convincing evidence that the metric means something, and then publish papers based on it.
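To make the discussion concrete, here is a minimal sketch of sentence-level BLEU for a single reference, with no smoothing. Real implementations (e.g. sacrebleu) operate at corpus level, handle multiple references, and apply smoothing, which is exactly why naive reimplementations of BLEU are one source of the subtle misuse discussed above. The function and variable names here are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: single reference, no smoothing.

    Returns the brevity-penalised geometric mean of clipped n-gram
    precisions for n = 1..max_n.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(hyp_ngrams.values())
        if total == 0 or overlap == 0:
            # Without smoothing, any zero precision zeroes the whole score.
            return 0.0
        precisions.append(overlap / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Note how brittle the unsmoothed score is: a single missing 4-gram match sends the score to zero, which is one reason short or single-sentence BLEU scores should be treated with particular suspicion.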
Inappropriate training data (corpora): Another concern which resonated with me was that many researchers train their systems on unsuitable data sets, most commonly noisy ones. One thing that really bothers me is training an ML NLG system on the output of a rule-based system: at best, the ML system learns to imitate the rule-based system, limitations and all. A more subtle problem is that some researchers, and indeed some subfields of NLP, repeatedly use the same data set for training and testing (using cross-validation). This leads to researchers becoming very familiar with the data set, which raises concerns about its suitability for testing: if researchers know the test data, they can design their systems to do well on it, even without explicitly training on the test data.
Incorrect presentation of related work: Many papers are very selective in presenting related work, and hence miscommunicate what the state-of-the-art baseline actually is. Partly this is due to people being unaware of related work; a particular problem here is that ML researchers are often unaware of, and indeed uninterested in, the performance of non-ML systems. But concerns were also expressed to me that in some cases researchers misrepresent the papers they cite, in order to make their own research look better.
Paper-publishing incentives favour incremental research: Some people expressed concerns that the structure of the field rewards people who do incremental research, often with existing data sets and evaluation metrics. In other words, the most cost-effective way to publish lots of papers, even in “good” venues, is to tweak ML approaches and show that this leads to improved performance in some contexts, as measured using (often questionable) evaluation metrics and data sets. A student who is working on something really different and innovative commented that publishing papers was much harder (more work) for him than for other students. A researcher told me that because he is under a lot of pressure to publish many papers, he is not able to investigate new ideas as thoroughly as he could ten years ago.
Somewhat concerning! Of course the comments I received were biased, in the sense that people who were concerned about the above issues were more likely to contact me than people who were not. But still, I think there are some real problems here, assuming of course that our goal is to develop useful technologies and meaningful scientific insights. If the main goal of a “researcher” is just to get the best score in a contest, without worrying about whether the contest is meaningful, he would be better off playing computer games.
I don’t know what the “solution” is to these problems. One perspective is that this is a failure of reviewing as a “quality control” mechanism. Perhaps the NLP field should focus more on journal publishing (like 90% of other scientific fields), since journal reviewing (at least in my experience) is more thorough, in part because journal reviewers can have meaningful interactions with authors.
One thing that I think would help, and which I would love to see, is papers describing how well NLP systems work in real usage. In medicine, top papers need to present clinical trials which show how well an intervention worked when used to treat real patients. So it would be great to see papers which measure how effective NLP systems are when they are used in the real world (eg, like Reiter et al 2003 or Hunter et al 2012); this would automatically address many of the above concerns about evaluation and test data. But unfortunately such papers remain rare in NLP.