Twenty years ago, I spent several hours chatting with a speech recognition researcher while we were both at a boring workshop. At the time, the speech community was dominated by papers which modified ML models in order to show a small improvement in state-of-the-art results on some data set; this kind of paper was less common in NLP, and unheard of in NLG. I asked my colleague why all of these 1% improvements did not translate into rapid progress overall. That is, if lots of people were publishing papers showing how to improve word error rate (WER) by 1%, then by putting these together shouldn't we be seeing dramatic improvements (a 50% reduction in WER?) in speech recognition systems? But this was not happening; at the time, speech systems were improving slowly, not dramatically. Why was this?
My colleague (who had become a bit cynical) said that he thought the vast majority of published improvements in the state of the art were due to overfitting test data, and were not generally applicable. To put it crudely, his view was that the major speech conferences were full of papers making bogus claims about ideas which did not in fact work in a general speech recognition context.
Twenty years later, I am beginning to think that this is true of NLP as well. Most of our large conferences are dominated by papers where someone modifies a deep learning model and shows a small improvement in the state of the art on some task, data set, and evaluation metric. I've complained elsewhere about the fact that a lot of the data sets and evaluation techniques are dubious; indeed, a lot of the tasks are also dubious (weird tasks of no real-world utility, or gross over-simplifications of real-world tasks). But even more fundamentally, I suspect that overfitting test data is rife, and much of what gets published even in "prestige" events is scientifically worthless.
I should say that many other people are saying similar things; I don't claim that the insights below are original to me!
Overfitting a data set
The first thing my friend mentioned 20 years ago was that a lot of techniques only worked on specific data sets. For example, a technique which reduced WER in a telephone dialogue system for flight bookings might have no impact on WER when transcribing notes dictated by a doctor. And even within telephone dialogue systems, a technique which worked for flight bookings might not work for cinema enquiries. Indeed, techniques which worked on one telephone flight booking data set might not work on another telephone flight booking data set, because the techniques were sensitive to the participants (e.g., accent), the city names used in bookings, etc.
From a scientific perspective, the key thing here is to make the scope clear in our hypothesis and claims. E.g., explicitly say that a technique is effective in telephone dialogue systems but not in medical transcription, or that it works for RP accents but not for Scottish accents. But most speech researchers did not do this 20 years ago; they just published results showing a state-of-the-art increase on a data set, without saying anything about whether their techniques would generalise to other data sets.
I'm seeing similar things now in NLP. For example, I've recently seen several multimedia papers which showed increases in state-of-the-art performance on a data set consisting of episodes from a TV series. Now TV shows are very different from real-world multimedia contexts, but none of the papers I saw discussed or even acknowledged this. So we don't know whether their "state-of-the-art improvements" will actually materialise in real applications.
Repeated attempts on the same test data
A lot of data sets (especially from shared tasks) come with suggested partitions into training, validation, and test data. If researchers who use such a data set (after the shared task has finished) repeatedly stick to this partition, then they will be repeatedly testing their ideas on the same test data. This is dangerous: it essentially allows researchers to overfit their algorithms to one specific test data set.
For example, the E2E challenge had a fixed test set which contained 630 inputs (meaning representations). This means that we can get a 1% improvement in a system's performance simply by identifying 6 inputs where our system does poorly, and tweaking the system to handle these specific inputs better. The E2E test set also contains more complex inputs (on average) than the training data (see Table 3 in Dusek et al); this again gives opportunities to tweak our system to boost test-set performance once we are aware of it.
I don't mean the above as a criticism of the E2E challenge! This kind of thing is unavoidable: it is statistically inevitable that a randomly selected test set will differ in some ways from the training set, and a lot of NLP test sets are small enough that improving performance on a handful of cases will give us a 1% increase overall.
Anyway, the above did not have an impact on the systems which participated in the original E2E challenge, since these were developed without seeing the test data. But subsequent researchers who use the E2E data set do have access to the test data when they develop their systems, so they can optimise their algorithms to do better on specific test cases, and/or tune their algorithms to fit the statistical peculiarities of the test set. This overfitting to the test data makes it relatively easy to outperform the original E2E systems on the fixed E2E test set (but perhaps nowhere else). Note that this overfitting "optimisation" is done when developers design and modify the model; it doesn't involve explicitly training the model on the test data (which most people realise is unacceptable).
Such overfitting can be done deliberately, after reading the above-cited paper and studying the test data set. It can also be done "unconsciously" (which is perhaps more common), by trying different models on the E2E test data and choosing the one which works best. Either way, if a researcher overfits to a known test data set, his or her results are scientifically questionable.
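The "unconscious" version of this effect is easy to simulate. The sketch below is purely illustrative (the test-set size matches the E2E test set, but the accuracy figure and number of variants are made-up numbers): every model variant has exactly the same underlying quality, yet picking the variant that scores best on a fixed 630-item test set reliably produces a score above the true accuracy, while a fresh test set brings it back down.

```python
import random

random.seed(0)

TEST_SIZE = 630    # size of a fixed E2E-style test set
TRUE_ACC = 0.70    # assumed true quality; every variant is equally good
N_VARIANTS = 50    # hypothetical number of variants tried on the test set

def evaluate(acc, n):
    """Observed accuracy of a model with true per-item accuracy `acc` on n items."""
    return sum(random.random() < acc for _ in range(n)) / n

# "Unconscious" overfitting: score every variant on the SAME test set
# and keep whichever one happens to look best.
scores = [evaluate(TRUE_ACC, TEST_SIZE) for _ in range(N_VARIANTS)]
best = max(scores)

# Re-score a model of identical true quality on fresh, unseen data.
fresh = evaluate(TRUE_ACC, TEST_SIZE)

print(f"best score on the fixed test set: {best:.3f}")  # typically above TRUE_ACC
print(f"same-quality model, fresh data:   {fresh:.3f}")  # typically near TRUE_ACC
```

No variant is actually better than any other here; the gap between the "best" score and the fresh-data score is pure selection noise, which is exactly what a leaderboard on a fixed public test set rewards.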
Repeated attempts on different test data sets
But what if a researcher uses different test data every time? For example, in a context where new data is coming in every day, what happens if the researcher simply tests his or her model on that day's data, which they haven't seen before?
This strategy suffers from multiple hypothesis issues if the researcher simply reports the best result. For example, let's say the researcher tries out 100 different algorithms on 100 different days, and gets the best results from algorithm 37. He then writes a paper about how great algorithm 37 is, using the data from day 37, with a comparison to the current state-of-the-art algorithm. The question here is whether algorithm 37 did well because it is a good algorithm, or because the day 37 data set was (by pure luck) a good match to the strengths of algorithm 37. The laws of probability tell us that if we run the above experiment, some algorithms are going to get "lucky" with their test data. The correct way to do this kind of experiment is to use the 100 days of data to choose the most promising algorithm (as above), but then test the winner on a new data set (day 101?) and report those findings, not the findings from day 37 (or whatever). This gets rid of the "luck factor".
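The held-out-day protocol can be sketched in a few lines. Again this is a toy simulation with made-up numbers (items per day, true accuracy, number of algorithms are all illustrative assumptions): all 100 algorithms are equally good, so the winner's selection-day score is inflated by luck, and the honest number to report is its score on the held-out day 101.

```python
import random

random.seed(1)

DAY_SIZE = 200     # hypothetical number of test items per day
TRUE_ACC = 0.70    # assumed true quality; all algorithms are equally good
N_ALGOS = 100

def score(acc, n):
    """Observed accuracy of a model with true per-item accuracy `acc` on n fresh items."""
    return sum(random.random() < acc for _ in range(n)) / n

# Each algorithm is tried on a different day's data (a fresh sample each time);
# we then pick the algorithm with the best score, as in the scenario above.
day_scores = [score(TRUE_ACC, DAY_SIZE) for _ in range(N_ALGOS)]
winner = max(range(N_ALGOS), key=lambda i: day_scores[i])
selection_score = day_scores[winner]   # inflated: the winner got lucky

# The honest protocol: test the chosen winner once more, on day 101,
# and report THIS number instead of the selection-day score.
holdout_score = score(TRUE_ACC, DAY_SIZE)

print(f"winner's selection-day score: {selection_score:.3f}")
print(f"winner's day-101 score:       {holdout_score:.3f}")
```

Because the held-out evaluation happens only once, after all selection decisions are frozen, it is not subject to the max-over-many-tries inflation that makes the selection-day score misleading.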
A related issue is iterating an experiment until we get a good result, and then stopping. For example, let's say we follow the above process and get lousy results, so we change a few things and repeat the exercise (another 101 days of data). But this too is disappointing, so we make more changes and repeat a third time. This time we finally get good results, which we publish. It turns out that this strategy is very similar from a multiple hypothesis perspective, and suffers from the same issues.
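This "repeat until it works" loop can also be simulated. In the hedged sketch below (all numbers are illustrative assumptions), the changes we make between attempts do nothing at all, so each attempt is just a fresh noisy measurement of the same mediocre model; yet if we keep going until one run clears our target, we are guaranteed to eventually "succeed" and publish an inflated score.

```python
import random

random.seed(2)

TEST_SIZE = 300    # hypothetical test items per attempt
TRUE_ACC = 0.70    # the model's real quality never changes
TARGET = 0.74      # the "good result" at which we stop and publish

def score(acc, n):
    """Observed accuracy of a model with true per-item accuracy `acc` on n fresh items."""
    return sum(random.random() < acc for _ in range(n)) / n

# "Change a few things and repeat": the changes are placebo, so every
# attempt is simply another draw from the same noisy distribution.
attempts = 0
result = 0.0
while result < TARGET:
    attempts += 1
    result = score(TRUE_ACC, TEST_SIZE)

# We stop at the first lucky run and report it, hiding the failed attempts.
print(f"published score {result:.3f} after {attempts} attempts")
```

The published score exceeds the model's true accuracy by construction, which is why an honest report needs to describe the whole sequence of attempts, not just the final one.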
So how do we keep ourselves honest and avoid overfitting the data? From a pragmatic perspective, it certainly helps to test your models within a shared task (at the time the task is run, not afterwards). Hopefully the shared task organisers will choose sensible and representative data sets (which doesn't solve the data set overfitting problem, but does help a bit). More importantly, shared task participants don't see the test data until the end, which helps with the "repeated attempts on the same data" issue. And there is only one test data set, so we can't try 100 different test sets and just report the one that made our system look good.
Ultimately, though, the only real way to detect whether a result is due to overfitting is to have other people (preferably not close friends or colleagues) test your models and ideas on other data sets and in other contexts. One encouraging thing is that I am seeing an increasing number of such replication studies in NLP, although they are still much less common than papers which present models.
I am worried, though, that a lot of NLP researchers don't seem to understand the issues I describe above, and indeed may not even care. They simply want to show that their model beats the state of the art on a data set, and have no interest in overfitting issues (or other methodological problems) as long as their paper gets accepted and published. In an ideal world, reviewers would reject papers that rely on overfitting, but a lot of the above issues arise from the experimental process (e.g., a researcher reporting the success on day 37 without mentioning the 99 failures), and this information is usually not provided in research papers.