I recently had a chat with a newish PhD student at Aberdeen about his work, which involves ML. He showed me several experimental results. His algorithm did slightly better than baseline in the first experiment, and hugely better in the rest of the experiments. We discussed this, and it turned out that in his first experiment he was comparing against a published result as a baseline, whereas in the other experiments he was comparing against baselines he had himself implemented (in some cases based on algorithms in previous papers). I asked about tuning (eg, hyperparameters), and it turned out that he had carefully tuned his algorithm, but had not tuned any of the baselines. I suggested that this raised doubts about his results, and that he needed to use rigorous baselines for proper research; he promised to discuss this with his supervisors.
In a similar vein, I pointed out that some of the baselines he had implemented had suspiciously low results, barely better than the “choose-most-common” trivial baseline, which made me wonder whether they had been properly implemented. Also, his algorithm had suspiciously high results in some cases (eg, 100% accuracy), which made me wonder about the appropriateness of his test set.
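The comparison against the “choose-most-common” baseline is easy to automate. Below is a minimal Python sketch of such a check; the function names, the toy labels, and the 0.05 margin are all illustrative assumptions rather than anything taken from the student's experiments.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


def is_suspiciously_weak(baseline_accuracy, trivial_accuracy, margin=0.05):
    """True if a baseline barely beats the trivial one.

    The default margin of 0.05 is an arbitrary, illustrative threshold.
    """
    return baseline_accuracy - trivial_accuracy < margin


# Toy labels, made up for illustration only.
train_labels = ["pos", "pos", "pos", "neg", "neg"]
test_labels = ["pos", "neg", "pos", "pos"]

trivial = majority_class_accuracy(train_labels, test_labels)
print(f"choose-most-common accuracy: {trivial:.2f}")
print("a baseline scoring 0.78 looks weak:", is_suspiciously_weak(0.78, trivial))
```

A flag from a check like this does not prove the baseline is broken; it only says the implementation and tuning deserve a closer look.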
This was a fairly new PhD student, so it's not surprising that he got some things wrong in his research; after all, the whole point of a PhD is to learn from established academics how to do proper research! But what concerns me is that I suspect established researchers do similar things; indeed I recently saw some comments on Twitter claiming that poor baselines were a widespread problem in ML in NLP. Essentially the easiest way to make your work “look good” is to compare against a poor baseline, and people often (usually? almost always?) get away with this in papers, since it's hard for reviewers (esp in conferences) to judge the quality of hyperparameter tuning in a baseline, etc.
If you want to rigorously show that your approach is superior to existing approaches, then you need to compare your results to results from previous approaches!
New Algorithm for Standard Problem and Dataset
If you’re working on a standard problem and dataset (eg, participating in a shared task), then you should compare your results to the best previously-published results on the task and dataset. Relatively straightforward.
Algorithm for Variant of Standard Problem
If you’re working on a variation of a standard task, such as working in a new domain or using a new data set, then life is trickier. If you can get the code for the best-performing algorithm on the standard task, you could use it as your baseline, but to make the comparison fair you need to retune for the new domain and/or data set. There may also be issues with the code not working properly because it makes assumptions about character sets, presence of proper names, etc. This assumes you can get the code; if you cannot, you can try to reimplement it, but need to be very careful about introducing bugs and otherwise hurting performance. Also, the best performing algorithm in the new domain may not be the best performing algorithm in the standard version of the task.
So in an ideal world, you would take the top N best performing algorithms in the standard task, carefully recode them if necessary, and then carefully tune and optimise them for the new domain or data set. Which is a lot of work, especially considering that you probably hope that the baselines do poorly so that your algorithm will shine… But you do need to take this seriously if you want to do proper science! One approach is to get someone else to implement, tune, and evaluate the baselines; this can reduce conflict of interest.
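One way to keep the tuning even-handed is to push every system, baselines included, through the same search procedure. The Python sketch below assumes each system exposes a function that trains with a given configuration and returns a score on a development set; the two scoring functions and their numbers are invented stand-ins, and `grid_search` and `tune_all` are hypothetical helper names.

```python
from itertools import product


def grid_search(train_and_score, search_space):
    """Try every configuration in the grid and return (best_score, best_config).

    train_and_score(config) should train on the training data and return a
    score measured on a development set, never on the test set.
    """
    names = sorted(search_space)
    best_score, best_config = float("-inf"), None
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_score(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def tune_all(systems):
    """Run the same search procedure for every system, baselines included."""
    return {
        name: grid_search(train_and_score, space)
        for name, (train_and_score, space) in systems.items()
    }


# Stand-in scoring functions, invented for illustration only.
def score_my_algorithm(config):
    return 0.80 - abs(config["lr"] - 0.01)


def score_baseline(config):
    return 0.78 - abs(config["lr"] - 0.1)


learning_rates = {"lr": [0.001, 0.01, 0.1]}
systems = {
    "my_algorithm": (score_my_algorithm, learning_rates),
    "baseline": (score_baseline, learning_rates),
}

for name, (score, config) in tune_all(systems).items():
    print(name, round(score, 3), config)
```

In this toy setup the baseline scores about 0.68 at `lr=0.001` but 0.78 once tuned, so most of the apparent gap to the new algorithm comes from tuning rather than from the algorithm itself.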
What if you are working on a new task for which there is no previous work? In this case, it's not clear what the right baseline should be. Ideally you would try a bunch of standard techniques, being careful about implementation and tuning; but this is a lot of work, and it's very easy (and indeed tempting…) to omit some potentially promising baselines.
If you are working on a new task, then you could pitch your research contribution as the task itself (and accompanying data set if appropriate), and present your algorithm as an initial attempt which could serve as a baseline for future research. I personally am quite sympathetic to such papers when I act as a reviewer.
Regardless of the novelty of your task, you should always question and investigate your results! If a baseline does really badly, you need to understand why. Is there something fundamentally wrong with it, or are there problems with implementation or tuning? Similarly, if your algorithm does amazingly well, eg 100% accuracy, you need to investigate whether this is a brilliant breakthrough, or whether there is (for example) a problem with your test data.
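One concrete check when a result looks too good is whether test items also appear in the training data. This is only one of several possible test-data problems, and the Python sketch below is a rough first pass: it assumes the items are text, treats two items as the same if they match after lowercasing and whitespace normalisation, and uses made-up example sentences.

```python
def normalise(text):
    """Lowercase and collapse whitespace so near-identical items match."""
    return " ".join(text.lower().split())


def leaked_test_items(train_texts, test_texts):
    """Return test items whose normalised form also appears in training."""
    seen_in_training = {normalise(text) for text in train_texts}
    return [text for text in test_texts if normalise(text) in seen_in_training]


# Toy data, made up for illustration only.
train_texts = ["The cat sat on the mat.", "Dogs bark loudly."]
test_texts = ["the cat  sat on the mat.", "Birds sing at dawn."]

leaked = leaked_test_items(train_texts, test_texts)
print(f"{len(leaked)} of {len(test_texts)} test items also occur in training")
```

Exact matching will miss paraphrases and near-duplicates, so an empty result here does not show that the test set is clean.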
In general, any numerical evaluation should be accompanied by a qualitative analysis of why the results came out the way they did. This is the best way to identify the above-mentioned problems. And even if your research is solid, the qualitative analysis will probably give you good insights on improving your algorithm.