I often recommend that researchers do a “sanity” check on experiments. That is, manually inspect some (A) test/train data, (B) model/system output, and (C) evaluation output, looking for anything that seems strange. The purpose is to detect “bugs” such as incorrect data, models “cheating”, and code bugs. These unfortunately are fairly common, and an hour or two spent eyeballing the above will detect many (of course not all) such problems. Well worth doing to reduce the chance of spending weeks or months on buggy experiments!
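To make this concrete, here is a minimal Python sketch of the kind of eyeballing I mean: it prints a random handful of records from the training data, the model outputs, and the evaluation results so that a human can actually read them. The JSON Lines format and the file names are purely illustrative; adapt them to whatever your experiment produces.

```python
import json
import random

def sample_records(path, k=20, seed=0):
    """Return k random records from a JSON Lines file for manual inspection."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)
    return records[:k]

if __name__ == "__main__":
    # Hypothetical file names; point these at your own data, outputs, and scores.
    for label, path in [("train data", "train.jsonl"),
                        ("model output", "model_output.jsonl"),
                        ("evaluation output", "eval_results.jsonl")]:
        print(f"=== {label} ({path}) ===")
        for record in sample_records(path):
            print(json.dumps(record, indent=2, ensure_ascii=False))
```

The point is not the code, which is trivial, but the habit of actually reading what it prints and asking whether anything looks strange.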
Data problems
Unfortunately a lot of the datasets used in NLP and AI contain flawed data. Even worse, unless the dataset is very high profile, data problems are usually neither publicly reported nor fixed. This was brought home to me a few weeks ago when a colleague told me that he was using a respected dataset from 2017, and had discovered that some of the annotations he was interested in were often wrong. He had not known this (the fact was not advertised), and he asked me if it was possible to let other people know; I could not think of a reliable way to do this.
Even the most prominent datasets often have problems, although these may be reported. For example, the MMLU benchmark was widely used to evaluate LLMs even though around 10% of it was wrong; at least this fact was published (Gema et al 2024). Obscure datasets on Kaggle and Huggingface are *very* likely to have serious problems. The underlying issue is very poor quality control on datasets created by academics. Companies tend to take quality assurance more seriously, but even here there are problems.
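Eyeballing is the main defence, but a few cheap automated checks can also surface obvious problems before you build an experiment on a dataset. The sketch below is illustrative only: it assumes a JSON Lines classification dataset with hypothetical "text" and "label" fields, and simply counts empty texts, exact duplicate texts, and the label distribution.

```python
import json
from collections import Counter

def quick_data_checks(path, text_field="text", label_field="label"):
    """Report cheap-to-detect problems: empty texts, exact duplicates, skewed labels.
    Assumes a JSON Lines classification dataset; adjust the field names to your data."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    texts = [str(r.get(text_field, "")) for r in records]
    labels = [r.get(label_field) for r in records]

    empty = sum(1 for t in texts if not t.strip())
    duplicates = len(texts) - len(set(texts))

    print(f"{len(records)} records, {empty} empty texts, {duplicates} exact duplicate texts")
    print("Label distribution:", Counter(labels).most_common())

quick_data_checks("train.jsonl")  # hypothetical file name
```

None of this replaces reading the data, but it takes minutes to run and will catch some of the grosser problems.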
Anyways, to give a concrete example of sanity checking, last year I suggested that one of my PhD students consider participating in a shared task. She looked at some of the training data, and discovered that much of it made no sense. She showed this to me, and I agreed that participating in this shared task was not a good use of her time.
Model problems
AI models are very good at “cheating” (blog), that is, solving problems in a way that does not give an indication of real-world performance. Common techniques include data contamination, where the model finds the test set on the Internet and regurgitates it, and reward hacking, where the model corrupts the evaluation process (blog). Of course the models are not maliciously cheating, they are simply solving the problem in the most efficient manner. And copying answers from an Internet version of the test set is much more efficient than solving the problem from scratch.
It is possible to investigate model behaviour in detail for “cheating” (eg, Hamin and Edelman 2025). However, this takes a lot of time, and I have found that many such cases can be detected simply by looking for outputs that are amazingly good, ie “too good to be true”. For example a student recently showed me a case where a simple model got 100% accuracy on a difficult task, and I told him that this was probably data contamination.
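One cheap check along these lines, sketched below with hypothetical file names, is to count how many test items appear verbatim in whatever training or reference corpus you have access to. It is a crude exact-match test which will miss paraphrased contamination, so a count of zero proves nothing; but a non-zero count is a clear warning sign.

```python
import json

def contamination_check(test_path, corpus_path, text_field="text", min_chars=50):
    """Count test items whose text appears verbatim in a reference corpus.
    Short items are skipped because they can match by chance; this is a crude
    exact-substring check, not a proper contamination analysis."""
    with open(corpus_path, encoding="utf-8") as f:
        corpus = f.read()

    with open(test_path, encoding="utf-8") as f:
        test_texts = [str(json.loads(line).get(text_field, ""))
                      for line in f if line.strip()]

    hits = [t for t in test_texts if len(t) >= min_chars and t in corpus]
    print(f"{len(hits)} of {len(test_texts)} test items appear verbatim in the corpus")
    return hits

contamination_check("test.jsonl", "training_corpus.txt")  # hypothetical file names
```

Of course this only covers data you can actually see; contamination from pretraining corpora you cannot inspect has to be assessed in other ways, for example by asking whether the results are simply too good to be true.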
So if you run an experiment and get amazing results, check for the above before telling the world! This is what physicists and biologists do, and we should do this as well.
Evaluation problems
Many AI and NLP evaluations suffer from code bugs, reporting errors, and other “execution” problems. Indeed, when we looked at reproducing experiments from five papers as part of the ReproHum project, we discovered that *every* paper we reproduced had such problems (Thomson et al 2024). This should not be a surprise. After all, software engineering tells us to expect at least one bug in every 100 lines of code even in commercially developed code with extensive software testing. Since most research code does not go through a formal software quality assurance process, it seems likely that it has more than one bug per 100 lines of code.
In other words, the code we use to run evaluations and analyse results is almost certainly buggy. Other problems include reporting the wrong numbers (numbers in paper do not match experimental data), distorted analyses (eg, by inappropriately dropping outliers), and ethical lapses.
Anyways, we found in ReproHum that many such errors could be detected by looking for anomalies in the experimental results. For example, we saw a case where two systems had identical performance, which was surprising, so we investigated and realised this was because of a software bug in preparing system outputs for evaluation. It may be possible in the future to automatically detect some of these problems; Bianchi et al 2025 show that this is already becoming possible for writing errors like incorrect references.
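As a rough illustration of what such anomaly checks might look like, the sketch below flags exactly identical scores across systems, near-perfect scores, and identical output lists; the thresholds and the input format are invented for the example, not taken from ReproHum.

```python
from collections import Counter

def eval_sanity_checks(scores, outputs=None):
    """Flag result patterns that deserve a second look before publication.
    `scores` maps system name -> metric value; `outputs` (optional) maps
    system name -> list of output strings. Thresholds are illustrative."""
    # Two systems with exactly the same score is surprising and worth investigating.
    tied_values = [v for v, count in Counter(scores.values()).items() if count > 1]
    for value in tied_values:
        tied = [name for name, v in scores.items() if v == value]
        print(f"Identical score {value} for systems {tied}: check the evaluation pipeline")

    # Near-perfect scores on a hard task are often "too good to be true".
    for name, value in scores.items():
        if value >= 0.99:
            print(f"{name} scores {value:.3f}: check for contamination or leakage")

    # Identical outputs from supposedly different systems suggest a wiring bug.
    if outputs:
        names = sorted(outputs)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if outputs[a] == outputs[b]:
                    print(f"{a} and {b} produced identical outputs: check file handling")

eval_sanity_checks({"baseline": 0.71, "new_model": 0.71, "oracle_like": 1.00})  # invented numbers
```

A check like the first one would have flagged the identical-performance case described above immediately.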
Final thoughts
There is a temptation in AI research to simply download a data set, construct a pipeline which automatically runs the data through a model and evaluator, and then focus on experimenting with the model in order to improve the evaluation score, without ever looking at the data or model outputs. I think this is a mistake. AI research artefacts are buggy, and if we do not check for this, there is a chance that our experiment will be meaningless.
Exhaustively checking for data quality, model cheating, and software bugs takes a lot of time and effort, and probably is not appropriate for most academics. But it is possible to detect a lot of problems by just spending an hour or two checking some random data, model outputs, and evaluation results for behaviour that does not look right, and I think everyone who cares about meaningful research should do this.