This year, I am co-teaching an MSc course on Evaluating AI Systems, and I got very passionate earlier this week when I started talking about the dangers of multiple hypotheses. It was probably the most passionate I’ve been in a lecture for several years, and I apologised to the students at the end for the “sermon”. But it is a topic I feel very strongly about.

### The Problem

In AI and indeed most of science, we routinely compute the statistical significance of an experimental test of a hypothesis. That is, we compute the probability that we would see a result at least as strong as the observed experimental result if our hypothesis was false; this is the “p value”. If the p value is less than a threshold (we usually use p < 0.05 in NLP, AI, and CS; high-energy physics uses the much lower “5 sigma” threshold of p < 0.0000003), we say that our experimental result is statistically significant, and confirms our hypothesis. I realise this summary is a bit of an over-simplification, but I think it’s good enough for my purpose here.

Anyways, this is fine if we’re testing one hypothesis. But what if we are testing 100 hypotheses? Even if all of these hypotheses are false, the chance that at least one of them will seem to be confirmed at p < 0.05 is (1 – 0.95**100) = 0.994 (assuming the tests are independent). In other words, even if all of our 100 hypotheses are false, the probability that at least one of them will have p < 0.05 is over 99%!
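The arithmetic above can be sketched in a few lines of Python. This is only a minimal illustration, and the function name `familywise_error_rate` is my own. It assumes the tests are independent and that every hypothesis is false, so each test has an `alpha` chance of a spurious "significant" result.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance that at least one of n_tests gives p < alpha when every
    hypothesis is false.

    Assumption: the tests are independent, so the chance that none of
    them is spuriously significant is (1 - alpha) ** n_tests.
    """
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    # The chance of a spurious result grows quickly with the number of tests.
    for n in (1, 10, 100):
        print(f"{n:>3} tests: {familywise_error_rate(n):.3f}")
```

With `n_tests=100` this gives roughly 0.994, matching the figure in the text.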

This is a real problem in medicine, for example, and has been criticised by researchers investigating the replication crisis, where results which are supposedly statistically significant cannot be reproduced by other researchers. For example, in genetics, there is a lot of interest in finding genes whose carriers are more likely to have a specific disease. But if we test 100 genes to see if their carriers are more likely to have autism (for example), the above calculation shows that there is over a 99% chance that at least one of the genes will seem to be linked to higher autism rates with p < 0.05, even if none of the genes actually are linked to higher autism rates.

The same thing happens if we try different outcome measures or statistics. For example, suppose I evaluate an NLG system on three different measures (perhaps human ratings of clarity, accuracy, and utility) and try three different statistical tests (perhaps t-test, Mann-Whitney test, and ANOVA). In this case, the chance of getting a “spurious” statistical significance result (ie, p < .05 even if all the hypotheses are false) is (1 – 0.95**(3*3)) = (1 – 0.95**9) = 0.37 . This treats the nine tests as independent, which is only an approximation when they are run on the same data, but it shows the scale of the problem. Ie, there is over a 1/3 chance that I’ll get a statistically significant result even if my system is useless!

Worst of all is post-hoc tweaking of hypotheses, measures, or stats. Another researcher once told me “I didn’t get the significant result I wanted, but not a problem! I just loaded the data into SPSS and started playing around with different stats and subgroups, and quickly found a plausible-sounding hypothesis which was statistically significant, which I published”. This kind of thing, where the researcher searches the set of possible hypotheses (and stats) until they find something that looks good, renders any statistical test meaningless. In other words, if you do this, you are pretty much guaranteed to find a hypothesis which seems significant at p < 0.05 (even if none of the hypotheses are); but the fact that you are guaranteed to find such a hypothesis means that this is a meaningless exercise which proves nothing.
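To make this concrete, here is a small simulation sketch of a researcher who keeps trying analyses on data with no real effect until one of them comes out "significant". It rests on two simplifying assumptions of mine: each analysis is independent of the others, and under a false hypothesis the p value is uniformly distributed between 0 and 1 (which holds for well-behaved tests on continuous data). The function names are hypothetical.

```python
import random


def found_something(n_analyses, alpha, rng):
    """Return True if any of n_analyses on null data gives p < alpha."""
    for _ in range(n_analyses):
        # Assumption: with no real effect, the p value is Uniform(0, 1).
        p = rng.random()
        if p < alpha:
            return True  # the researcher stops searching and publishes
    return False


def spurious_success_rate(n_analyses, alpha=0.05, n_researchers=20000, seed=0):
    """Fraction of simulated researchers who find a 'significant' result
    even though none of their hypotheses is true."""
    rng = random.Random(seed)
    hits = sum(found_something(n_analyses, alpha, rng)
               for _ in range(n_researchers))
    return hits / n_researchers


if __name__ == "__main__":
    for k in (1, 9, 100):
        print(f"{k:>3} analyses tried: {spurious_success_rate(k):.3f}")
```

The simulated rates should land close to the 1 – 0.95**k figures: about 5% for a single analysis, and nearly everyone "succeeding" once 100 analyses are on the table.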

### Solutions

Dealing with multiple hypotheses is not rocket science. We can apply a Bonferroni correction, ie divide the p value threshold by the number of hypotheses being tested (eg, look for p < 0.0005 = 0.05/100 if we are testing 100 hypotheses). Alternatively, we can conduct a pilot experiment to identify which hypothesis we think is most promising, and focus on this hypothesis in our main experiment. A related concept is using a validation data set in machine learning to choose a model and hyperparameters, so only a single system is tested in the final evaluation. There are other solutions as well.
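As a rough sketch of what a correction looks like in practice, the code below applies Bonferroni to a list of p values. It also shows the Holm step-down procedure, which is my own choice of example for one of the "other solutions" (it is a standard, slightly less conservative alternative). The function names and the example p values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each p value as significant only if it is below alpha / m,
    where m is the number of hypotheses tested."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure: compare the smallest p value to alpha / m,
    the next smallest to alpha / (m - 1), and so on, stopping at the
    first p value that fails its threshold."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (m - rank):
            significant[idx] = True
        else:
            break  # every larger p value is also non-significant
    return significant


if __name__ == "__main__":
    p_values = [0.001, 0.02, 0.015, 0.2]  # made-up example values
    print("uncorrected:", [p < 0.05 for p in p_values])
    print("Bonferroni: ", bonferroni(p_values))
    print("Holm:       ", holm(p_values))
```

On these example values, three results look significant without correction, Bonferroni keeps only the first, and Holm keeps the first three.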

Dealing with post-hoc tweaking is harder. Of course education helps, so people realise this is wrong! Medical researchers are expected to register a clinical trial on a website such as clinicaltrials.gov or www.clinicaltrialsregister.eu before they conduct the experiment, with details on hypotheses (etc). Then journals and regulators can check submissions against the website and detect post-hoc tweaking. We haven’t done this in NLP yet, and probably it’s overkill, but it would do no harm for researchers to write down hypotheses (etc) before they conduct an experiment, and refer to this when they write up results.

But a lot of NLP and AI researchers don’t bother. They present zillions of hypotheses in a paper without any sort of correction, and/or post-hoc tweak hypotheses to get a better result without mentioning this in their paper. I suspect that a lot of these people don’t even realise they are doing anything wrong when they do this, or think “everyone else does this, why shouldn’t I” (ie, it’s part of our scientific culture, which is NOT good).

Medicine and other fields such as psychology have faced up to the “replication crisis” and the fact that a lot of supposedly significant published results are garbage, in many cases because of multiple hypotheses. I suspect that the situation is just as bad (and indeed perhaps worse) in AI and NLP, but we have not faced up to this yet.