It's reviewing season again, ie the time of year when academic conferences and journals ask me to review papers submitted to these events. As usual, a few are excellent and thought-provoking, a few are dire, and most are somewhere in between.
Anyway, one comment that I find myself making quite often, even on otherwise excellent papers, is that people should use two-tailed rather than one-tailed p values. This is a fairly dry statistical point, but it is an important one. I try to explain it below, starting from the basics.
What is a one/two-tailed p value?
When we propose a new algorithm, model, system, etc in AI, we usually want to compare it to existing algorithms (etc) and show that it is better. For example, suppose I have a new recommender algorithm for books. I can use A/B testing to see whether customers who use the new algorithm buy more books than customers using my old algorithm.
One issue, at least in academic/research contexts, is that such experiments are affected by luck. For example, if the customers who used the new recommender algorithm in the A/B test happened (just by chance) to be a bit richer than the customers who used the old algorithm, then they probably would buy more regardless of the algorithm. So if my A/B experiment shows that sales are slightly higher for the new algorithm, is this because the algorithm is better or because the people using it happened (just by luck) to be richer?
To address this concern, we can use statistical tests to assess the likelihood that the result we saw was due to chance. For example, suppose our A/B test showed that people using the new recommender algorithm bought 8.73% more than people using the old one. How likely is this to be due purely to chance? In other words, if our new algorithm is no more effective than the old one (the null hypothesis), what is the chance that we would see an 8.73% difference in purchases purely because of differences in customers and other such random factors?
Well, the chance that we would see a difference of exactly 8.73% is minuscule, effectively zero. So instead, we ask what the probability is that we would see a difference of at least 8.73% (ie, 8.73% or higher) if in fact there was no difference between the effectiveness of the algorithms. This probability is called the p value (probability value).
But what do we mean by “difference of at least 8.73%”? Do we mean
- What is the chance that people using the new algorithm would buy at least 8.73% more than people using the old one, if the algorithms were in fact equally effective? This is the one-tailed p value.
- What is the chance that people using one of the algorithms would buy at least 8.73% more than people using the other one, if the algorithms were equally effective? This is the two-tailed p value.
In other words, when computing the likelihood of seeing a difference of XX percent or more just because of luck, do we include cases where the new algorithm is worse than the old algorithm (bad luck) as well as cases where it is better?
From a practical perspective, two-tailed p values are usually twice one-tailed values. For example, suppose the probability that users of the new algorithm will buy at least 8.73% more purely because of luck is 0.0326. Then most likely the probability that users of the old algorithm will buy at least 8.73% more purely because of luck is also 0.0326. Hence the one-tailed p-value is 0.0326, and the two-tailed p-value is 0.0326+0.0326 = 0.0652. So one-tailed p values are lower than two-tailed ones, which makes them attractive to researchers who are trying to show that their experimental results are not just due to luck.
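To make the doubling concrete, here is a minimal sketch in Python. It uses a normal (z) approximation and a made-up effect size; the numbers are illustrative, not from a real experiment.

```python
import math

def one_tailed_p(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical result: the observed difference is 1.84 standard errors above zero.
z = 1.84
p_one = one_tailed_p(z)  # chance the NEW algorithm looks this much better by luck
p_two = 2 * p_one        # chance that EITHER algorithm looks this much better by luck

print(f"one-tailed p = {p_one:.4f}")  # roughly 0.033
print(f"two-tailed p = {p_two:.4f}")  # exactly double the one-tailed value
```

Note that in this example the one-tailed value falls below the conventional 0.05 threshold while the two-tailed value does not, which is exactly the situation discussed later in this post.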
Note that only a few statistical tests (such as the t-test and Pearson correlation) can give both one-tailed and two-tailed values. Most statistical tests (such as ANOVA and chi-square) just give a single p value.
Why Do I Recommend Two-Tailed Values?
There is nothing wrong with one-tailed p values from a mathematical perspective, and they are very easy to convert into two-tailed values (just multiply by two). But I still recommend against their use in scientific papers in NLG, NLP, and AI.
The biggest problem with one-tailed p-values is comparability with other results. Most scientific papers present two-tailed p-values, which means that it's easier to assess how a new paper fits into the literature if it also uses two-tailed p-values. Perhaps even more importantly, two-tailed p-values are much easier to compare with the results of tests that don't make the one/two-tailed distinction.
For example, when I am comparing purchases amongst users of different recommendation algorithms, I can use a t-test or an ANOVA. The t-test is simpler, and can give one or two-tailed p-values. ANOVA is more powerful, and for example allows me to compare more than two algorithms and to incorporate other factors (such as previous buying history) into the comparison. But ANOVA only gives a single p-value; the one-tailed vs two-tailed distinction doesn't make sense for it. And ANOVA's p-value corresponds to the two-tailed p value from a t-test, not the one-tailed p-value.
So if I do a t-test and present a two-tailed p-value, then I have the freedom to switch to ANOVA at a later date, and I will get comparable results; my results are also comparable with those of other researchers who use ANOVA. I lose this comparability if I present a one-tailed p-value from my t-test.
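The reason ANOVA matches the two-tailed (not one-tailed) t-test is that, for two groups, the ANOVA F statistic is exactly the square of the pooled-variance t statistic, so a large F corresponds to a large |t| in either direction. The following sketch checks this identity with made-up per-customer purchase counts (the data is hypothetical, and the statistics are computed from scratch rather than with a stats library):

```python
import math
from statistics import mean

# Hypothetical per-customer purchase counts from an A/B test (made-up numbers).
old = [3, 5, 4, 6, 2, 5, 4, 3]
new = [5, 6, 4, 7, 5, 8, 6, 5]

def t_statistic(a, b):
    """Pooled-variance (equal-variance) two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def f_statistic(a, b):
    """One-way ANOVA F statistic for two groups (1 and n-2 degrees of freedom)."""
    ma, mb, grand = mean(a), mean(b), mean(a + b)
    ss_between = len(a) * (ma - grand) ** 2 + len(b) * (mb - grand) ** 2
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return (ss_between / 1) / (ss_within / (len(a) + len(b) - 2))

t = t_statistic(new, old)
f = f_statistic(new, old)
print(f"t = {t:.3f}, t^2 = {t * t:.3f}, F = {f:.3f}")  # t^2 and F coincide
```

Because F = t², the probability of seeing an F this large under the null hypothesis equals the probability of seeing a |t| this large in either tail, which is precisely the two-tailed t-test p-value.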
A related point is that it is easier to compare the results of a two-tailed test to a 95% confidence interval, which is another standard way of presenting the results of a statistical test.
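This correspondence can be sketched as follows: a 95% confidence interval excludes zero exactly when the two-tailed p-value is below 0.05, whereas a one-tailed p-value has no such direct relationship. The sketch below uses a normal approximation and made-up numbers (the 8.73% difference and its standard error are purely illustrative):

```python
import math

def two_tailed_p(diff, se):
    """Two-tailed p-value for an observed difference, normal approximation."""
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2))

def ci95(diff, se):
    """95% confidence interval; 1.96 is the two-tailed 5% normal critical value."""
    return (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical observed difference in purchases (%) and its standard error.
diff, se = 8.73, 4.1
p = two_tailed_p(diff, se)
lo, hi = ci95(diff, se)
print(f"two-tailed p = {p:.4f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# The CI excludes zero exactly when the two-tailed p is below 0.05
# (up to the rounding in the 1.96 critical value).
```

So a reader who sees a two-tailed p-value below 0.05 knows immediately what the corresponding confidence interval looks like; a one-tailed p-value does not offer this shortcut.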
The other problem with one-tailed p-values is that they tend to make reviewers (such as me) suspicious. I have seen a few papers over the years where the use of one-tailed p-values was carefully and rigorously justified by people who clearly had a deep understanding of statistics. I have seen *many* more papers where the authors didn't seem to have a clue about the above issues, and just used one-tailed p values because these gave them better-looking (ie, smaller) results, especially when the two-tailed p-value was above the usual “significance” threshold of 0.05 and the one-tailed p-value was below it.
So when I review a paper and see one-tailed p-values being used, this triggers an “alarm bell” in my mind, and I become suspicious about the whole experiment unless the authors make it very clear that they have a really good reason for using one-tailed p-values (which rarely happens).
Unless you really know what you are doing statistics-wise, you should stick to two-tailed p values. And even if you do have a deep understanding of stats, you should only use one-tailed values in exceptional circumstances, which you will need to carefully explain and justify in your paper.