As we saw in our last post on the Top Ten Reasons Papers Get Rejected, errors in statistical analysis are among the most common grounds for rejection. Errors in the interpretation of the p-value, in particular, have long been acknowledged and unfortunately persist in the scientific literature. In this article, we cover five of the most widespread misconceptions surrounding this statistical tool.
The Origin of the P-Value
The p-value was popularized in research in the 1920s by British statistician Sir Ronald Fisher. It was originally meant to serve as a rough numerical guide to help scientists decide which data to take seriously. Around the same time, the theory of hypothesis testing was developed by Jerzy Neyman and Egon Pearson. In hypothesis testing, data sets are compared, and a hypothesis is proposed for a relationship between variables. This hypothesis is then considered as an alternative to the default hypothesis of ‘no relationship’, which is known as the null hypothesis.
Researchers soon began combining the p-value with hypothesis testing, a method that is today widely used to accept or reject scientific hypotheses. By convention, a p-value of 0.05 is used as the cut-off point below which results are considered statistically significant.
Interestingly, what Fisher meant by ‘significance’ was simply that the data were interesting and worthy of further experimentation. Little did he know that his concept would spiral so out of control.
The Problem with the P-Value
Today, the 0.05 value has become a type of ‘magic threshold’ or ‘limbo bar’ that endows scientific results with importance if they manage to pass below it. Results that don’t pass this arbitrary cut-off value are deemed ‘insignificant’ and hidden away, never to see the light of publication.
The need for small p-values to validate the significance of research has even led some to ‘p-hacking’. This regrettable practice, also known as ‘data fishing’ or ‘snooping’, involves fiddling with data or repeating analyses until a low p-value is attained. This, essentially, is cheating.
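To see why repeated testing makes low p-values easy to come by, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is in fact true; the function name is ours, chosen for illustration.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests of a
    true null hypothesis falls below `alpha` purely by chance."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Roughly 0.05 for one test, 0.23 for five tests and 0.64 for twenty tests.
for k in (1, 5, 20):
    print(f"{k} test(s): {chance_of_false_positive(k):.3f}")
```

Under these assumptions, a researcher who tries twenty unrelated analyses has better-than-even odds of finding at least one 'significant' result even when nothing real is going on.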
Seeking to eradicate such abuses, some journals have officially banned the use of the p-value, but not everyone agrees with taking these measures. Some argue that abolishing the p-value is akin to ‘throwing the baby out with the bathwater’, and that it would be better to provide guidance on its proper use and limitations instead.
Below are five of the most common errors surrounding the p-value. To better understand them, it will be useful to keep in mind the following definition:
The p-value is the probability of obtaining the observed results, or more extreme ones, when the null hypothesis is true.
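As a concrete illustration of this definition, consider a hypothetical coin-tossing experiment (our example, not taken from any study): the null hypothesis is that the coin is fair, and we observe 16 heads in 20 flips. A minimal sketch of the one-sided p-value, the probability of 16 or more heads from a fair coin, might look like this:

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """One-sided p-value: probability of `heads` or more heads in `flips`
    tosses, assuming the null hypothesis (heads probability `p_null`)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


print(round(binomial_p_value(heads=16, flips=20), 4))  # 0.0059
```

Note that the calculation starts from the assumption that the coin is fair. It describes how surprising the data are under that assumption, not how likely the assumption itself is.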
Common Misconceptions about the P-Value
- The p-value is the probability that the null hypothesis is true.
Wrong. As we can see from the above definition, the p-value is calculated on the assumption that the null hypothesis is true, so it cannot also be the probability of that assumption. For the same reason, the p-value is not the probability that the results are due to chance: in assuming the null hypothesis is true, the calculation already treats chance as the only reason for the differences observed.
- A low p-value means the alternative hypothesis is true.
A p-value below 0.05 does not automatically mean the alternative hypothesis is true. Low p-values suggest that the observed results are not consistent with what would be expected if the null hypothesis were true. The p-value alone cannot tell us whether the null hypothesis is really false or whether an unusual result simply occurred by chance under a true null hypothesis.
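The gap between a p-value and the probability of a hypothesis can be made concrete with Bayes' rule. The sketch below is illustrative only: the share of tested hypotheses that describe a real effect (`prior_real`) and the statistical power (`power`) are assumed inputs that are rarely known in practice, and the function name is ours.

```python
def share_of_true_nulls_among_significant(
    prior_real: float, power: float, alpha: float = 0.05
) -> float:
    """Among results with p < alpha, the fraction for which the null
    hypothesis is actually true (all inputs are assumed values)."""
    false_positives = alpha * (1 - prior_real)  # true nulls that pass by chance
    true_positives = power * prior_real         # real effects that are detected
    return false_positives / (false_positives + true_positives)


# If only 1 in 10 tested hypotheses is real and power is 80%:
print(f"{share_of_true_nulls_among_significant(0.10, 0.80):.2f}")  # 0.36
# If half of the tested hypotheses are real:
print(f"{share_of_true_nulls_among_significant(0.50, 0.80):.2f}")  # 0.06
```

With the same 0.05 threshold, the share of 'significant' findings that are false alarms ranges from about 6% to 36% in these two scenarios, which is why a p-value cannot be read as the probability that either hypothesis is true.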
- The p-value reflects the clinical importance of an effect.
The p-value says nothing about the size of an effect, or about its clinical importance. In very large studies, even small effects may reach statistical significance. For example, in a trial comparing two treatments for hypertension, patients receiving Drug A may have significantly lower blood pressure than those receiving Drug B. However, the difference might be so small that it makes no clinical difference to the patients.
Alternatively, in small studies, even big effects may be drowned in noise.
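A rough sketch of both situations, using a simple two-sample z-test, is shown below. The numbers are invented for illustration: a standard deviation of 15 mmHg assumed known and equal in both groups, a 1 mmHg difference in a trial with 10,000 patients per group, and a 10 mmHg difference in a trial with 10 patients per group.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test with a known, common SD
    and equal group sizes (a simplifying assumption for this sketch)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, huge trial: statistically significant.
large_trial = two_sided_p(mean_diff=1.0, sd=15.0, n_per_group=10_000)
# Large effect, tiny trial: not statistically significant.
small_trial = two_sided_p(mean_diff=10.0, sd=15.0, n_per_group=10)

print(f"1 mmHg difference, n=10,000 per group: p = {large_trial:.2g}")
print(f"10 mmHg difference, n=10 per group:    p = {small_trial:.2g}")
```

In this sketch the clinically trivial 1 mmHg difference falls far below 0.05, while the tenfold larger difference does not, simply because of the sample sizes involved.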
- If the same hypothesis is tested in different studies and the p-value is above 0.05 in all or most of them, it is safe to conclude that there is no effect.
False. Absence of evidence is not evidence of absence. Even when every individual study has a p-value above the significance level, the studies taken together may still reveal a statistically significant effect. For this reason, it is important to properly conduct a meta-analysis when considering the overall evidence of several studies.
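One simple way to see how non-significant studies can add up is Stouffer's method for combining p-values. This is only a sketch and not a substitute for a full meta-analysis: it assumes the studies are independent, equally weighted, and report one-sided p-values for an effect in the same direction. The five p-values of 0.10 are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(p_values: list[float]) -> float:
    """Combine independent one-sided p-values with Stouffer's method
    (unweighted): convert each to a z-score, sum, and rescale."""
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1 - std_normal.cdf(combined_z)


# Five hypothetical studies, none individually significant at 0.05.
combined = stouffer_combined_p([0.10, 0.10, 0.10, 0.10, 0.10])
print(f"combined p-value: {combined:.4f}")  # well below 0.05
```

Each study alone misses the 0.05 threshold, yet the combined evidence is clearly significant under these assumptions, because five results all leaning the same way are unlikely if the null hypothesis is true.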
- Scientific conclusions should be based on the significance of the p-value.
P-values vary from sample to sample, so a result that is significant in one study often fails to reach significance when the study is repeated. Therefore, it is important not to base a scientific conclusion solely on the significance of the p-value. “The p-value was never intended to be a substitute for scientific reasoning,” warned Ron Wasserstein, executive director of the American Statistical Association. It was simply meant as one way of supporting a conclusion.
The p-value in itself is not evil, but over-reliance on it can cloud judgment by providing a false sense of certainty over the validity of results. Being more discerning in the use of the p-value, and considering other statistical metrics like confidence intervals, are some of the proposed solutions to the p-value problem.
In the words of Ron Wasserstein, “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold.”
These are only a few of the most common errors in p-value interpretation. If you would like to learn more, take a look at these articles from Greenland et al., Goodman S, and Jennie Dusheck of Stanford Medicine.
– Written by Marisa Granados, Research Medics Editorial Desk –