Reducing Significance Threshold: Debate Over Scientific Impact
Statistical significance set at P <.05 results in high rates of false positives, "even in the absence of other experimental, procedural and reporting problems."
In a controversial article posted July 22, 2017, on the preprint server PsyArXiv, a group of 72 well-established researchers propose to improve "statistical standards of evidence" by lowering the threshold for statistical significance from P <.05 to P <.005 in the biomedical and social sciences.1 The group, led by Daniel J. Benjamin, PhD, from the Center for Economic and Social Research and Department of Economics, University of Southern California, Los Angeles, comes from the same number of institutions across the United States, Europe, Canada, and Australia, in departments as diverse as psychology, statistics, social sciences, and economics. The article was published in September 2017 as a comment in Nature Human Behaviour.2
Statistical significance set at P <.05 results in high rates of false positives, note the authors, "even in the absence of other experimental, procedural and reporting problems," and may underlie the commonly encountered lack of reproducibility.
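To see why the threshold matters, consider a minimal sketch of how the share of false positives among significant findings depends on the significance level. The function name, the prior odds of 1:10 that a tested effect is real, and the power of 80% are illustrative assumptions, not figures taken from the article; the sketch also holds power fixed across thresholds, which in practice would require larger samples at the stricter threshold.

```python
def false_positive_rate(alpha, power, prior_odds):
    """Fraction of statistically significant results that are false positives.

    alpha      -- significance threshold (chance of a false alarm when no effect exists)
    power      -- chance of detecting an effect that really exists
    prior_odds -- assumed odds that a tested effect is real (real : not real)
    """
    p_real = prior_odds / (1 + prior_odds)   # prior probability the effect is real
    p_null = 1 - p_real                      # prior probability there is no effect
    false_pos = alpha * p_null               # significant results with no real effect
    true_pos = power * p_real                # significant results with a real effect
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs: 1 real effect for every 10 null effects, 80% power.
for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha, power=0.8, prior_odds=1 / 10)
    print(f"P < {alpha}: {rate:.1%} of significant results are false positives")
```

Under these assumed inputs, roughly 4 in 10 significant results at P <.05 would be false positives, compared with about 6 in 100 at P <.005. Different assumptions about prior odds or power would shift both figures.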
In a study by the Open Science Collaboration published in Science in August 2015, 270 psychologists seeking to assess reproducibility in their field attempted to replicate 100 studies published in 2008 in 3 high-impact psychology journals.3
"Reproducibility is a defining feature of science," remarked the investigators in the Science article's introduction. Reproducibility was assessed using 5 parameters: "significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes."
Surprisingly, the researchers found that "replication effects were half the magnitude of original effects, representing a substantial decline." Replications led to significant results in just 36% of studies, and "47% of original effect sizes were in the 95% confidence interval of the replication effect size." They conclude that "variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research."
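One of the criteria above, whether an original effect size falls inside the 95% confidence interval of the replication effect size, can be sketched as follows. The sketch assumes effect sizes are expressed as correlation coefficients and uses the standard Fisher z-transformation for the interval; the function names and the numbers in the example are hypothetical and are not drawn from the replication study.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from a sample of size n."""
    z = math.atanh(r)              # Fisher z-transform of the correlation
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def original_in_replication_ci(r_original, r_replication, n_replication):
    """True if the original effect lies within the replication's 95% interval."""
    low, high = fisher_ci(r_replication, n_replication)
    return low <= r_original <= high


# Hypothetical replication: r = 0.20 with 100 participants.
print(original_in_replication_ci(0.45, 0.20, 100))  # original effect much larger
print(original_in_replication_ci(0.30, 0.20, 100))  # original effect close to replication
```

By this criterion, a replication can "contain" the original effect even when the replication itself is not statistically significant, which is one reason the different measures of reproducibility do not always agree.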
However, in a comment published several months later, also in Science, Daniel T. Gilbert, PhD, professor of psychology at Harvard University, Cambridge, Massachusetts, and colleagues argue that the large-scale replication study "contains 3 statistical errors, and provides no support for [the low rate of reproducibility in psychology studies]."4 Because results from the replication study were not corrected for error, power, or bias, the comment's authors contend, "the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high."