In spite of an undergraduate mathematics degree and statistical training in a subsequent academic fellowship, I think I had retained a relatively naive yes/no way of thinking rather than seeing the various shades of maybe. My brother-in-law is a social psychologist who taught statistics and for a time studied the factors involved in voting patterns. It all seemed way too soft for the likes of me to follow. Like many physicians, I’m afraid I just listened for the almighty yes/no p-value at the end. So when I became interested in the clinical drug trials that pepper the psychiatric literature, I was unprepared for the many ways the analyses can be manipulated and distorted. I was unfamiliar with things like power calculations and effect sizes. As I said recently, I had previously read only that little thing at the top… [the abstract] without critically going over the body of the paper, assuming that the editor and peer reviewers had already done the work of vetting the article for me.
Why Most Published Research Findings Are False
PLOS Medicine, by John P. A. Ioannidis, August 30, 2005
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
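To make that framework concrete: the chance that a "significant" finding is actually true (the positive predictive value, PPV) follows from the study's power, the significance threshold, and the pre-study odds that a probed relationship is real. Here is a minimal sketch using the no-bias, single-study case of the PPV formula from the paper; the specific powers and odds are illustrative numbers of my own choosing, not figures from any study.

```python
# Minimal illustration of the pre-study-odds arithmetic described above.
# No-bias, single-study case: PPV = (1 - beta) * R / (R - beta * R + alpha),
# where R is the pre-study odds that a tested relationship is real.

def positive_predictive_value(power, alpha, prior_odds):
    """Probability that a nominally significant finding is true."""
    beta = 1.0 - power
    return (power * prior_odds) / (prior_odds - beta * prior_odds + alpha)

# A well-powered test of a plausible hypothesis vs. an underpowered long shot
# (both parameter sets are made up for illustration).
for power, R in [(0.80, 0.25), (0.20, 0.10)]:
    ppv = positive_predictive_value(power, alpha=0.05, prior_odds=R)
    print(f"power={power:.2f}, pre-study odds R={R:.2f} -> PPV={ppv:.2f}")
```

With decent power and reasonable prior odds the PPV comes out around 0.80; drop the power to 0.20 and the odds to 1-in-10 and it falls below 0.30, which is the sense in which a claimed finding becomes more likely false than true.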
Estimating the reproducibility of psychological science
Science, by the Open Science Collaboration (corresponding author: Brian Nosek), August 28, 2015
INTRODUCTION: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.

RATIONALE: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.

RESULTS: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size [r] of the replication effects [Mr = 0.197, SD = 0.257] was half the magnitude of the mean effect size of the original effects [Mr = 0.403, SD = 0.188], representing a substantial decline. Ninety-seven percent of original studies had significant results [P < .05]. Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

CONCLUSION: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence [such as original P value] was more predictive of replication success than variation in the characteristics of the teams conducting the research [such as experience and expertise]. The latter factors certainly can influence replication success, but they did not appear to do so here…
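One of the replication criteria mentioned above, whether the original effect size falls inside the replication's 95% confidence interval, is easy to check for a correlation. Here is a small sketch using the standard Fisher z-transformation; the effect sizes and sample size below are invented for illustration and are not taken from the paper.

```python
# Hypothetical check of one replication criterion: does the original effect
# size fall inside the replication's 95% confidence interval?
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a correlation r (sample size n), via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Illustrative numbers only: an original r of 0.40, a replication r of 0.20 with n = 120.
r_original, r_replication, n_replication = 0.40, 0.20, 120
lo, hi = fisher_ci(r_replication, n_replication)
print(f"replication 95% CI: ({lo:.2f}, {hi:.2f})")
print("original effect inside replication CI:", lo <= r_original <= hi)
```

In this made-up case the original effect lands outside the replication interval, the situation that, by this particular criterion, counted against roughly half of the original findings.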
Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature
by Denes Szucs, John P. A. Ioannidis
doi: http://dx.doi.org/10.1101/071530
We have empirically assessed the distribution of published effect sizes and estimated power by extracting more than 100,000 statistical records from about 10,000 cognitive neuroscience and psychology papers published during the past 5 years. The reported median effect size was d=0.93 [inter-quartile range: 0.64-1.46] for nominally statistically significant results and d=0.24 [0.11-0.42] for non-significant results. Median power to detect small, medium and large effects was 0.12, 0.44 and 0.73, reflecting no improvement through the past half-century. Power was lowest for cognitive neuroscience journals. 14% of papers reported some statistically significant results, although the respective F statistic and degrees of freedom proved that these were non-significant; p value errors positively correlated with journal impact factors. False report probability is likely to exceed 50% for the whole literature. In light of our findings the recently reported low replication success in psychology is realistic and worse performance may be expected for cognitive neuroscience.
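For a rough sense of what those power figures mean in practice, here is a sketch of a standard power calculation for a two-sample t-test using statsmodels; the per-group sample size of 20 is an assumption I chose for illustration, not a figure from the paper.

```python
# Power of a two-sample t-test for small, medium, and large effects (Cohen's d),
# at an assumed, hypothetical sample size of 20 per group and alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    power = analysis.power(effect_size=d, nobs1=20, alpha=0.05)
    print(f"power to detect a {label} effect (d={d}) with n=20 per group: {power:.2f}")
```

With groups that small, the power to detect a small effect is in the single-digit-percentage range, which is why underpowered studies that nonetheless report significant results are so often reporting inflated effects or false positives.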