Will integrating original studies and published replications always improve the reliability of your results? No! Replication studies suffer from the same publication bias as original studies. In her guest post, Michèle B. Nuijten, who focuses on statistical errors and data manipulation in psychology, presents two solutions to this problem.
Imagine that you’re a researcher, and you’re trying to estimate the effect size of a certain phenomenon. You conduct a literature search and find a large study on the effect, and also a small replication study. How do you evaluate this information?
It turns out that the vast majority of scientists, independent of their expertise, will choose to combine both studies into an overall effect size estimate (see our survey). Their intuition seems logical: “the more information, the better.” It is also in line with a vast body of literature stating that replications improve estimates and decrease the rate of false positives. However, this intuition is false.
What these scientists fail to take into account is that published studies are heavily influenced by publication bias: statistically significant studies find their way into the literature much more easily than non-significant ones. This is worrying, because publishing only significant effects leads to overestimated effect sizes.
But wait. Isn’t this what replications are meant for? To obtain more accurate effect size estimates?
In theory, yes. However, there is evidence from meta-analyses and multi-study papers that replication studies suffer from the same publication bias as original studies. This means that both published studies and published replications contain overestimated effect sizes.
The implication is that if you combine the results of a published study with those of a published replication, your effect size estimate may actually get worse than an estimate based on the original study alone.
This works as follows
The bias in the effect size estimate of a single study depends on both the amount of publication bias and the study’s power. Studies with low power will have imprecise effect size estimates that can range from severe underestimations to severe overestimations. If you took the average of all these underpowered studies, you would still end up with a fairly accurate estimate of the true effect size – provided there is no publication bias. When there is publication bias, only the severe overestimations (the ones that reach significance) make it into the literature. Combining these effect sizes leaves you with an overestimated effect size.
On the other hand, studies with high power do not have this problem. Their effect sizes are estimated with more precision, so you will not see the severe under- and overestimations of low-powered studies; all estimated effect sizes will be centered more closely on the true effect size. This means that even when there is publication bias and only the significant studies are published, the average effect size will not be as distorted as with underpowered studies.
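This mechanism can be illustrated with a small simulation. Everything here is a made-up sketch: a hypothetical true effect of d = 0.3 and an extreme form of publication bias in which only significant results get published.

```python
import random
import statistics

random.seed(1)
TRUE_EFFECT = 0.3  # hypothetical true effect size (Cohen's d); an assumption


def mean_published_effect(n_per_group, n_studies=20000, z_crit=1.96):
    """Simulate many two-group studies and keep only the 'significant'
    ones, mimicking extreme publication bias; return the mean published
    effect size."""
    se = (2 / n_per_group) ** 0.5  # approximate standard error of d
    published = [d for d in
                 (random.gauss(TRUE_EFFECT, se) for _ in range(n_studies))
                 if d / se > z_crit]  # only significant results survive
    return statistics.mean(published)


low_power = mean_published_effect(20)    # small studies, low power
high_power = mean_published_effect(250)  # large studies, high power
print(f"true effect: {TRUE_EFFECT}")
print(f"mean published effect, low power:  {low_power:.2f}")
print(f"mean published effect, high power: {high_power:.2f}")
```

With these assumptions, the average published effect from the small, low-powered studies lands far above the true 0.3, while the average from the large, high-powered studies stays close to it.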
We can apply the above logic to the replication scenario from the first paragraph: in the literature you come across a large study on a particular effect, and a small replication study. How do you evaluate this information? If we assume that both studies are affected by publication bias, the effect size estimate in the large original study might be (slightly) overestimated, but the effect size estimate in the small replication will be much more overestimated. In this case, evaluating only the original study gives you a more accurate effect size estimate than also including the information from the replication study.
In a nutshell: a replication will increase precision (the confidence interval around your effect size estimate becomes smaller), but if its sample size is smaller than that of the original study, it also adds bias – though only when there is publication bias and power is not high enough.
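To see why adding the small published replication can hurt, here is a sketch of the scenario above. The numbers are again assumptions (true effect 0.3, only-significant-results bias), and the pooling step uses a standard fixed-effect inverse-variance combination as a stand-in for how a reader might combine the two studies.

```python
import random

random.seed(2)
TRUE_EFFECT = 0.3  # hypothetical true effect; an assumption
Z_CRIT = 1.96


def published_effect(n_per_group):
    """Draw observed effect sizes until one is significant,
    mimicking extreme publication bias; return (effect, SE)."""
    se = (2 / n_per_group) ** 0.5
    while True:
        d = random.gauss(TRUE_EFFECT, se)
        if d / se > Z_CRIT:
            return d, se


def combine(d1, se1, d2, se2):
    """Fixed-effect (inverse-variance weighted) combined estimate."""
    w1, w2 = 1 / se1 ** 2, 1 / se2 ** 2
    return (w1 * d1 + w2 * d2) / (w1 + w2)


n_sims = 5000
orig_only, combined = [], []
for _ in range(n_sims):
    d_o, se_o = published_effect(200)  # large original study
    d_r, se_r = published_effect(20)   # small replication
    orig_only.append(d_o)
    combined.append(combine(d_o, se_o, d_r, se_r))

bias_orig = sum(orig_only) / n_sims - TRUE_EFFECT
bias_comb = sum(combined) / n_sims - TRUE_EFFECT
print(f"bias, original study only:            {bias_orig:.3f}")
print(f"bias, after adding small replication: {bias_comb:.3f}")
```

Under these assumptions the combined estimate ends up with more bias than the large original study alone: the small replication is precisely the study whose published effect is most inflated.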
Two solutions to this problem
- The first solution would be to get rid of publication bias entirely. This would solve the problem immediately: no publication bias, no systematic overestimation of effects. There are some great initiatives in this direction, such as large-scale, pre-registered replication projects, but we are not there yet. And what should we do with all the studies that are already published and did not benefit from such anti-publication-bias measures?
- The second solution is to make sure that the studies you include in effect size estimates have high power. If a study has high power, the inflating effects of publication bias on the effect size estimate will be less severe. To put it more crudely: your effect size estimate will be more accurate if you completely discard studies with low power.
Although we don’t know exactly what the influence of publication bias is on the effect sizes published in scientific articles, there is evidence that publication bias is omnipresent. Therefore, it may be wise to assume a worst-case scenario.
To play it safe, researchers who are considering combining the results of several studies to estimate an effect size should change their intuition from “the more information, the higher the accuracy” to “the more power, the higher the accuracy.”
About Michèle B. Nuijten
Michèle B. Nuijten is a PhD student at Tilburg University. Her research focuses on the detection and prevention of statistical errors and data manipulation in psychological research. Twitter: @MicheleNuijten
This blog is based on the paper (submitted) “The Replication Paradox: Combining Studies Can Decrease Accuracy of Effect Size Estimates”, by Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts, Tilburg University.
 Literature on how replications suffer from publication bias:
Francis, G. (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review, 19(6), 975-991.
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120-128.