“I don’t have a ready-made dataset.” – “We don’t have the R code for our paper available.” – “I’m travelling. I will definitely send the replication data when I can clean it up a bit.” These are just some of the answers I received when asking authors for their replication data in political science. Only a few sent me replication-ready data, and almost no one sent me their code or .do file. This is a serious case of replication frustration!
“Political science is a community enterprise; the community of empirical political scientists needs access to the body of data necessary to replicate existing studies to understand, evaluate, and especially build on this work. Unfortunately, the norms we have in place now do not encourage, or in some cases even permit, this aim.” (King 1995)
This quote stems from Harvard professor Gary King. He pushed for more replication in political science over 15 years ago (Replication, Replication. PS: Political Science and Politics 28: 443–499). When I started my doctoral research in 2009, I thought things would have changed drastically by now. This is not the case.
Originally, I did not set out to ‘improve’ political science reliability standards by replicating papers. I simply wanted to learn statistics. Whenever I came across an article on my topic (foreign investment and human rights in developing countries) in journals for political economy, quantitative human rights and comparative politics, I tried to have a look at the data and, in some cases, to replicate the results. The main goal was not to ‘check’ if the authors ‘did it right’, but simply to improve my quantitative skills. I reproduced tables and figures, re-ran models, or transformed data. I wanted to find out what the ‘standard’ procedure is when working with panel data on foreign investment. (Later I mostly checked if results from STATA matched results from R.)
Data set not available on request
I found that many journals do not provide replication data sets on their webpage. Some don’t even have a link called ‘data’ or ‘replication’ or the like. I started writing polite emails directly to the authors. After all, many mention that ‘on request’ they would send all data sets and code. While I got some helpful replies, the majority either did not answer, or was not able to get the data together any more. Countless times I read they had to ‘dig up’ the data, ‘clean up’ their files, or transform the data sets into a ‘presentable’ format. Sometimes my follow-up emails were ignored.
“Without adequate documentation, scholars often have trouble replicating their own results months later. Since sufficient information is usually lacking in political science, trying to replicate the results of others, even with their help, is often impossible.” (King 1995)
Other authors replied that the data were ‘bought by their institution’ or could not be distributed due to ‘contractual agreements’. While I do understand this, I question this practice. I wonder if at least the journal reviewers of these articles received the data to check the results before they were published.
Emails I received stated:
- I will definitely send a .dta file when I can clean it up a bit.
- I will see what I can find in my files for you.
- We don’t have R code available for the imputations.
- I don’t have a ready-made set of do files and datasets, although I would be happy to collect these for you once I have access to the files.
- I’m travelling.
- I only have some of my electronic files with me during my trip.
I copied the above sentences directly from emails I received over the last few years. Unsurprisingly, I became more and more frustrated – not just because this made it much harder to learn statistics, but also because I began to question professional practice in my field. How can we understand and assess empirical results without information about the data and how models were run? How can we develop better methods and answers to research questions when we are not sure what the ‘previous research’ really found out?
When I talked to people at my department I heard that I was even lucky to work in a quantitative field, where at least some data are made available. In the qualitative world, it is even harder to get interview transcripts, field notes, or information on why certain people were interviewed and not others. Reproducibility for qualitative case studies is practically zero.
From my experience, the main problems are:
- many journals still do not publish replication data or have a replication policy
- if a journal publishes replication data, they are often hard to find
- many journals do not require authors to upload replication data before they submit the article
- many authors do not keep track of their data sets
- many authors do not keep R code or STATA .do files for their models
- many authors do not remember how they transformed their variables
- some authors do not answer emails about replication
- some authors do not follow up with ‘cleaned up’ data sets even if a gentle reminder is sent after a few months.
“Solutions to many existing problems in empirical political science are best implemented by individual authors.” (King 1995)
I do agree that authors should be responsible for making their data available; it is in their own interest. Some authors are pioneers and make sure to provide their data by uploading them to their personal webpage, for example Gary King, Harvard; M. Rodwan Abouharb, University College London; Christopher Adolph, University of Washington; Eric Neumayer, London School of Economics and Political Science; Layna Mosley, University of North Carolina at Chapel Hill; Thomas Plümper, Essex.
However, not every researcher has the time or the skills to maintain a personal webpage. At Cambridge, it can take a while to get the IT Department to update faculty webpages. A great way to get around this problem is to use data collection platforms like The Dataverse Network. The Dataverse Network also accepts data sets for book publications, which is especially useful.
Still, a lot of data are not available, whether through authors, journal webpages or data platforms. Publishers and journals need to step up here. Some journals at least accept supplementary data files, even if they do not force authors to upload them (e.g. World Development “accepts electronic supplementary material to support and enhance your scientific research”). A more fruitful way is to have a clear replication policy. The American Journal of Political Science states that “the manuscript will not be published unless the first footnote explicitly states where the data used in the study can be obtained.” The Journal of Conflict Resolution (JCR) “will not publish any articles (in print or online) until the Editor has received all the necessary replication materials.” In the JCR, each article links (under the “full text pdf”) to the data set. Other journals make it even easier to find all data in one archive, e.g. the Journal of Peace Research or International Studies Quarterly.
My replication wish list for political science
To ensure that political scientists can “understand, evaluate, and build upon a prior work” (King 1995) I propose this wish list:
- All political science journals should require that authors first upload the data sets and software code for all models used, before the submission system lets them send in an article. This will ensure that replication policies are actually implemented (which is not always the case).
- Authors should keep detailed records about data sets, models, R code and .do files for all results presented in the full text, footnotes and appendix.
- Authors should not just ‘promise’ to send data ‘on request’, but do so.
- There should be an open access Journal of Political Science Replication to boost replications not only for learning purposes but also as a valuable contribution to the field.
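The record-keeping item on this wish list can be made concrete with very little effort. What follows is only a minimal sketch in Python, not a procedure used by any of the authors or journals mentioned here: the function names `file_sha256` and `build_manifest`, the manifest fields, and the idea of attaching a free-text note to each file are all my own assumptions. It lists every file in a project folder with its size and a checksum, so that an author (or a replicator) can later confirm that the data sets and code are the same ones that produced the published results.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(project_dir: Path, notes: dict[str, str]) -> dict:
    """List every file under project_dir with size, checksum, and a note.

    `notes` maps a relative file path (hypothetical example:
    "data/panel.csv") to a short description, such as where the data
    came from or how a variable was transformed.
    """
    entries = []
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(project_dir).as_posix()
        entries.append({
            "file": rel,
            "bytes": path.stat().st_size,
            "sha256": file_sha256(path),
            "note": notes.get(rel, ""),  # empty if nothing was recorded
        })
    return {"files": entries}
```

Calling `build_manifest(Path("my_paper"), {"data/panel.csv": "raw download, logged FDI"})` (folder and file names are placeholders) returns a plain dictionary that can be saved with `json.dumps` next to the paper's files. Saving it outside the project folder keeps the manifest from listing itself on the next run.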
Why a political science replication blog?
I have exchanged many fruitful emails with researchers who were more than helpful and willing to discuss their data and models with me, and who shared their work openly. But, unfortunately, these were exceptions. On my blog I collect ideas about how to improve replication in political science and why replication is important, and I publish some personal (anonymized) correspondence – helpful and frustrating – about my own replication endeavors. This is not to offend my colleagues. I do understand that everyone is busy and might not be able to dig up old files or write lengthy emails about an old project. But I do hope that this blog contributes to a change in attitude about replication in political science. After all, in the natural sciences there is a long-standing debate about replication. They haven’t solved the problem, but they are way ahead of us.
- Symposium on Replication in International Studies Research, in: International Studies Perspectives Volume 4, Issue 1, pages 72–107, March 2003.
- King, Gary. 1995. Replication, Replication. PS: Political Science and Politics 28: 443–499.
- The Dataverse Network
- Replication debate in other fields