Thomas Leeper is a political scientist at Aarhus University and focuses on public opinion, political psychology, and experimental methods. I invited him to write a guest post because he is a contributing developer for the rOpenGov and rOpenSci open source software projects. He also wrote the very useful dvn package that connects the statistical software R and the data archive Dataverse. In his post, Leeper explains why there is no reproducibility without proper data archiving.
This is the first of his two guest posts on data archiving:
In a recent conversation with a senior scholar in political science, I brought up the issue of reproducibility and the need for open access to the data associated with published scientific articles. The topic was provoked by the recent symposium on data access and research transparency (DART) that this blog has covered extensively. I pointed out that this scholar’s data were not available in a persistent data archive. Their response was that the data were available on their academic website and “would be there unless I die”.
I appreciate when scholars make their data available. Putting it on a personal website is an easy way of sharing data, but it is ultimately suboptimal.
The data sharing strategy most convenient for a data author is rarely the strategy most convenient for the data end-user.
And while thinking about reproducibility on the timescale of one's lifetime seems prudent, thinking about data use beyond that horizon is even better, because it means that scientific contributions wrought today can bring value to scientific inquiry long into the future. How can we make lasting contributions of scientific data? Basically, we need to
- put data somewhere that will last and
- document it in a way that makes it easy for others to use.
Both of these steps are actually easy, but they require you to think a little about what it might be like for a future scientist to access your data and analysis code. That future scientist may have different software, different hardware, different training, and different research goals than you, the data creator, and they may have a lot of questions about your data. But you may already be dead.
Where should we archive our data?
This part is easy. You need to put data somewhere it will last. For decades, the go-to place for persistently storing social science data has been the Inter-university Consortium for Political and Social Research (ICPSR).
That remains a great option, but new options keep popping up, including archives running the Dataverse Network archiving software, such as the
- Harvard IQSS Dataverse Network and
- UNC-Chapel Hill’s Odum Institute.
All of these services provide a persistent data archive, meaning that archived data are promised to remain available for as long as the hosting institution can support them. And all are backed up by LOCKSS, a Stanford-created distributed network of backups, which means the data are safe even if the original server hosting them is physically destroyed. Persistent data archives promise a lifespan for your data that far exceeds your own.
These sites also share other nice properties. They provide DOIs: persistent pointers to your data regardless of where it lives on the internet. This makes data citable, and Dataverse, in particular, stores data under version control, so corrections can be made to published data while still preserving the replicability of results based on earlier versions. All three sites also integrate beautifully with the open source statistical software R (thanks to the work of the rOpenSci project), making it easy to archive data as part of the usual scientific workflow. (Full disclosure: I'm the author of the dvn package that connects R and Dataverse.)
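To give a sense of what that R workflow can look like, here is a minimal sketch of a deposit using the dvn package. The function names follow dvn's documented study-creation workflow; the archive URL, credentials, dataverse name, and file names are placeholders, not real values, and a release like this requires a real account on the target archive:

```r
# Sketch: deposit replication data in a Dataverse Network archive from R
# using the dvn package. All URLs, credentials, and file names below are
# placeholders for illustration only.
library("dvn")

# Point dvn at an archive and supply deposit credentials
options(dvn = "https://thedata.harvard.edu/dvn/",
        dvn.user = "username",
        dvn.pwd = "password")

# Build study-level metadata and create a draft study in your dataverse
metadata <- dvBuildMetadata(title = "Replication Data for My Article")
study <- dvCreateStudy("mydataverse", metadata)

# Attach the data and analysis files, then release the study publicly;
# the archive assigns the study a persistent identifier
dvAddFile(study, filename = c("mydata.csv", "analysis.R"))
dvReleaseStudy(study)
```

Because this runs inside the same session as your analysis, archiving becomes one more step in the script rather than a separate chore.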
With such great online resources available, there’s no reason to distribute data on one’s own website. Your data are in better hands at any of these sites.
Which files to archive?
Read more about what kinds of files to archive, and how to be a better data scientist, in a second guest post by Thomas Leeper [coming soon].
Thomas Leeper is currently a postdoc at Aarhus University and previously received his PhD in Political Science from Northwestern University. His research focuses on public opinion, political psychology, and experimental methods, and he is a contributing developer for the rOpenGov and rOpenSci open source software projects. You can follow him on Twitter (@thosjleeper) and at GitHub.