Thomas Leeper, a political scientist at Aarhus University, recently wrote about where to store your replication data. In his second post, he explains what kind of data to archive, and why that makes you a better scientist. His post is packed with concrete steps and state-of-the-art software tips.
Thomas Leeper is a political scientist at Aarhus University. This is his second guest post on data archiving:
What should we archive?
In order to make our scientific efforts reproducible, we need to think like the future users of our data. What information and resources do they need available to them in order to reproduce our results? It’s easy to imagine yourself in that position. Just think about the last time you wanted to look at someone else’s data and replicate their results. (If you’ve never been in that situation, just try it sometime for fun.)
- Were the data in an open, non-proprietary format (as opposed to stored in a way that you couldn’t understand or couldn’t open)?
- Was there a codebook describing what the data actually were?
- Was there a complete code file containing the code that translated data into published analyses?
- Was there a record of what versions of software were used in those analyses?
If you’ve ever been in this situation, a “no” answer to any of these questions likely caused problems for you in replicating the original analysis or using the data to other ends.
Once you’ve experienced those challenges as a data user, you know that they should guide your behavior as a data producer. Just putting the Stata file (version unknown) of your data on your website without any supplemental information means not only that others may have difficulty understanding it, but also that future users may be unable to open it at all. Ease for the data creator is inversely proportional to utility for the end user.
Here’s a rough outline of what your persistently archived data package should contain:
1. Data
Data need to be shared in an open scientific format. Most political science data are simple and rectangular, so comma-separated value (CSV) or tab-separated value (TSV) files are the obvious choice. For larger datasets, fixed-width formats trade off ease of use against file size; binary formats like NetCDF might also work in specific applications.
Stata, SAS, SPSS, and Excel files should be verboten. (Note, for example, that R will not read Stata files beyond version 11.)
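Converting a proprietary file to an open format is usually a one-line job. Here is a minimal sketch in R, assuming a hypothetical Stata file named survey.dta:

```r
# A minimal sketch: convert a proprietary Stata file to plain CSV for archiving.
# "survey.dta" and "survey.csv" are hypothetical file names.
library(foreign)                                   # ships with R; reads older Stata formats
dat <- read.dta("survey.dta")                      # read the proprietary file
write.csv(dat, "survey.csv", row.names = FALSE)    # write an open, plain-text copy
```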
Beyond sharing your data, good practice is to cite the specific version of the data you use in case changes are made to the dataset after you use it (e.g., because errata are published on the original dataset).
2. Metadata
By moving your data to a file format like CSV, you necessarily lose all metadata – the long-form variable descriptions and possibly variable labels that help users make sense of your data. I was taught in graduate school to write a Word (.doc) file containing all of this. A plain text file would probably be better.
Recommended current practice, however, is to create a Data Documentation Initiative (DDI) file. Creating one by hand is tedious, but programs like SledgeHammer or Colectica will make them for you. Dataverse also generates DDI automatically when data are archived.
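If generating full DDI is not feasible, even a minimal machine-readable codebook is far better than nothing. A sketch in R, continuing from the hypothetical data frame dat above (file names are illustrative):

```r
# A minimal sketch of a bare-bones codebook: one row per variable,
# recording its name, type, and number of distinct values.
# Long-form descriptions would still need to be added by hand.
codebook <- data.frame(
  variable = names(dat),
  class    = vapply(dat, function(x) class(x)[1], character(1)),
  n_unique = vapply(dat, function(x) length(unique(x)), integer(1))
)
write.csv(codebook, "codebook.csv", row.names = FALSE)
```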
3. Analysis
If you’re publishing replication data, you need to include your analysis files. These files should produce every number, table, and graphic in the published article. Most scientists have such files on hand, but making them usable to others requires ensuring that the files
- are complete (i.e., contain every analysis, with no point-and-click operations),
- are organized in a comprehensible way (i.e., it shouldn’t be 10,000 lines of code in one file; see the master-script sketch after this list),
- and are carefully documented to describe what you did.
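One common way to organize this is a short master script that runs the whole pipeline in order. A minimal sketch in R, with hypothetical file names:

```r
# A minimal sketch of a master replication script; each sourced file handles one
# stage, so others can follow (or rerun) the pipeline step by step.
set.seed(20140504)              # fix randomness so simulations and bootstraps reproduce exactly
source("01_clean-data.R")       # reads raw CSVs, recodes variables, writes the analysis dataset
source("02_models.R")           # fits every model reported in the paper
source("03_tables.R")           # produces every table, saved as plain text
source("04_figures.R")          # produces every figure, saved as PDF/PNG
```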
On this last point, Jake Bowers has a terrific article describing how to document code. Even if your code doesn’t work in the future, your comments should supply enough information for others to recreate what you did from scratch in the language of their choice.
4. README
A README helps the end user make sense of the complete package of archived materials. Sharing data and code is great, but it can be hard to make sense of because each of us uses different coding styles, different software, and different scientific workflows. Describing how the complete package of files comes together to yield the published results can be really helpful.
5. Software
Software changes all the time and those changes might be consequential for reproducibility. It might seem weird to write down that you produced your results on May 4, 2014 using Windows 7, R v3.1.0, and specific versions of add-on packages, but this information could be vital to reproducibility. This information could go in the README, the code, and/or your original published article. Fully citing software in the article has the nice advantage of giving appropriate credit to scientists who often thanklessly produce the software we use every day.
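In R, for instance, recording this information takes a single line. A minimal sketch (the output file name is illustrative):

```r
# Capture the exact software environment used for the analysis:
# sessionInfo() reports the R version, operating system, and loaded package versions.
writeLines(capture.output(sessionInfo()), "session-info.txt")

# citation("foreign")  # prints a ready-made reference for citing a package in the article
```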
Conclusion: Be a better scientist
This might all seem like a lot of work, but distributing a package of reproducible research files will make you a better scientist and make your scientific output more credible and longer lasting. When you think about data archiving as part of the output of the scientific process, you’ll work better.
For example, when you focus on documenting the production of each table and figure in your paper, you’ll learn to stop using programs like Excel or other graphical user interfaces to produce results because they are fundamentally not reproducible.
You’ll also help yourself out: if you ever find yourself as the future end-user of your own data, you’ll appreciate having a complete package of files that are easy to make sense of. And putting all of this material out for the world to see should help you feel confident in your results. Even if someone challenges your findings, at least you’ve been honest about how you came to your results and been willing to let others have a look.
It’s an honorable part of science to have your work challenged; it’s fundamentally dishonest to hide your data and methods. And if you’ve made your data open and analyses reproducible in the long-term by storing them in a persistent data archive, your data and code will be there to defend your work even if you’re long gone.
Read Thomas Leeper’s first post on where to upload data.
Bio
Thomas Leeper is currently a postdoc at Aarhus University and previously received his PhD in Political Science from Northwestern University. His research focuses on public opinion, political psychology, and experimental methods, and he is a contributing developer for the rOpenGov and rOpenSci open source software projects. You can follow him on Twitter (@thosjleeper) and at GitHub.