How to make your work reproducible

When reproducing published work, I’m often annoyed that methods and models are not described in detail. Even with my own projects, I sometimes struggle to reconstruct everything after a longer break. An article by post-docs Rich FitzJohn and Daniel Falster shows how to set up a project structure that makes your work reproducible.

“The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.” (FitzJohn and Falster on the Nice R Blog)

To get that “mix” into a reproducible format, FitzJohn and Falster, from Macquarie University in Australia, suggest using the same template for all projects. Their goal is to ensure the integrity of the data, keep the project portable, and make it easier to reproduce your own work later. Their example is in R, but the approach works in any other software as well. Here’s their example of subfolders:

[Screenshot: FitzJohn and Falster’s example project folder structure]

This is how they describe their folders:

  • The “R” folder contains various files with functions – in STATA one could put the .do files here.
  • The “data” directory has the data. They stress that the data are “treated as read only”. This means that the originally downloaded or coded data go in there and remain unchanged. All changes, e.g. variable transformations, are made in the R code. The advantage is that anyone else (and you yourself) can start from scratch with the same original data file.
  • The “doc” directory contains the paper. This can be Word documents, LaTeX or whatever you use. I personally save versions as v01, v02, etc.
  • The “figs” folder has all figures. They are created by the software and are usually PDF or JPEG files.
  • The “output” folder has output and derived data sets. I sometimes save data sets ‘on the go’ (after logging all variables, or when creating a master table with all kinds of variables) to be able to go back to them or send them to collaborators.
  • “Analysis.R” is the actual script that runs all functions (from the R folder). This is R-specific: you write a function in a file and then ‘source’ that file to make it available in R. In my own work, I have all model specifications in a function file, and the analysis script ‘calls’ these functions and creates the output (see the sketch below this list). This way the analysis file is less messy. In STATA this could be the main .do file, or you could just keep all STATA code in one separate folder.
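
To make this concrete, here is a minimal sketch of what such an Analysis.R script could look like. The file and function names (R/clean_data.R, data/survey.csv, clean_data() and so on) are hypothetical examples of mine, not the authors’; the point is that the functions live in R/, the raw data in data/ stay untouched, and everything derived ends up in output/ or figs/:

    # Analysis.R -- minimal sketch; all file and function names are hypothetical

    # Load the functions kept in the R/ folder
    source("R/clean_data.R")   # defines clean_data()
    source("R/models.R")       # defines fit_main_model()

    # Read the raw data (treated as read-only, never overwritten)
    raw <- read.csv("data/survey.csv")

    # All transformations happen in code, not in the data file
    dat <- clean_data(raw)

    # Fit the model, then write derived data and figures to output/ and figs/
    fit <- fit_main_model(dat)
    write.csv(dat, "output/survey_clean.csv", row.names = FALSE)

    pdf("figs/main_effect.pdf")
    plot(fit)
    dev.off()

Deleting output/ and figs/ and re-running the script should rebuild every derived file from the raw data alone – that is the test of whether the setup really is reproducible.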

Read their full blog post “Designing projects”. It was re-blogged on Rbloggers (where I found it) and Revolutions.

Alternative: ProjectTemplate in R

When I tweeted this, Chris Hanretty, Lecturer in Politics at the University of East Anglia, pointed me towards another, more automated template [@chrishanretty]: ProjectTemplate in R, a “system for automating the thoughtless parts of a data analysis project”.


You install the package, load it with library('ProjectTemplate') and create a new project, and a whole series of subdirectories is created:

[Screenshot: the subdirectories created by ProjectTemplate]

I have not used it myself, but the way I understand it, every time you open a project, all data are automatically loaded in R and you’re ready to go. Similar to the structure above, the parts of a project are stored in specific subdirectories. It seems to take a bit of time to set up and get into, but from then on you save time because everything is automated.
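
For completeness, here is roughly what that workflow looks like, based on the package documentation rather than my own use (the exact directory names may differ between versions):

    # Sketch of the ProjectTemplate workflow, taken from its documentation
    install.packages("ProjectTemplate")   # only needed once
    library("ProjectTemplate")
    create.project("my-analysis")         # builds the directory skeleton shown above

    # Later, from inside the project directory:
    setwd("my-analysis")
    library("ProjectTemplate")
    load.project()   # reads the config, loads packages, and imports the files in data/

The automatic loading described above is what load.project() does each time you come back to the project.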

Read more about ProjectTemplate and check its mailing list for troubleshooting.

More on automated coding

There’s also a video on structuring your projects, “R, ProjectTemplate, RStudio and GitHub: Automate the boring bits and get on with the fun stuff”, on Rbloggers.

If you work with R, you might also want to look into knitr and Sweave.
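
The idea behind knitr is that code, results and text live in one document, so the numbers and figures in your write-up can never drift away from the code that produced them. A minimal, purely hypothetical example.Rmd might look like this:

    This report summarises the survey data.

    ```{r load-data}
    dat <- read.csv("data/survey.csv")
    summary(dat)
    ```

    ```{r main-figure}
    plot(dat)
    ```

Running knitr::knit("example.Rmd") executes the chunks and weaves their output into the document; Sweave does the same for LaTeX (.Rnw) files.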

