When reproducing published work, I’m often annoyed that methods and models are not described in detail. Even with my own projects, I sometimes struggle to reconstruct everything after a longer break. An article by post-docs Rich FitzJohn and Daniel Falster shows how to set up a project structure that makes your work reproducible.
“The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.” (FitzJohn and Falster on the Nice R Blog)
To get that “mix” into a reproducible format, post-docs Rich FitzJohn and Daniel Falster from Macquarie University in Australia suggest using the same template for all projects. Their goal is to ensure the integrity of their data, keep the project portable, and make it easier to reproduce your own work later. This works in R, but just as well in any other software. Here’s their example of subfolders:
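The subfolder layout (reconstructed here from their description, with illustrative comments) looks roughly like this:

```
project/
├── R/            # function definitions
├── data/         # raw data, treated as read-only
├── doc/          # the manuscript
├── figs/         # generated figures
├── output/       # processed data and other output
└── analysis.R    # the script that ties everything together
```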
This is how they describe their folders:
- The “R” folder contains various files with functions – in STATA one could put the .do files here.
- The “data” directory holds the data. They stress that the data are “treated as read only”: the originally downloaded or coded data go in there and remain unchanged. All changes, e.g. variable transformations, are made in the R code. The advantage is that anyone else (and you yourself) can start from scratch with the same original data file.
- The “doc” directory contains the paper. This can be Word documents, LaTeX or whatever you use. I personally save versions as v01, v02 etc.
- The “figs” folder holds all figures. They are created by the software and are usually PDF or JPEG files.
- The “output” folder holds output and processed data sets. I sometimes save data sets ‘on the go’ (after logging all variables, or when creating a master table with all kinds of variables) to be able to go back to them, or to send them to collaboration partners.
- “Analysis.R” is the actual script that runs all functions (from the R folder). This is R-specific: you write a function in a separate file and then ‘source’ it to run it in R. In my own work, I keep all model specifications in a function file, and the analysis script ‘calls’ these functions and creates the output. This way the analysis file is less messy. In STATA this could be the .do file, or you could keep all STATA code in one separate folder.
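To make the ‘source and call’ pattern concrete, here is a minimal, self-contained sketch. The file name `R/functions.R` and the function `double_values` are illustrative, not from the original post:

```r
# A tiny version of the functions-plus-script pattern.
# In a real project, analysis.R would be this script.

# make sure the R/ folder exists
dir.create("R", showWarnings = FALSE)

# R/functions.R holds the function definitions ...
writeLines("double_values <- function(x) x * 2", "R/functions.R")

# ... and the analysis script 'sources' them, then runs the analysis
source("R/functions.R")
result <- double_values(c(1, 2, 3))
print(result)  # [1] 2 4 6
```

Keeping the definitions in `R/` and only the calls in the analysis script is what keeps the top-level file short and readable.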
Alternative: ProjectTemplate in R
When I tweeted this, Chris Hanretty, Lecturer in Politics at the University of East Anglia, pointed me towards another, more automated template [@chrishanretty]: ProjectTemplate in R, which is a “system for automating the thoughtless parts of a data analysis project”.
After loading the package with library('ProjectTemplate') and creating a project, a whole series of subdirectories is created:
I have not used it myself, but as I understand it, every time you open a project, all data are automatically loaded into R and you’re ready to go. As in the structure above, the parts of a project are stored in specific subdirectories. It seems to take a bit of time to set up and get into, but from then on you save time because everything is automated.
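Based on the package’s documentation, the basic workflow might look like this (assuming ProjectTemplate is installed from CRAN; ‘my-analysis’ is a placeholder name):

```r
# One-time setup: scaffold a new project (creates the subdirectories)
library('ProjectTemplate')
create.project('my-analysis')

# In every later session: move into the project and load everything
setwd('my-analysis')
load.project()  # reads the config, loads packages and any data in data/
```

After `load.project()`, the data sets placed in the project’s data directory should be available as R objects, which is the “automatically loaded” behaviour described above.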
More on automated coding
There’s also a video on structuring your projects, “R, ProjectTemplate, RStudio and GitHub: Automate the boring bits and get on with the fun stuff”, on R-bloggers.