I recently wrote about an article by post-docs Rich FitzJohn and Daniel Falster, who show how to set up a project structure that makes your work reproducible. Here, they explain why they had “one big, ugly, long, unreadable script” before they started a reproducible filing system, and how they are trying to persuade their own lab to change.
“The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.” (FitzJohn and Falster on the Nice R Blog)
Their goal is to ensure the integrity of the data and the portability of the project, and to make it easier to reproduce your own work later. This works in R, but equally well in any other software. Here’s their example of subfolders:
Read details about the structure here.
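As a rough illustration of what such a layout might look like, here is a minimal Python sketch (the idea is not tied to R). The folder names below are assumptions chosen for illustration, not necessarily the ones FitzJohn and Falster use, and `create_project` is a hypothetical helper.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names, chosen for illustration only.
LAYOUT = ["R", "data", "doc", "figs", "output"]


def create_project(root):
    """Create the project skeleton under root and return the created folders."""
    created = []
    for name in LAYOUT:
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list what was made.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(Path(tmp) / "my_project"):
            print(folder.name)
```

Everything the project needs then lives under one root folder, which is what makes it portable.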
When did you start using a project template?
Daniel Falster [DF]: “About a year ago, after seeing how Rich set up his code in a collaborative project we were both involved with. At the time, I thought ‘this is great, why aren’t I doing something like this?’ Using a sensible layout for files is such a simple idea, but not something most of us think about, it seems.”
How did you work before you set up such templates?
[DF]: “I mostly had one big, ugly, long, unreadable script. It worked, it was somewhat reproducible, but was difficult to read, navigate, and use. Also, my projects were often not standalone, i.e. the data may be off somewhere else on my hard drive, and my scripts had lots of hard-coded directory names in them. That kind of setup is quite fragile.”
[RF]: “I always dread going into really old projects. The general problem with my older projects is that the structure is too ‘flat’, and there is nothing to remind me of what different files are really meant to do. And even with the sort of set up that we’re advocating things get out of hand and need tidying up quite frequently.”
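One way to avoid the hard-coded directory names Daniel mentions is to resolve every path relative to the project root. The sketch below is only an illustration: `find_project_root` is a hypothetical helper, and treating a `data` folder as the marker of the root is an assumption.

```python
from pathlib import Path
import tempfile


def find_project_root(start, marker="data"):
    """Walk upwards from start until a folder containing `marker` is found."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no folder containing '{marker}' above {start}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "my_project"
        (project / "data").mkdir(parents=True)
        scripts = project / "R"
        scripts.mkdir()
        # A script anywhere inside the project can locate the data folder
        # without knowing where the project sits on the hard drive.
        root = find_project_root(scripts)
        print(root / "data" / "measurements.csv")
```

Because nothing depends on an absolute location, the project keeps working when it is moved or copied to another machine.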
Why should people use a project template?
[DF]: “Using a predictable and sensible layout helps
– you find files when you need them
– others navigate your project
– ensure the project runs as a standalone project, by encouraging you to put all relevant materials in that directory
– keep your data safe, by putting it in a ‘read-only’ folder
– ensure suitable materials are tracked in version control (e.g. data and scripts) and keep other things out (generated outputs)”
[RF]: “A couple of reasons stand out for me; the first is that data is really hard-fought and precious (definitely true for ecological data, but I suspect other fields too). Often people will edit their core data files to modify bits of data, or add a transformation in. Quickly it’s not clear what was original collected data, and what parts were really ‘analysis’.”
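The ‘read-only’ data folder can be enforced at the file-system level, so that raw data is never edited by accident. This is a minimal sketch under assumptions: `make_read_only` is a hypothetical helper, and removing write permission is just one possible way to protect the files.

```python
from pathlib import Path
import stat
import tempfile

# Write-permission bits for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(data_dir):
    """Remove write permission from every file under data_dir."""
    locked = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
            locked.append(path)
    return locked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "raw.csv").write_text("site,height\nA,1.2\n")
        for path in make_read_only(data):
            print("locked:", path.name)
```

Transformations then happen in scripts that read from the data folder and write their results elsewhere, which keeps the line between collected data and analysis visible.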
Have you ever not managed to reproduce your own work, and what happened?
[DF]: “Many times! That’s why I’m focussing more on reproducibility in my current workflow. So far there haven’t been any serious consequences, but from now, I want all new projects I lead to be fully reproducible.”
[RF]: “Not since getting more serious about organising my projects. I still routinely go into old projects (3-4 years old) and grab bits of code, especially for creating figures. My general aim is that every project I currently work on could be entirely regenerated by someone else by running one command — by this I mean all analyses, generated figures, tables. I’ve only got one published paper that meets this, partly because of unresolved issues about how to make analyses that require weeks or months to run reproducible.”
Does your whole lab use such templates?
[RF]: “No; this sort of thing doesn’t seem to be enforced, or even discussed, amongst ecologists. It’s not a criticism of people, but there’s not a culture of this yet.”
[DF]: “No, but we’re encouraging some of them to, via a ‘nice R code’ course we are running. The nice R code blog was started to distribute material from that course to a broader audience. The post on project setup was from week 1 of the course.”
What is the status on reproducibility in ecology and evolution at the moment, and what else – apart from a project template – can improve that?
[DF]: “There are hardly any papers which are truly reproducible, but momentum is gathering, quickly. By end of next year, I hope there will be a large number (>200) of papers which are truly reproducible, i.e. where the data, the scripts, and the manuscript text are all open, so you can just download the git repo and recreate the entire work. Once that starts happening, I’m hoping journals will start hosting, and even demanding, these materials for publication.”
[RF]: “Reproducibility is gaining visibility within ecology and evolution, but it’s still a small niche. Greg Wilson (who runs Software Carpentry, and who cares deeply about these things) has pointed out that for most people, reproducibility just isn’t that important. This is probably true from the point of view of published data, and the litany of unreproducible results supports it. However, working this way saves me time, so that’s why I do it. If it helps other people that’s great, too.
The project structure is only one of a bunch of important pieces that promote reproducibility. The other key components in my mind are (1) version control (we use git), (2) testing (which helps identify behaviour changes early) and (3) automated build scripts. We use “make” mostly for this. There are people far better at this than us who have started using “Travis CI” for continuously assessing that a project works as soon as changes are made.”
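The idea behind an automated build script can be sketched in a few lines. A tool such as make regenerates an output when it is missing or older than the files it depends on. The Python below only imitates that timestamp check for illustration; `needs_rebuild` and the file names are hypothetical, and this is not how Rich and Daniel's own build scripts are written.

```python
from pathlib import Path
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(dep).stat().st_mtime > target_time for dep in dependencies)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.R"
        figure = Path(tmp) / "figure.pdf"
        script.write_text("# analysis code\n")
        print(needs_rebuild(figure, [script]))  # figure does not exist yet

        figure.write_text("placeholder output\n")
        # Set explicit timestamps so the comparison is unambiguous.
        os.utime(script, (100, 100))
        os.utime(figure, (200, 200))
        print(needs_rebuild(figure, [script]))  # figure is newer than script
```

Chaining such rules from raw data to final figures is what lets a whole project be regenerated with one command.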
Do you use library('ProjectTemplate'), the automated R version that creates a reproducible structure?
[DF]: “No, and I didn’t know about it before now. It looks interesting, although the number of folders is possibly a bit overwhelming for many students. While we are encouraging people to adopt a good layout for their project, we don’t have fixed ideas on what this should be. You should come up with a layout that suits you, and will be easy for one of your colleagues to follow. But I don’t think we need to be too prescriptive.”
[RF]: “I hadn’t seen this before! Looks like a great way to get started. One thing I prefer about our structure is that it naturally transitions into the R package structure, which is helpful if packages are something that arises from your research. I agree totally with Daniel here though; the structure should not be too prescriptive: it’s there to help, not get in the way. In that sense, so long as whatever people use is documented and easy to follow, if it works for them it’s probably good.”