I recently posted a list of data repositories that facilitate reproducibility. I have now come across a specialized data archive at Yale. This one seems different: it aims to verify the code and data that are submitted. Is such quality control feasible?
In the blog post “The Role of Data Repositories in Reproducible Research”, Limor Peer discusses how data repositories can take on responsibilities beyond being a mere archive. Peer describes the ISPS Data Archive at Yale, which concentrates on social science research data from experiments and contains data on topics such as voting, public opinion, taxation and advertising effects.
What sets this repository apart is its quality control. Peer writes:
“We think that repositories do have a responsibility to examine the data and code we receive for deposit before making the files public, and that this data review involves verifying and replicating the original research outputs. In practice, this means running the code against the data to validate published results.”
At the ISPS Data Archive, data are reviewed by “examining the data and code received for deposit and verifying and replicating the original research outputs”. This is indeed a big difference compared to other, larger repositories, such as the Harvard Dataverse. A large repository that allows scholars to upload replication data or teaching materials quickly and easily can hardly provide such a rigorous review process.
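To make "running the code against the data to validate published results" a bit more concrete, here is a minimal sketch of the final comparison step in Python. It is not the ISPS workflow, which the post does not describe in detail. The function name `compare_results`, the statistic names, the numbers and the tolerance are all hypothetical and invented for illustration. The sketch assumes the deposited code has already been run and its output collected into a dictionary.

```python
import math


def compare_results(published, reproduced, rel_tol=1e-6, abs_tol=1e-9):
    """Compare reproduced statistics with published ones.

    Returns a dict mapping each published statistic name to
    "match", "mismatch" or "missing".
    """
    report = {}
    for name, expected in published.items():
        if name not in reproduced:
            # The rerun did not produce this statistic at all.
            report[name] = "missing"
        elif math.isclose(expected, reproduced[name],
                          rel_tol=rel_tol, abs_tol=abs_tol):
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


# Hypothetical values, invented for illustration only.
published = {"turnout_effect": 0.081, "std_error": 0.026, "n_obs": 1200}
reproduced = {"turnout_effect": 0.081, "std_error": 0.031}

report = compare_results(published, reproduced, rel_tol=1e-3)
for name, status in report.items():
    print(f"{name}: {status}")
```

A tolerance is used instead of exact equality because published tables are rounded and software versions can differ slightly. How strict that tolerance should be is a judgment call for the curator.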
Can curation work with large repositories?
The data curation approach of the ISPS Data Archive is admirable and a good step in the right direction. So far, even when data are available online, scholars often find it very difficult to replicate existing work because code is missing or codebooks are unclear.
However, there is a tension between rigorous quality control on the one hand and accessibility, quick publication of data, and appeal to a wide scholarly community on the other. The ISPS repository contains only a comparatively small number of datasets and covers specialized topics. For a larger repository that has to remain quick and easy to use, such a review process is not feasible, at least not at the moment. Peer states: “A larger lab, greater volume of research, or simply more data will require greater resources and may prove this level of curation untenable.”
Maybe quality control and detailed data curation can still be introduced to larger repositories step by step. Student projects could include replication assignments with data checks, code checks and so on, and the entries in the respective Dataverse where the data are stored could mark those studies as “curated” or “checked by…”. Integrating forms of quality control similar to those of the Yale repository into larger repositories via teaching could be a crucial way forward to ensure the reproducibility of social science studies more generally.