A guest post by Limor Peer and Ann Green.
This we know: Sharing research data with the goal of advancing science is slowly becoming the norm in many disciplines, and a rich ecosystem has sprung up in recent years to support that effort. Yes, technological and cultural challenges remain, but anyone watching this space would agree that much progress has been made.
It is a positive sign that one aspect of research data sharing and preservation – data review – is increasingly part of the conversation. Data review can be defined as “a process whereby data and associated files are assessed and required actions are taken to ensure files are independently understandable for informed reuse. This is an active process, involving a review of the files, the documentation, the data, and the code.”
Why data review is crucial
As argued elsewhere, data review has taken on a new urgency because of the growing variety of means available for depositing and accessing data. Data are increasingly recognized as accessible, persistent, and citable research objects, and with that recognition comes the requirement that data be interpretable and usable from the get-go. Data review is crucial to the appropriate dissemination and preservation of scientific data for two reasons.
- First, data review serves research transparency. Transparency about the research process addresses the credibility crisis in science by answering the question, Can I trust these results? This is especially important in disciplines where the data or the recorded event cannot easily be recreated. Transparency is a starting point for any future data use and analysis – including generating hypotheses, integrating with other data, comparing with other studies – which is the foundation of science.
- A separate but related reason for data review is the long-term usability of shared data. A concern with usability lends itself to the view that such a review should be done before the data are shared – for example, examining data files for sensitive data or private information, keeping track of data versions, and optimizing file formats for long-term preservation. The usability lens also strongly suggests that data repositories play an important role.
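To make the pre-sharing checks mentioned above a little more concrete, here is a minimal Python sketch of what an automated first pass might look like. It is an illustration only, not a description of any existing tool: the function name `review_file`, the list of sensitive-sounding column names, and the set of "preservation-friendly" file extensions are all assumptions made for this example, and a real review would rely on a repository's own policies and on human judgment.

```python
import csv
import hashlib
import io
from pathlib import PurePosixPath

# Hypothetical lists for illustration; a real review policy would define its own.
SENSITIVE_HINTS = ("name", "email", "phone", "ssn", "address", "birth")
PRESERVATION_FRIENDLY = {".csv", ".txt", ".json", ".xml"}


def review_file(filename: str, content: bytes) -> dict:
    """Return a small review report for one file submitted for sharing."""
    suffix = PurePosixPath(filename).suffix.lower()
    report = {
        "file": filename,
        # A checksum gives each version of a file a stable fingerprint,
        # which helps with keeping track of data versions.
        "sha256": hashlib.sha256(content).hexdigest(),
        # Flag whether the format is on the (assumed) preservation-friendly list.
        "open_format": suffix in PRESERVATION_FRIENDLY,
        "flagged_columns": [],
    }
    if suffix == ".csv":
        # Only the header row is inspected; cell values are not scanned.
        text = content.decode("utf-8", errors="replace")
        header = next(csv.reader(io.StringIO(text)), [])
        report["flagged_columns"] = [
            col for col in header
            if any(hint in col.lower() for hint in SENSITIVE_HINTS)
        ]
    return report


if __name__ == "__main__":
    sample = b"respondent_id,email,age\n1,a@example.org,34\n"
    print(review_file("survey.csv", sample))
```

A check like this can only raise flags for a human reviewer to follow up on. It says nothing about whether the documentation is adequate or whether the code reproduces the results, which remain the harder parts of data review.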
Data review is gaining ground
Here’s some evidence that data review is gaining ground:
- Work is underway to develop software and tools that help with data review. These tools present an attractive solution to actors – e.g., repositories, journals, libraries, researchers – who may have limited capacity to engage in data review. Some early tools seem to focus on certain key review tasks such as format transformation or standard-based metadata generation or enhancement. More comprehensive tools are in the works. To the extent that such tools are widely adopted and easily integrated into the data publication workflow, they can help make a significant difference in whether and how data (and additional files) are reviewed.
- Journals are implementing data policies, with varied results in terms of standards, enforcement, and review. There is some evidence that they are increasingly considering their role in reviewing data and code files they require. Dryad’s integration with PLOS makes it easy for reviewers to access the data. In political science, the journal Political Analysis states that “replication materials” are “reviewed for completeness.” The Dataverse journal plugin comes with a boilerplate data policy template which includes a fairly comprehensive review checklist. Nature’s journal Scientific Data publishes “Data Descriptors” which are publications dedicated to describing data; the review includes the data by definition. Note that unlike repositories, journals are motivated by quality, not necessarily preservation.
Recent initiatives pushing for improved data review
The issue of data review is being given time at professional meetings, in summer courses, and at conferences.
- At last year’s International Digital Curation Conference in San Francisco, a data publication workshop included a discussion of review and spurred some noteworthy thoughts. And at this year’s meeting, the issue of “unusable data” – and what “a responsible view of that” might look like – was singled out at a panel on International Perspectives on Open Research and Curation.
- At the Research Data Alliance, the publishing data workflow group is analyzing “a representative range of existing and emerging workflows and standards for data publishing.” The initial analysis reported at the RDA Fifth Plenary Meeting in San Diego earlier this month found that discipline-specific repositories have the most rigorous ingest and review processes, while “more general institutional repositories have a lighter touch.” The work builds on the UK-based PREPARDE project, which examined the issue of data (peer) review, including thought-provoking recommendations to connect data review with data management planning and to connect scientific review, technical review, and curation.
- Also at the RDA Fifth Plenary, the challenge of proper documentation for dynamic datasets or for versions of programming code was discussed at several sessions. The data citation group is working on making dynamic data citable by easy DOI assignment to subsets of data, which can be linked back to the original. The RMap project attempts to link and track disparate but related research objects. In addition to this interesting work, a few new groups are forming. We believe that this focus on the evolving nature of research objects will underscore the need to prioritize data review and to more clearly delineate roles and responsibilities.
- On May 27, 2015 Dryad will be holding a community meeting titled, “Taking a closer look at data.” The day-long conference includes a session devoted to data review as an “emerging issue.” The session will “consider whether and how data review may be more widely adopted by Dryad’s community in the future to improve the value of data for reuse.”
- The popularity of such workshops as the ICPSR summer course on Curating and Managing Research Data for Re-use is encouraging. The workshop will be offered for the third time this year and is expected to fill to capacity. The Digital Curation Centre’s Research Data Management Forum is also offering a one-day event this spring on preparing data for deposit. Such workshops have great potential to engage researchers.
We’d like to think that these initiatives and conversations are an indication that data review is gaining ground. A year ago, together with Libbie Stephenson, we wrote: “We call on the community as a whole to commit to data review by practicing it and by demanding to know when it has been done. Our hope is that it becomes a cornerstone in standard approaches to data curation and will become common practice once appropriate tools and frameworks are in place.” Today, we feel optimistic that the community is headed in the right direction.
What’s next? Tackling software issues and code review. Stay tuned…
About Limor Peer
Limor Peer, Ph.D., is Associate Director for Research at the Institution for Social and Policy Studies (ISPS) at Yale University. She oversees research infrastructure and process at ISPS, including the Field Experiment Initiative, which encourages field experimentation and interdisciplinary collaboration in the social sciences at Yale. In this capacity, she has led the creation of a specialized research data repository (the ISPS Data Archive) and is currently involved in campus-wide efforts relating to research data sharing and preservation. At ISPS, Peer also heads the website team, and is responsible for research-related content on the site. Prior to joining ISPS, Peer was Research Director at Northwestern University’s Media Management Center and Readership Institute, which focuses on applied research primarily in the areas of media audience, content, and management strategy. At Northwestern University, she was also Associate Professor (clinical) at the Medill School of Journalism and held a courtesy appointment in the Department of Communication Studies. Her research interests include the media’s role in democracy and in the public opinion process. She has taught courses on communication theory, public opinion, media and society, the future of the media, and statistics for journalists. Peer received her Ph.D. and M.A. in Communication Studies from Northwestern University and a B.A. (summa cum laude) in Political Science from Tel-Aviv University.
About Ann Green
Ann Green is an independent research consultant focusing upon the digital lifecycle of scholarly resources, including their creation, delivery, management, long-term stewardship, and preservation. Ann has an extensive background in digital archiving and user-driven support services, and has participated in the development and promotion of standards for statistical metadata and digital preservation. Current projects include policy development for the preservation of research data and cultural heritage collections, program evaluation of digital archives and related services, and articulating the requirements for ‘cradle to archive’ management of digital collections.