The Cambridge Replication Workshop 2013/14 just finished. In eight sessions, graduate students replicated a published paper and learned about reproducibility standards. This is a summary of student feedback on data transparency and the course itself. Some were extremely frustrated, a few dropped out, and those who stayed found the course “fantastic” and “incredible”.
Picking a paper
In the second run of the replication workshop, the main problem for students was finding a suitable paper to replicate. Last year the search for a paper (and data) took place during the first two weeks of the class; this year students were asked to bring papers to the first session to discuss suitability. I had provided guidelines and advice by email beforehand. The idea was to speed things up and leave more time to replicate the papers. However, some were overwhelmed by the lack of data access, not knowing where to find data for their papers in the first place, or how to decide whether the methods in ‘their’ papers were too advanced. I should mention that this is not a methods course that teaches e.g. regression and assigns replication as homework, but a stand-alone workshop in which students need to be able to assess the methods in the papers and match them with their own skill levels.
Software and statistics skills
Some of the students had got only as far as bivariate association in their previous courses; others were more advanced. During the later stages of the course, TAs jumped in and provided tutorials ‘on-the-go’, e.g. for ordered logit models. A second issue was that many students had only taken an ‘Introduction to R’ class or had even taught themselves R. R makes it easy to keep a log of the code used, which is great for reproducibility. One student who already knew Stata tried to learn R during the course and dropped out when that did not work. Others could run simple models in R, but had difficulty cleaning, merging, transforming and converting data from the originally provided files.
Data were not fully provided
A core problem for most students was data availability, as in the last replication workshop. Students’ feedback:
- “Not all the data was provided. The descriptions of the final regressions were impossible to understand.”
- “The data set the authors provided was different from what the authors said they used. At first glance, I thought it was just a matter of the number of participants as the data set they provided had more participants than what they published. However, it was during this replication workshop that I realised the data set they provided online could be a completely different file.”
- “The authors had initially uploaded wrong data. I contacted one of them and he sent me the real data of the paper. If it had not been possible to contact one of the authors it would have been hard to replicate any of the results published in the paper.”
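A mismatch like the one in the second quote can be spotted early by comparing the posted file with the sample size reported in the paper. Below is a minimal sketch in Python (the workshop itself used R), assuming the data come as a CSV file with one identifier per participant. The function name, the column name `participant_id`, the example rows and the reported N are all hypothetical and only illustrate the idea.

```python
import csv
import io


def check_sample_size(csv_text, reported_n, id_column):
    """Compare the number of unique participants in the posted data
    with the sample size reported in the paper (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Collect unique, non-empty participant identifiers.
    ids = {row[id_column] for row in reader if row[id_column].strip()}
    observed_n = len(ids)
    return {
        "observed_n": observed_n,
        "reported_n": reported_n,
        "difference": observed_n - reported_n,
        "matches": observed_n == reported_n,
    }


# Hypothetical posted data: four participants, while the paper reports three.
posted = "participant_id,score\n1,0.4\n2,0.9\n3,0.1\n4,0.7\n"
result = check_sample_size(posted, reported_n=3, id_column="participant_id")
print(result)
```

A non-zero difference does not show which source is wrong. It only flags early that the file may not be the one behind the published results, which is a good moment to contact the authors.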
Responses from the authors
One student from economics dropped out of the course because, after she had repeatedly asked the authors of five different papers for their data without success, there was nothing to replicate in class. That was a low point for me as well, and I could not persuade her to stay and work on a paper further outside her area of interest. The student felt it was a waste of time to keep going, and that economics might be the wrong field for replication studies.
- Another student stayed in class, but was still disappointed by getting no answer: “I contacted the main author twice regarding the data set he provided. I waited for at least a week but he never replied.”
- Another student was happy with the author’s responses: “Yes, it was very helpful.”
What was most challenging in the course
I asked students what was most challenging in the course, and what the biggest problems were with the replication or any other aspect of the course (e.g. software). Most said that time was a problem (I had a few drop-outs due to the intensity of the workshop), as was using the right functions in R.
- “It is very time consuming to replicate a paper.”
- One student tried to find R code that matched a specific Stata command through textbooks, Google and the R help list. “The biggest challenge was obtaining the R code to run the model I was attempting to use, which ended up not being possible.”
- Another student said that being a “novice in terms of using R”, the main challenge was to figure out “Do I have enough knowledge with R and the functions that the authors use?”
- Another student said that “getting started with unfamiliar models” from the original paper was difficult.
- A student wrote: “I did not have good knowledge of R codes when I first started this course. Even though I knew some R codes it was difficult to getting my knowledge worked out on some format of data.”
How useful was the course overall?
- “I think this is hugely beneficial for anyone who has a VERY good understanding of quantitative methods and the time and motivation to do it. I am surprised this isn’t done more often.”
- “I think this course is absolutely incredible. It taught me so much about how to publish legitimate and correct research. I have learned so much more about modelling and R than I ever could have imagined. As cheesy as it sounds, working through the process with the support of Nicole and the TAs gave me the confidence to try to create statistical models for projects that I had only dreamt of doing before. I cannot wait to apply my knowledge from this course to other projects.”
- “It was one of the best methodology courses that I took here in Cambridge. It was good that Nicole and TFs were always there to support us and explain why some of my R codes didn’t work or what kind of knowledge I need more to understand what function to use.”
- “It’s a great way to get some hands-on experience for sure.”
- “I think it was excellent! The workshop has a steep learning curve. But the approach was very hands on.”
- “I recommend this course as one of the good approaches to learning research methods. This course helped me understand not only specific stats concepts and analysis methods but also the whole procedure of interpreting research results (e.g., sample sizes, conceptual definitions, analyses methods, and limitations of using a specific method).”
- “It is also good to learning statistical techniques because it is necessary to deal with not only core stats techniques but also related data formats and the ways of interpretations, which are all included in the process of replication.”
- “Like most quant courses for (non-economist) European soc scientists (and especially at Cambridge) there is a huge difficulty in that the students come from quite varying backgrounds. Replicating a paper and then publishing (well publishing anything) is a huge time commitment. For the right student, the course as is is very beneficial.”
- “This course was fantastic!”
Reflections for next year
- Statistical knowledge: Ideally, students would have completed statistics courses in R up to ordered logit regression and, even better, panel data analysis. This does not always work out time-wise with the schedule of stats modules at Cambridge, but teaching methods within the replication course (through tutorials) takes up too much time. This would be less of an issue for a methods course like Gary King’s, which teaches statistics to get all students to a similar level and assigns replications as homework.
- Software skills: At the Methods Centre at Cambridge, we only started introducing R last year. Not all students came with the right level of software skills (although that was ‘officially’ among the prerequisites for the course). This led to time lost searching for R functions and helping with data cleaning, subsetting etc.
- Time: The course had eight sessions, plus time to work on the project over several weeks around the Christmas break. Some students would have preferred a longer course to get familiar with the methods and to have time to learn new techniques that add value to the replication study. The Christmas break was intended for getting the bulk of the work done, but it ‘happened’ just two weeks into the course, which was too early for some students to ‘really get started’. Last year, we had a similar break towards the end of the course, when students were further along with their replications and felt more confident working on their own.
- Interdisciplinarity & Voluntariness: In their feedback, most students preferred that the course remain open to all fields and remain voluntary and non-graded. I understand why students would think so, but I also see advantages in a course that concentrates on political science alone (or sociology, or psychology) and is compulsory. That might make it more efficient and could prevent drop-outs. On the other hand, students enjoyed the weekly updates and discussions from fields other than their own, and those who stayed in the class achieved a lot even without the incentive of a grade.
Syllabus, Assignments, Handouts and Data
… are in our dataverse: http://thedata.harvard.edu/dvn/dv/CambridgeReplication.