Research starts with data collection. Before you can do your analysis, you spend hours, weeks, even months merging tables and transforming variables. That time is wasted if you don’t keep detailed logs of the process. Here’s a good practice guide.
Basic data management e.g. using syntax to create an audit trail, not creating multiple versions of the same file, etc. is hugely important, but rarely taught, and hence often very badly performed, introducing errors and data structure problems that haunt later analyses… (Chris Stride, statistician, Sheffield University)
- When downloading secondary data, keep a note of the date, version, source incl. URL, and their suggested citation.
- Download all codebooks available for that particular version of the data set.
Tip: I take screenshots when I download data; these capture much of that information.
Create a consistent folder structure with separate folders for
- raw data downloads,
- data merging, and
- data analysis.
- Put the raw data into a separate folder.
- Never touch the original, raw data files. Ever. Do not change rows or columns, do not delete anything, do not add anything. Original data are never to be touched again after downloading them.
Tip: I have a folder called “Data_Downloads”. Each sub-folder (e.g. UNCTAD, WORLD BANK) contains the original Excel or .csv files that I downloaded, stored under “original data”. Other people use different folder structures, but they follow the same principle. See here, in Gandrud’s chapter on file management and data storing [pdf], or in the tips by statistician Jeff Leek.
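A hypothetical layout following this principle can be set up directly from R (the folder names below are illustrative, not a prescribed structure):

```r
# Sketch: create a project skeleton with a protected raw-data area.
# Folder names are illustrative -- adapt them to your own project.
folders <- c(
  "Data_Downloads/UNCTAD/original_data",
  "Data_Downloads/WorldBank/original_data",
  "Data_Merging",
  "Data_Analysis"
)
for (f in folders) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

# Optional: make downloaded raw files read-only, so they cannot be
# edited by accident ("0444" = read-only for everyone).
raw_files <- list.files("Data_Downloads", recursive = TRUE, full.names = TRUE)
if (length(raw_files) > 0) Sys.chmod(raw_files, mode = "0444")
```

Creating the folders in code (rather than by hand) also documents the structure itself, which helps collaborators find their way around.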
If research data are well organised, documented, preserved and accessible, and their accuracy and validity is controlled at all times, the result is high quality data… (UK Data Archive, 2011).
Merging and Cleaning
- Merging and cleaning of data happens in a separate step from the data download. Therefore, create a folder called “merging data”.
- Within that, I merge my data in an Rscript called “creating_mastertable.R” (you could also call this e.g. “merging.R”). This Rscript loads all original data, cleans them, and compiles them into one large table per country and year (in my case). This table is then saved as my new, tidy data set for the analysis (I call it “mastertable”). Never copy-paste numbers from one Excel file to another to create such a table.
- Whenever I add a new variable to the data set, I switch from creating_mastertablev01.R to creating_mastertablev02.R, because adding a new variable can sometimes mess things up.
- When you have added all variables to your master table, save it as a new file, e.g. mastertablev01.RData or mastertablev01.csv. This can later be loaded into your analysis Rscript.
Tip: When I transform or recode variables (e.g. to take the log or to re-code conflict data), I add the newly created variable as an additional column to my master table. I can then check more easily if the transformation worked and I can also use both versions in my analysis to check for robustness.
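A minimal sketch of such a merging script. Toy inline data frames stand in for the downloaded files here; in a real script they would be read from the raw-data folder (e.g. with read.csv), and all variable and file names below are hypothetical:

```r
# creating_mastertable.R (sketch) -- inputs would normally come from
# the raw download folder, e.g. read.csv("Data_Downloads/.../gdp.csv").
gdp <- data.frame(country = c("A", "A", "B", "B"),
                  year    = c(2000, 2001, 2000, 2001),
                  gdp     = c(100, 110, 50, 55))
conflict <- data.frame(country  = c("A", "A", "B"),
                       year     = c(2000, 2001, 2000),
                       conflict = c(0, 1, 1))

# One row per country-year; all.x = TRUE keeps country-years that
# have no conflict record (their conflict value becomes NA).
mastertable <- merge(gdp, conflict, by = c("country", "year"), all.x = TRUE)

# Transformed variables go in as ADDITIONAL columns, so both the raw
# and the transformed version stay available for robustness checks.
mastertable$log_gdp <- log(mastertable$gdp)

write.csv(mastertable, "mastertablev01.csv", row.names = FALSE)
```

Because the whole table is rebuilt from the originals every time the script runs, any mistake can be fixed at the source and the pipeline re-run, with no copy-pasting.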
Researchers need to improve, enhance and professionalize their research data management skills (Corti et al, 2014)
Everything that you do with your variables should be documented from day 1.
- Make notes in a simple text file and keep it in the same folder as your data merging R code. I sometimes write down my rationale for kicking out a specific year, or for changing a number to “NA” because I happen to know that we cannot trust the data for that particular year. I keep going back to these text files whenever I add new data to the mastertable, or when I update the data with more recent years.
- You can even call this file your ‘codebook’ and keep it as a pretty table in Excel, Word or LaTeX. Keeping a codebook from day one will save you a lot of time later.
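A minimal codebook sketch, kept as plain text next to the merging code (the variables, sources and notes below are invented for illustration):

```
variable   source           transformation           notes
gdp        World Bank WDI   none                     current US$
log_gdp    derived          log(gdp)                 added as extra column
conflict   (conflict data)  recoded to 0/1           2003 set to NA (source unreliable)
```

One line per variable, stating where it came from, how it was transformed, and why any values were changed, is usually enough to reconstruct your decisions months later.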
Sharing Data with Collaborators
All the above will help you to share your data with collaborators. You will make it easy for them to trust your data. Jeff Leek has excellent tips on each of the files below that you should make available to your co-authors:
- The raw data (original data, unchanged!)
- A tidy data set (your master table)
- A code book describing each variable and its values in the tidy data set.
- An explicit and exact recipe you used to go from 1 to 2 and 3 (an Rscript or do-file with comments, or a separate document)
Data management is a big (and sometimes daunting) task (Michael N. Mitchell, 2010)
Tips for R, Stata and SPSS
- Use syntax in SPSS.
- Create do-files in Stata for your file management.
- Follow conventions on how to comment your code in R.
- Consider using R Markdown, knitr, GitHub, ProjectTemplate, or an R package to present your data collection, analysis and outputs all in one.
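For example, a minimal R Markdown skeleton that runs the merging step and the analysis in one reproducible document (the file and variable names below are hypothetical):

````
---
title: "Data collection and analysis"
output: html_document
---

## Data merging

```{r merge}
# Rebuild the master table from the raw downloads
source("creating_mastertable.R")
```

## Analysis

```{r model}
summary(lm(log_gdp ~ conflict, data = mastertable))
```
````

Knitting this file re-runs the whole pipeline from raw data to results, so the document itself is the audit trail.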
- Hadley Wickham, Tidy Data [pdf]
- Svend Juul, Take good care of your data [pdf]
- Jeff Leek, The Elements of Data Analytic Style [I paid £12, but you can download it for free on Leanpub]
- UK Data Archive, Managing and Sharing Data [pdf]
- Chris Stride, Data management using SPSS syntax
- Michael N. Mitchell (2010). Data Management Using Stata: A Practical Handbook
- Corti et al. (2014), Managing and Sharing Research Data. A Guide to Good Practice. Sage.
- Ball, Richard J. and Medeiros, Norm, Teaching Students to Document Their Empirical Research (July 12, 2011). Available at SSRN.
- BITSS Summer Institute Network, A collection of teaching materials to promote transparency in research
This collection of best practice examples was inspired by a discussion on good practice in data management on the Quantitative Methods Teaching list.