RMarkdown for writing reproducible scientific papers

Mike Frank, adapted from work by Mike and Chris Hartgerink

2018-06-29

Introduction

This document is a short tutorial on using RMarkdown to mix prose and code together for creating reproducible scientific documents. If you find any errors and have a GitHub account, please suggest changes here. This is adapted from a slightly longer tutorial that Mike and Chris Hartgerink taught together at SIPS 2017.

In short: RMarkdown allows you to create documents that are compiled with code, producing your next scientific paper.

Now we’re together trying to help spread the word, because it can make writing manuscripts so much easier! We wrote this handout in RMarkdown as well. Take a look at the source.

Who is this aimed at?

We aim this document at anyone writing manuscripts and using R, including those who…

  1. …collaborate with people who use Word
  2. …want to write complex equations
  3. …want to be able to change bibliography styles with less hassle
  4. …want to spend more time actually doing research!

Why write reproducible papers?

Cool, thanks for sticking with us and reading up through here!

There are three reasons to write reproducible papers. To be right, to be reproducible, and to be efficient. There are more, but these are convincing to us. In more depth:

  1. To avoid errors. Using an automated method for scraping APA-formatted stats out of PDFs, Nuijten et al. (2016) found that over 10% of p-values in published papers were inconsistent with the reported details of the statistical test, and 1.6% were what they called “grossly” inconsistent, e.g., the difference between the p-value and the test statistic meant that one implied statistical significance and the other did not. Nearly half of all papers had errors in them.

  2. To promote computational reproducibility. Computational reproducibility means that other people can take your data and get the same numbers that are in your paper. Even if you don’t have errors, it can still be very hard to recover the numbers from published papers because of ambiguities in analysis. Creating a document that literally specifies where all the numbers come from in terms of code that operates over the data removes all this ambiguity.

  3. To create spiffy documents that can be revised easily. This is actually a really big neglected one for us. At least one of us used to tweak tables and figures by hand constantly, leading to a major incentive never to rerun analyses because it would mean re-pasting and re-illustratoring all the numbers and figures in a paper. That’s a bad thing! It means you have an incentive to be lazy and to avoid redoing your stuff. And you waste tons of time when you do. In contrast, with a reproducible document, you can just rerun with a tweak to the code. You can even specify what you want the figures and tables to look like before you’re done with all the data collection (e.g., for purposes of preregistration or a registered report).

Learning goals

By the end of this class you should:

Getting Started

Fire up RStudio and create a new RMarkdown file. Don’t worry about the settings; we’ll get to those later.

If you click on “Knit” (or hit CTRL+SHIFT+K) the RMarkdown file will run, generate all results, and present you with a PDF file, HTML file, or a Word file. If RStudio asks you to install packages, click yes and see whether everything works to begin with.

We need that before we teach you more about RMarkdown. But you should feel good if you get here already, because honestly, you’re about 80% of the way to being able to write basic RMarkdown files. It’s that easy.

Structure of an RMarkdown file

An RMarkdown file contains several parts. Most essential are the header, the body text, and code chunks.

Body text

The body of the document is where you actually write your reports. This is primarily written in the Markdown format, which is explained in the Markdown syntax section.

The beauty of RMarkdown is, however, that you can evaluate R code right in the text. To do this, you start inline code with `r, type the code you want to run, and close it again with a `. The backtick key is usually below the escape (ESC) key or next to the left SHIFT button.

For example, if you want to have the result of 48 minus 35 in your text, you type `r 48-35`, which returns 13. Please note that if you return a value with many decimals, it will also print these depending on your settings (for example, 3.1415927).
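Before putting an expression inline, it can help to check in the console what it will insert. The following is a minimal sketch in base R; the object names are only illustrative.

```r
# The value that an inline expression such as `r 48 - 35` would insert
difference <- 48 - 35
difference

# Round before printing to avoid a long run of decimals in running text
pi_rounded <- round(pi, 2)
pi_rounded

# format() with nsmall keeps trailing zeros, which round() alone drops
p_text <- format(round(0.1, 2), nsmall = 2)
p_text
```

Note that `format()` returns a character string, which is what you usually want for text that goes straight into a sentence.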

Code chunks

In the section above we introduced you to running code inside text, but often you need to take several steps in order to get to the result you need. And you don’t want to do data cleaning in the text! This is why there are code chunks. A simple example is a code chunk loading packages.

First, insert a code chunk by going to Code->Insert code chunk or by pressing CTRL+ALT+I. Inside this code chunk you can then type for example, library(ggplot2) and create an object x.

library(ggplot2)

x <- 1 + 1

If you do not want the contents of the code chunk to be shown in your document, you include echo=FALSE in the chunk header (the options at the start of the code chunk). We can now use the contents from the above code chunk to print results (e.g., \(x=2\)).

These code chunks can contain whatever you need, including tables and figures (which we will go into more later). Note that all code chunks treat the location of the RMarkdown file as the working directory, so when you read in data, use a path relative to that file.
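As a sketch of what a relative path looks like in practice, assume a hypothetical project folder that holds the RMarkdown file next to a data/ subfolder. The folder and file names below are made up for illustration, and a temporary directory stands in for the project so the example runs anywhere.

```r
# Hypothetical layout: paper.Rmd sits next to data/cars.csv.
# A temporary folder stands in for the project directory in this sketch.
project_dir <- file.path(tempdir(), "paper_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(head(mtcars), file.path(project_dir, "data", "cars.csv"), row.names = FALSE)

# When knitting, the working directory is the folder containing the .Rmd,
# so a relative path is enough. setwd() only imitates that situation here.
old_wd <- setwd(project_dir)
cars <- read.csv(file.path("data", "cars.csv"))
setwd(old_wd)

nrow(cars)
```

In a real chunk you would only need the `read.csv()` line; relative paths keep the document working when a collaborator clones the project to a different location.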

Markdown syntax

Markdown is one of the simplest document languages around. It is an open standard and can be converted into .tex, .docx, .html, .pdf, etc. This is the main workhorse of RMarkdown and is very powerful. You can learn Markdown in five (!) minutes. Other resources include http://rmarkdown.rstudio.com/authoring_basics.html, and this cheat sheet.

You can do some pretty cool tricks with Markdown, but these are the basics:
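A minimal sketch of the most common constructs, using standard Markdown syntax as processed by RMarkdown (the link target is the authoring-basics page mentioned above):

```markdown
# Header 1
## Header 2

*italic* and **bold** text

- a bullet item
- another bullet item

1. a numbered item
2. another numbered item

[link text](http://rmarkdown.rstudio.com/authoring_basics.html)

`inline code`
```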

If you want a more extensive description of all the potential of Markdown, this introduction to Markdown is highly detailed.

The great thing about Markdown is that it works almost everywhere! GitHub, OSF, Slack, many wikis, and even in text documents it looks pretty good. I find myself writing emails in Markdown just because it’s a clear and consistent way to format and outline.

Exercises

Swap over to your new sample markdown.

  1. Outlining using headings is a really great way to keep things organized! Try making a bunch of headings, and then recompiling your document.
  2. Add a table of contents. This will involve going to the header of the document (the YAML), and adding some options to the html document bit. You want it to look like this (indentation must be correct):
output: 
  html_document:
    toc: true

Now recompile. Looks nice, right? (Pro-tip: you can specify how deep the TOC should go by adding toc_depth: 2 to go two levels deep.)

  3. Try adding another option: toc_float: true. Recompile – super cool. There are plenty more great output options that you can modify. Here is a link to the documentation.
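Putting the options from these exercises together, the output section of the header might look like the following sketch. All three options are the ones named in the exercises; the values are only examples.

```yaml
output:
  html_document:
    toc: true         # add a table of contents
    toc_depth: 2      # include headings down to level two
    toc_float: true   # keep the TOC visible in a floating sidebar
```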

Headers, Tables, and Graphs

Headers

We’re going to want more libraries loaded (for now we’re loading them inline).

library(knitr)
library(ggplot2)
library(broom)
library(devtools)

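We often also set chunk options so that, for example, code is or isn’t shown (echo), figures are drawn at a given size (fig.width and fig.height), warnings and messages are suppressed (warning and message), and computations are cached (cache).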

There are many others available as well. Caching can be very helpful for large files, but can also cause problems when there are external dependencies that change. An example that is useful for manuscripts is:

opts_chunk$set(fig.width=8, fig.height=5, 
               echo=TRUE, 
               warning=FALSE, message=FALSE, 
               cache=TRUE)

Graphs

It’s really easy to include graphs, like this one. (Using the mtcars dataset that comes with R).

qplot(hp, mpg, col = factor(cyl), data = mtcars)

All you have to do is make the plot and it will render straight into the text.
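qplot() has been deprecated since ggplot2 3.4.0, so on a recent installation the call above may produce a deprecation warning. A sketch of the same plot written with ggplot() layers follows; the axis and legend labels are our own additions.

```r
library(ggplot2)

# Same scatterplot as the qplot() call, built from explicit layers
p <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles per gallon", colour = "Cylinders")

# Printing the object inside a chunk renders the figure into the document
p
```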

External graphics can also be included, as follows:

knitr::include_graphics("path/to/file")

Tables

There are many ways to make good-looking tables using RMarkdown, depending on your display purpose.

We recommend starting with kable:

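kable(head(mtcars), digits = 1)

|                   | mpg  | cyl | disp | hp  | drat | wt  | qsec | vs | am | gear | carb |
|:------------------|-----:|----:|-----:|----:|-----:|----:|-----:|---:|---:|-----:|-----:|
| Mazda RX4         | 21.0 | 6   | 160  | 110 | 3.9  | 2.6 | 16.5 | 0  | 1  | 4    | 4    |
| Mazda RX4 Wag     | 21.0 | 6   | 160  | 110 | 3.9  | 2.9 | 17.0 | 0  | 1  | 4    | 4    |
| Datsun 710        | 22.8 | 4   | 108  | 93  | 3.8  | 2.3 | 18.6 | 1  | 1  | 4    | 1    |
| Hornet 4 Drive    | 21.4 | 6   | 258  | 110 | 3.1  | 3.2 | 19.4 | 1  | 0  | 3    | 1    |
| Hornet Sportabout | 18.7 | 8   | 360  | 175 | 3.1  | 3.4 | 17.0 | 0  | 0  | 3    | 2    |
| Valiant           | 18.1 | 6   | 225  | 105 | 2.8  | 3.5 | 20.2 | 1  | 0  | 3    | 1    |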
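kable() also accepts column labels and a caption, which is often all a manuscript table needs. A small sketch, with a column selection and label text chosen only for illustration:

```r
library(knitr)

# Pick a few columns and give them reader-friendly headers and a caption
cars_table <- kable(
  head(mtcars[, c("mpg", "cyl", "hp")]),
  digits = 1,
  col.names = c("Miles per gallon", "Cylinders", "Horsepower"),
  caption = "First six cars in the mtcars dataset."
)
cars_table
```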

Statistics

It’s also really easy to include statistical tests of various types.

For this, an option is the broom package, which formats the outputs of various tests really nicely. Paired with knitr’s kable you can make very simple tables in just a few lines of code.

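mod <- lm(mpg ~ hp + cyl, data = mtcars)
kable(tidy(mod), digits = 3)

| term        | estimate | std.error | statistic | p.value |
|:------------|---------:|----------:|----------:|--------:|
| (Intercept) | 36.908   | 2.191     | 16.847    | 0.000   |
| hp          | -0.019   | 0.015     | -1.275    | 0.213   |
| cyl         | -2.265   | 0.576     | -3.933    | 0.000   |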

Of course, cleaning these up can take some work. For example, we’d need to rename a bunch of fields to make this table have the labels we wanted (e.g., to turn hp into Horsepower).
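One way to do that renaming is a small lookup vector applied to the term column before calling kable(). This is a sketch, and the display labels and column headers are our own choices rather than anything broom provides.

```r
library(broom)
library(knitr)

mod <- lm(mpg ~ hp + cyl, data = mtcars)
coefs <- tidy(mod)

# Lookup from model terms to display labels (labels chosen for illustration)
labels <- c("(Intercept)" = "Intercept", hp = "Horsepower", cyl = "Cylinders")
coefs$term <- unname(labels[coefs$term])

kable(coefs, digits = 3,
      col.names = c("Term", "Estimate", "SE", "t", "p"))
```

Keeping the lookup in one named vector means a new predictor only needs one new entry, and a term without a label shows up as NA rather than silently keeping its raw name.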

We often need APA-formatted statistics. We can compute them first, and then print them inline.

ts <- with(mtcars, t.test(hp[cyl == 4], hp[cyl == 6]))

There’s a statistically-significant difference in horsepower for 4- and 6-cylinder cars (\(t(11.49) = -3.56\), \(p = 0.004\)).

To insert these stats inline I wrote e.g. round(ts$parameter, 2) inside an inline code block. (APA would require omission of the leading zero; papaja::printp() will let you do that, see below.)
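As a sketch of how the pieces fit together, the test result can be assembled into one string and then dropped into the text with a single inline expression. The names t_res and apa_t are illustrative; t_res is used here instead of ts only to avoid masking the base function stats::ts().

```r
# Welch t-test comparing horsepower of 4- and 6-cylinder cars
t_res <- with(mtcars, t.test(hp[cyl == 4], hp[cyl == 6]))

# Build the report string from the degrees of freedom, statistic, and p-value
apa_t <- sprintf("t(%.2f) = %.2f, p = %.3f",
                 t_res$parameter, t_res$statistic, t_res$p.value)
apa_t
```

sprintf() keeps trailing zeros, so a value like 3.50 is not shortened to 3.5 the way it would be with round().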

Note that rounding can occasionally get you in trouble here, because it’s very easy to have an output of \(p = 0\) when in fact \(p\) can never be exactly equal to 0. Nonetheless, this can help you prevent rounding errors and the wrath of statcheck.
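To guard against printing \(p = 0\), a small helper can report anything below the rounding threshold as an inequality. The function below is a hypothetical base-R sketch, not part of knitr or papaja; papaja::printp() is the ready-made option mentioned above.

```r
# Hypothetical helper: report tiny p-values as "< .001" rather than "= 0",
# and drop the leading zero as APA style requires
format_p <- function(p, digits = 3) {
  threshold <- 10^(-digits)
  ifelse(
    p < threshold,
    paste0("< ", sub("^0", "", format(threshold, scientific = FALSE))),
    paste0("= ", sub("^0", "", formatC(p, digits = digits, format = "f")))
  )
}

format_p(0.00001)
format_p(0.0042)
```

Inline, this would be written as p followed by the inline expression, so that the comparison sign comes from the helper.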

Exercises

  1. Using the mtcars dataset, insert a table and a graph of your choice into the document. (If you’re feeling uninspired, try hist(mtcars$mpg).)

Collaboration

How do we collaborate using RMarkdown? There are lots of different workflows that people use. The way it works in my lab is that the first author typically makes a GitHub repository with the markdown-formatted document in it. Sometimes we just collaborate through GitHub or through writing comments on the rendered PDF and sending them back to the first author. (I like the Dropbox PDF comment interface for this.)

But sometimes you want to do lots of line-editing or write collaboratively, especially with someone who doesn’t like GitHub and markdown and all that. For these cases, we often paste the intro into Google Docs or Word and edit until we converge, then the first author puts that back into the markdown. This is a little clunky, but not too bad. And critically, all the figures and numbers get rendered fresh when you re-knit, so nothing can get accidentally altered during the editing process.

Writing APA-format papers

(Thanks to Frederick Aust for contributing this section!)

The end-game of reproducible research is to knit your entire paper. We’ll focus on APA-style writeups. Managing APA format is a pain in the best of times. Isn’t it nice to get it done for you?

We’re going to use the papaja package. papaja is an R package that includes an R Markdown template for producing documents that adhere to the American Psychological Association (APA) manuscript guidelines (6th Edition).

Software requirements

To use papaja, make sure you are using the latest versions of R and RStudio. If you want to create PDF files in addition to DOCX files, you need TeX 2013 or later. Try MiKTeX for Windows, MacTeX for Mac, or TeX Live for Linux. Some Linux users may need a few additional TeX packages for the LaTeX document class apa6 to work. (For Ubuntu, we suggest running: sudo apt-get install texlive texlive-publishers texlive-fonts-extra texlive-latex-extra texlive-humanities lmodern.)

Installing papaja

papaja has not yet been released on CRAN but you can install it from GitHub.

# Install devtools package if necessary
if(!"devtools" %in% rownames(installed.packages())) install.packages("devtools")

# Install papaja
devtools::install_github("crsh/papaja")

Creating a document

The APA manuscript template should now be available through the RStudio menus when creating a new R Markdown file.

When you click RStudio’s Knit button, papaja, rmarkdown, and knitr work together to create an APA-conformant manuscript that includes both your manuscript text and the results of any embedded R code.

Note: if you don’t have TeX installed on your computer, or if you would like to create a Word document, replace output: papaja::apa6_pdf with output: papaja::apa6_word in the document YAML header.

papaja provides some rendering options that only work if you use output: papaja::apa6_pdf. figsintext indicates whether figures and tables should be included at the end of the document (as required by APA guidelines) or rendered in the body of the document. If figurelist, tablelist, or footnotelist are set to yes, a list of figure captions, table captions, or footnotes is given following the reference section. lineno indicates whether lines should be continuously numbered throughout the manuscript.
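In the YAML header these options might be set as in the sketch below. The option names are the ones described above, and the values are only examples. Option names can differ between papaja versions, so check the template that ships with your installed version.

```yaml
output       : papaja::apa6_pdf   # the options below only take effect for PDF output

figsintext   : no    # no = figures and tables at the end of the document
figurelist   : no    # yes = list of figure captions after the references
tablelist    : no    # yes = list of table captions after the references
footnotelist : no    # yes = list of footnotes after the references
lineno       : yes   # yes = continuous line numbers
```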

Bibliographic management

It’s also possible to include references using BibTeX, by using @ref syntax. An option for managing references is BibDesk, which integrates with Google Scholar. (But many other options are possible.)

With a BibTeX file included, you can refer to papers. As an example, @nuijten2016 renders as an in-text citation along the lines of “Nuijten et al. (2016)”, and you can cite parenthetically with [@nuijten2016], which renders along the lines of “(Nuijten et al., 2016)”. Take a look at the papaja APA example to see how this works.
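A minimal sketch of the pieces involved, assuming a hypothetical BibTeX file named references.bib that sits next to the RMarkdown file and contains an entry with the key nuijten2016. The file name and the example sentences are made up for illustration.

```markdown
---
bibliography: references.bib
---

@nuijten2016 scraped APA-formatted statistics out of PDFs.

Reporting inconsistencies are common [@nuijten2016].

A page reference can be added inside the brackets [@nuijten2016, p. 3].
```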

citr is an R package that provides an easy-to-use RStudio addin that facilitates inserting citations. The addin will automatically look up the Bib(La)TeX file(s) specified in the YAML front matter. The references for the inserted citations are automatically added to the document’s reference section.

Once citr is installed (install.packages("citr")) and you have restarted your R session, the addin appears in the menus and you can define a keyboard shortcut to call the addin.

Exercise

Make sure you’ve got papaja, then open a new template file. Compile this document, and look at how awesome it is. (To compile to PDF you need a TeX distribution such as TeX Live, so you may need to wait and install this later if it’s not working.)

Try pasting in your figure and table from your other RMarkdown (don’t forget any libraries you need to make it compile). Presto, ready to submit!

For a bit more on papaja, check out this guide.