2019-06-21

Objectives

You will learn to:

  • get started with R (quick intro)
  • work in the tidyverse dialect
  • explore ggplot2 and dplyr on dataSaurus (beginners only)
  • summarise a dataset using different packages and benchmark them
  • use R on the clusters
  • perform single-node parallelisation on iris

What is R?

R is shorthand for “GNU R”:

  • An interactive programming language derived from S (J. Chambers, Bell Labs, 1976)
  • Appeared in 1993, created by R. Ihaka and R. Gentleman, University of Auckland, NZ
  • Focus on data analysis and plotting
  • R is also shorthand for the ecosystem around this language
    • Book authors
    • Package developers
    • Ordinary useRs

Learning to use R will make you more efficient and facilitate the use of advanced data analysis tools.

Why use R?

  • It’s free and open-source
  • easy to install and maintain
  • multi-platform (Windows, macOS, GNU/Linux)
  • can process big files and analyse huge amounts of data (db tools)
  • integrated data visualization tools, including dynamic ones with shiny
  • fast, and even faster with C++ integration via Rcpp
  • easy to get help

Constant trend

Packages

More than 16,000 packages in Feb 2019

CRAN

reliable: package is checked during submission process

MRAN for Windows users

bioconductor

dedicated to biology.

typical install:

# install.packages("BiocManager")
BiocManager::install("limma")

GitHub

easy install thanks to remotes.

# install.packages("remotes")
remotes::install_github("tidyverse/readr")

could be a security issue: packages on GitHub do not go through the CRAN submission checks

CRAN install from RStudio

GitHub install from the RStudio console

more in the article from David Smith

R ambiguity

Roger D. Peng

R is hard to learn

R base is complex, has a long history and many contributors

Why R is hard to learn

  • Unhelpful help: ?print
  • generic methods: print.data.frame
  • too many commands: colnames, names
  • inconsistent names: read.csv, load, readRDS
  • lax syntax, as R was designed for interactive usage
  • too many ways to select variables: df$x, df$"x", df[,"x"], df[[1]]
  • […] see r4stats’ post for the full list
  • the tidyverse curse

“Navigating the balance between base R and the tidyverse is a challenge to learn” (Robert A. Muenchen)
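The many ways to select a variable listed above can be checked on a toy data frame. This is only a minimal sketch: `df` and its columns are made up for illustration.

```r
# Hypothetical toy data frame, not taken from the slides
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

by_dollar   <- df$x        # by name
by_quoted   <- df$"x"      # by quoted name
by_matrix   <- df[, "x"]   # matrix-style indexing, drops to a vector
by_position <- df[[1]]     # by position

# All four expressions return the same numeric vector
all_same <- identical(by_dollar, by_quoted) &&
  identical(by_dollar, by_matrix) &&
  identical(by_dollar, by_position)
all_same
```

With a tibble, `df[, "x"]` returns a one-column tibble instead of a vector, which is one of the consistency choices made by the tidyverse.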

Tidyverse

creator

We think the tidyverse is better, especially for beginners:

  • it is recent (both an issue and an advantage)
  • it allows doing powerful things quickly
  • it is unified
  • it is consistent: one way to do things
  • it gives you the strength to learn base R
  • criticisms exist

Hadley Wickham

Hadley, Chief Scientist at RStudio

  • coined the term “tidyverse” at the useR! meeting in 2016
  • developed and maintains most of the core tidyverse packages

Tidy data

Definition

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
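To make the definition concrete, here is a minimal sketch that reshapes an untidy table (one column per year) into a tidy one. The data are invented, and `pivot_longer()` assumes tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# Hypothetical untidy table: the variable "year" is spread over column names
untidy <- data.frame(
  country = c("A", "B"),
  `2018`  = c(10, 20),
  `2019`  = c(11, 22),
  check.names = FALSE   # keep "2018" / "2019" as column names
)

# Tidy version: one column per variable, one row per observation
tidy <- pivot_longer(
  untidy,
  cols      = c("2018", "2019"),
  names_to  = "year",
  values_to = "cases"
)
tidy
```

Each row of `tidy` is now one observation (a country in a given year), which is the shape dplyr and ggplot2 expect.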

Tidyverse

trends

Tidyverse

packages in processes

Tidyverse components

core / extended

Core

  • ggplot2, for data visualization
  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for data import
  • purrr, for functional programming
  • tibble, for tibbles, a modern re-imagining of data frames
  • stringr, for strings
  • forcats, for factors

source: http://tidyverse.tidyverse.org/. H.Wickham

Extended

  • Modelling
    • modelr, for modelling within a pipeline
    • broom, for models -> tidy data
  • Programming
    • rlang, low-level API
    • glue, alternative to paste

Reproducibility with RMarkdown reports

Why use rmarkdown?

  • write detailed reports
  • ensure reproducibility
  • keep track of your analyses
  • comment/describe each step of your analysis
  • export a single (Rmd) document to various formats (PDF, HTML, …)
  • text file that can be managed by a version control system (like git)

Including R code

Rmarkdown document

Rmarkdown

  • extends markdown
  • place R code in chunks
  • chunks will be evaluated
  • can also handle bash, python, css, …

Knitr

  • extracts R chunks
  • interprets them
  • formats results as markdown
  • reintegrates them into the main document (md)
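The knitr steps above can be tried without pandoc by calling `knitr::knit()` on an in-memory document. This is a minimal sketch, assuming the knitr package is installed; the two-line report is made up for illustration.

```r
library(knitr)

# Three backticks, built programmatically to keep this example readable
fence <- strrep("`", 3)

# A tiny Rmarkdown document: one sentence and one R chunk
rmd <- c(
  "A tiny report.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# knit() evaluates the chunk and returns the resulting markdown as a string
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

The returned markdown contains both the chunk source and its printed result; pandoc would then convert it to the final format.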

Pandoc

  • pandoc converts markdown to the desired document (PDF, HTML, …)

RStudio

makes working with R easier

RStudio is an Integrated Development Environment (IDE).

Features

  • Console to run R, with syntax highlighter
  • Editor to work with scripts
  • Viewer for data / plots
  • Package management (including building)
  • Autocompletion using TAB
  • Cheatsheets
  • Git integration for versioning
  • Build for website / packages
  • Inline outputs (>= v1.03)
  • Keyboard shortcuts
  • Notebooks

Warning

Don’t mix up R and RStudio.
R needs to be installed first.

Rstudio

The 4 panels layout

For reproducibility

options to activate / deactivate

Organising files

Use projects

Data types and structures

R base

4 main types

mode()

Type                 Example
numeric              integer (2L), double (2.34)
character (strings)  "tidyverse!"
logical (boolean)    TRUE / FALSE
complex              2+0i

in the console

2L
[1] 2
typeof(2L)
[1] "integer"
mode(2L)
[1] "numeric"
2.34
[1] 2.34
typeof(2.34)
[1] "double"
"tidyverse!"
[1] "tidyverse!"
TRUE
[1] TRUE
2+0i
[1] 2+0i

Special cases

NA   # not available, missing data
NA_real_
NA_integer_
NA_character_
NA_complex_
NULL # empty
-Inf # negative infinity
Inf  # positive infinity
NaN  # Not a Number
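How these special values behave is easier to see in a few calls. A minimal sketch using base R only; the vector `x` is made up.

```r
x <- c(1, NA, 3)

mean(x)                 # NA: missing values propagate
mean(x, na.rm = TRUE)   # 2: NA is removed before averaging

typeof(NA)              # "logical", the default type of NA
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

length(NULL)            # 0: NULL is an empty object, not a missing value
is.nan(0 / 0)           # TRUE
is.na(NaN)              # TRUE: NaN also counts as missing
is.nan(NA)              # FALSE: NA is not NaN
1 / 0                   # Inf
```

The typed variants such as `NA_integer_` matter mostly in type-strict functions, where a plain `NA` (logical) would not match the expected type.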

Structures

Vectors

c() is the function to combine (concatenate) values into a vector

4
[1] 4
c(43, 5.6, 2.90)
[1] 43.0  5.6  2.9

Factors

factor() converts strings to factors; the levels are the dictionary of possible values

factor(c("AA", "BB", "AA", "CC"))
[1] AA BB AA CC
Levels: AA BB CC
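The "dictionary" idea can be made visible by looking at the levels and the integer codes stored behind the factor. A minimal sketch in base R, reusing the factor above:

```r
f <- factor(c("AA", "BB", "AA", "CC"))

levels(f)       # the dictionary: "AA" "BB" "CC"
codes <- as.integer(f)
codes           # positions in the dictionary: 1 2 1 3

# Looking the codes up in the dictionary gives back the original strings
levels(f)[codes]

# Counts per level
table(f)
```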

Lists

Lists are very important, as they can contain anything

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4L)
$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4
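Elements of a list can be extracted in several ways, and a list can hold other lists or even functions. A minimal sketch in base R; `nested` is a made-up object for illustration.

```r
l <- list(f = factor(c("AA", "AA")),
          v = c(43, 5.6, 2.90),
          s = 4L)

l$v          # extract by name: a numeric vector
l[["s"]]     # extract by name with [[ ]]: the integer 4
l["s"]       # single bracket keeps the container: a list of length 1

# A list can contain anything, including another list and a function
nested <- list(inner = l, fun = mean)
nested$fun(nested$inner$v)   # applies mean() to the vector stored in l
```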

Data frames are special lists

data.frame

same as a list, but all elements must have the same length

Example, 3 elements of same size

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))
   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4
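That a data frame is a list of equal-length columns can be checked directly. A minimal sketch in base R; the failing call at the end uses made-up columns `a` and `b`.

```r
df <- data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))

is.list(df)    # TRUE: a data frame is a list
length(df)     # 3: one element per column
lengths(df)    # every column has 3 values
df[["v"]]      # list-style extraction works on columns

# Columns of incompatible lengths are rejected; the error message is captured here
err <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
err
```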

assignment operator, create object

The operator is <-; it associates a name with an object

my_vec <- c(3, 4, 1:3)
my_vec
[1] 3 4 1 2 3

Tip

RStudio has the built-in shortcut Alt+- for <-

hierarchy

source: H. Wickham - R for data science, licence CC

in console

is.vector(c("a", "c"))
[1] TRUE
mode(c("a", "c"))
[1] "character"
is.vector(list(a = 1))
[1] TRUE
is.atomic(list(a = 1))
[1] FALSE
is.data.frame(list(a = 1))
[1] FALSE

Vectorized operation

one of the best R features

my_vec <- 10:18
my_vec
[1] 10 11 12 13 14 15 16 17 18
my_vec + 2
[1] 12 13 14 15 16 17 18 19 20

warning

  • R recycles vectors that are too short
  • without any warning when the longer length is a multiple of the shorter one:
1:10 + c(1, 2)
 [1]  2  4  4  6  6  8  8 10 10 12
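For contrast, R does warn when the longer length is not a multiple of the shorter one. A minimal sketch in base R that captures the warning message instead of printing it:

```r
# Silent recycling: 10 is a multiple of 2
ok <- 1:10 + c(1, 2)
ok

# 10 is not a multiple of 3: R still recycles but emits a warning,
# which tryCatch() intercepts and returns as text
msg <- tryCatch(
  1:10 + c(1, 2, 3),
  warning = function(w) conditionMessage(w)
)
msg
```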

Avoid writing loops: someone else already wrote them for you (purrr, lapply)

(if you do write one, remember not to grow a vector)

res <- vector(mode = "numeric", length = length(my_vec))
for (i in seq_along(my_vec)) {
  res[i] <- my_vec[i] + 2
}
res
[1] 12 13 14 15 16 17 18 19 20
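The same result can be obtained without an explicit loop. A minimal sketch comparing `vapply()` (from the lapply family) with the plain vectorized form, using base R only:

```r
my_vec <- 10:18

# vapply() applies the function to each element and checks that
# every result is a single number (type stable)
res_vapply <- vapply(my_vec, function(x) x + 2, numeric(1))

# The vectorized operation is shorter still
res_vec <- my_vec + 2

identical(res_vapply, res_vec)
```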

For loops are fine

growing

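for_loop <- function(x) {
  res <- c()  # starts empty and is re-allocated as it grows
  for (i in seq_len(x)) {
    res[i] <- i
  }
  res  # return the vector (the for loop itself returns NULL)
}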

alloc

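for_loop <- function(x) {
  # allocate the full vector once, before the loop
  res <- vector(mode = "integer", 
                length = x)
  for (i in seq_len(x)) {
    res[i] <- i
  }
  res  # return the vector (the for loop itself returns NULL)
}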

Rcpp

library(Rcpp)
cppFunction("NumericVector rcpp(int x) {
  NumericVector res(x);
  for (int i = 0; i < x; i++) {
    res[i] = i + 1;  // C++ is 0-indexed: store 1..x as the R versions do
  }
  return res;
}")
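To compare the growing and pre-allocating strategies side by side, the two R versions need distinct names. This is a minimal sketch with base R's `system.time()`; the names `grow` and `alloc` and the size `n` are chosen here for illustration, and timings depend on the machine and R version.

```r
# Growing: the vector is extended at every iteration
grow <- function(x) {
  res <- c()
  for (i in seq_len(x)) res[i] <- i
  res
}

# Pre-allocating: the vector has its final length from the start
alloc <- function(x) {
  res <- vector(mode = "integer", length = x)
  for (i in seq_len(x)) res[i] <- i
  res
}

n <- 1e5
t_grow  <- system.time(r_grow  <- grow(n))
t_alloc <- system.time(r_alloc <- alloc(n))

identical(r_grow, r_alloc)   # both strategies give the same vector
rbind(grow = t_grow, alloc = t_alloc)[, "elapsed"]
```

For more reliable numbers, a dedicated benchmarking package that repeats each call many times would be a better choice than a single `system.time()` run.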

purrr::map() example

type stable

For the 3 cyl groups of mtcars

  • fit a linear model (miles per gallon explained by the weight)
    • the equation is then: \(mpg = \beta_0 + \beta_1 \times wt\),
    • formula in R: mpg ~ wt

map the linear model

  • map(YOUR_LIST, YOUR_FUNCTION)
  • YOUR_LIST = spl_mtcars
  • YOUR_FUNCTION can be an anonymous function (declared on the fly)

library(dplyr)  # group_split() and the %>% pipe
library(purrr)  # map() and map_dbl()

spl_mtcars <- group_split(mtcars, cyl)
spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared") %>% 
  str()
 num [1:3] 0.509 0.465 0.423

one step per line

  • generate 3 tibbles (list)
  • run the linear model on each (list)
    • ~ is a shortcut for an anonymous function
  • summarise the 3 linear models (list)
    • even better with broom::glance()
  • extract \(R^2\) (doubles)
    • map_dbl() forces a numeric vector and throws an error if coercion fails
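For comparison, the same steps can be written with base R only. This is a sketch of an equivalent, not the approach of the slides: `split()` replaces `group_split()`, `lapply()` replaces `map()`, and `vapply()` plays the type-stable role of `map_dbl()`.

```r
# One data frame per number of cylinders (list of 3)
spl <- split(mtcars, mtcars$cyl)

# One linear model per group (list of 3)
fits <- lapply(spl, function(d) lm(mpg ~ wt, data = d))

# Extract R squared; numeric(1) makes vapply() fail
# if any result is not a single number
r2 <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))
round(r2, 3)
```

Unlike `group_split()`, `split()` keeps the group labels (4, 6, 8) as names of the result.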

Acknowledgements

Practical Session

dataSaurus & furrr