2018-06-13

Objectives

You will learn to:

  • install and run R and RStudio on your machine
  • use R on the clusters
  • explore ggplot2 and dplyr on the datasauRus dataset
  • summarise a dataset using different packages and benchmark them
  • demonstrate why packages are better than R base
  • perform single node parallelisation on iris

What is R?

R is shorthand for “GNU R”:

  • An interactive programming language derived from S (J. Chambers, Bell Labs, 1976)
  • Appeared in 1993, created by R. Ihaka and R. Gentleman, University of Auckland, NZ
  • Focus on data analysis and plotting
  • R is also shorthand for the ecosystem around this language
    • Book authors
    • Package developers
    • Ordinary useRs

Learning to use R will make you more efficient and facilitate the use of advanced data analysis tools.

Why use R?

  • It’s free! and open-source
  • easy to install / maintain
  • multi-platform (Windows, macOS, GNU/Linux)
  • can process big files and analyse huge amounts of data (db tools)
  • integrated data visualization tools, even dynamic
  • fast, and even faster with C++ integration via Rcpp.
  • easy to get help

Twitter R community

Constant trend

Packages

More than 12,000 packages in Feb 2018

CRAN

reliable: package is checked during submission process

MRAN for Windows users

bioconductor

dedicated to biology.

typical install:

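# biocLite() was the installer in June 2018; Bioconductor 3.8 and later
# use BiocManager::install("limma") instead
source("https://bioconductor.org/biocLite.R")
biocLite("limma")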

GitHub

easy install thanks to devtools.

# install.packages("devtools")
devtools::install_github("tidyverse/readr")

could be a security issue

CRAN install from RStudio

GitHub install from the RStudio console

more in the article from David Smith

R is hard to learn

R base is complex, has a long history and many contributors

Why R is hard to learn

  • Unhelpful help ?print
  • generic methods print.data.frame
  • too many commands colnames, names
  • inconsistent names read.csv, load, readRDS
  • lax syntax, since R was designed for interactive use
  • too many ways to select variables df$x, df$"x", df[,"x"], df[[1]]
  • […] see r4stats’ post for the full list
  • the tidyverse curse

“Navigating the balance between base R and the tidyverse is a challenge to learn.” (Robert A. Muenchen)
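The "too many ways to select variables" point can be checked directly in the console. The sketch below uses a small hypothetical data frame (`df`, with columns `x` and `y`, is an assumption made for illustration) and compares the forms listed above.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

a <- df$x         # by name, with $
b <- df$"x"       # same, with the name quoted
c1 <- df[, "x"]   # matrix-style indexing, column by name
d <- df[[1]]      # list-style indexing, column by position
e <- df[["x"]]    # list-style indexing, column by name

# All five forms return the same numeric vector
identical(a, b) && identical(a, c1) && identical(a, d) && identical(a, e)

# Single brackets without a comma keep a one-column data frame instead
is.data.frame(df["x"])
```

Having this many equivalent spellings is convenient interactively, but it is one reason code written by different people can look very different.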

Tidyverse

creator

We think the tidyverse is better, especially for beginners. It is

  • recent (both an issue and an advantage)
  • allows doing powerful things quickly
  • unified
  • consistent, one way to do things
  • gives you the confidence to learn base R afterwards
  • criticisms will come later (yes, many)

Hadley Wickham

Hadley, Chief Scientist at RStudio

  • coined the term “tidyverse” at the useR! conference in 2016
  • developed and maintains most of the core tidyverse packages

RStudio

RStudio

makes working with R easier

RStudio is an Integrated Development Environment (IDE).

Features

  • Console to run R, with syntax highlighter
  • Editor to work with scripts
  • Viewer for data / plots
  • Package management (including building)
  • Autocompletion using TAB
  • Cheatsheets
  • Git integration for versioning
  • Build for website / packages
  • Inline outputs (>= v1.03)
  • Keyboard shortcuts
  • Notebooks

Warning

Don’t mix up R and RStudio.
R needs to be installed first.

RStudio

The 4 panels layout

Four panels

scripting

  • could be your main window
  • should be an R Markdown document
  • tabs are great

Environment

  • Environment: displays loaded objects and their str()
  • History is useless IMO
  • nice git integration
  • database connections interface

Console

  • could be hidden with inline outputs
  • embed a nice terminal tab
  • Rmarkdown logs

Files / Plots / Help

  • necessary package management tab
  • unnecessary plots tabs with inline outputs
  • help tab

For reproducibility

options to activate / deactivate

Code diagnostics

highly recommended

Data types and structures

R base

Necessary R base

We could drop base R, but the tidyverse is built on top of it

Some base functions need to be known. And in R, everything that happens is a function call.
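As a small illustration of the "everything is a function" remark, the sketch below calls an operator and the subsetting bracket by name. The vector `x` is a hypothetical example value.

```r
# Infix operators are ordinary functions that can be called by name
`+`(3, 2)

# Subsetting with [ is a function call too
x <- c(10, 20, 30)   # hypothetical example vector
`[`(x, 2)

# Because operators are functions, they can be passed to other functions
plus_two <- sapply(1:3, `+`, 2)
plus_two
```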

Advice from David Robinson

I teach them X just to show them how much easier Y is

teaching programming is hard, don’t make it harder

Getting started

Let’s get ready to use R and RStudio

Do the following

  • Open up RStudio
  • Maximize the RStudio window
  • Click the Console pane, at the prompt (>) type in 3 + 2 and hit enter
> 3 + 2

4 main types

mode()

Type                   Example
numeric                integer (2L), double (2.34)
character (strings)    "tidyverse!"
logical (boolean)      TRUE / FALSE
complex                2+0i

in the console

2L
[1] 2
typeof(2L)
[1] "integer"
mode(2L)
[1] "numeric"
2.34
[1] 2.34
typeof(2.34)
[1] "double"
"tidyverse!"
[1] "tidyverse!"
TRUE
[1] TRUE
2+0i
[1] 2+0i
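The difference between mode(), typeof() and class() for the four main types can be summarised in one go. This is only a sketch: the `values` list and the `types` table are names introduced here for illustration.

```r
# One example value per type shown in the console above
values <- list(2L, 2.34, "tidyverse!", TRUE, 2+0i)

# Compare the three ways of asking "what is this?"
types <- data.frame(
  typeof = sapply(values, typeof),
  mode   = sapply(values, mode),
  class  = sapply(values, class))
types
```

Integers and doubles share the mode "numeric", which is why the slide groups them; typeof() is the one that tells them apart.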

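Special cases

NA            # not available: missing data (logical by default)
NA_real_      # typed missing values, one per atomic type
NA_integer_
NA_character_
NA_complex_
NULL          # the empty object, of length 0
-Inf; Inf     # infinite values
NaN           # Not a Number, e.g. 0 / 0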
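A short sketch of how these special values behave in practice, using a hypothetical vector `x` with one missing value:

```r
x <- c(1.5, NA, 3)         # hypothetical vector with one missing value

is.na(x)                   # FALSE TRUE FALSE
mean(x)                    # NA: missing values propagate
mean(x, na.rm = TRUE)      # 2.25 once the NA is dropped

typeof(NA)                 # "logical"
typeof(NA_real_)           # "double"

length(NULL)               # 0
1 / 0                      # Inf
0 / 0                      # NaN
is.na(NaN)                 # TRUE: NaN also counts as missing
NA == NA                   # NA: use is.na() to test for missing values
```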

Structures

Vectors

c() is the function to concatenate (combine) values

4
c(43, 5.6, 2.90)
[1] 4
[1] 43.0  5.6  2.9

Factors

convert strings to factors (categories); the levels are the dictionary of allowed values

factor(c("AA", "BB", "AA", "CC"))
[1] AA BB AA CC
Levels: AA BB CC
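To see the "dictionary" idea, the following sketch looks inside the same factor. The name `f` is introduced here for illustration.

```r
f <- factor(c("AA", "BB", "AA", "CC"))

levels(f)       # the dictionary: "AA" "BB" "CC"
nlevels(f)      # 3
as.integer(f)   # the stored codes: 1 2 1 3
table(f)        # counts per level
```

Each value is stored as an integer code that points into the levels, which is why as.integer() returns positions rather than the original strings.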

Lists

very important as it can contain anything

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4L)
$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4
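Elements of a list can be pulled out by name or by position. A minimal sketch with the same list, stored under the hypothetical name `my_list`:

```r
my_list <- list(f = factor(c("AA", "AA")),
                v = c(43, 5.6, 2.90),
                s = 4L)

my_list$v         # the element itself, by name
my_list[["v"]]    # same, with double brackets
my_list[[2]]      # same, by position
my_list["v"]      # single brackets return a list of length 1
str(my_list)      # compact overview of the whole structure
```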

Data frames are special lists

data.frame

same as a list, but all elements must have the same length

Example, 3 elements of same size

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))
   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4

Example, missing one element in v

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6),
  s = rep(4, 3))
Error in data.frame(f = factor(c("AA", "AA", "BB")), v = c(43, 5.6), s = rep(4, : arguments imply differing number of rows: 3, 2
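The two examples above can be explored a little further. The sketch below checks that a data frame really is a list and captures the error instead of stopping; the names `df`, `df_ok` and `msg` are introduced here for illustration.

```r
df <- data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))

is.list(df)         # TRUE: a data frame is a list of columns
length(df)          # 3, the number of columns
dim(df)             # 3 rows, 3 columns
sapply(df, class)   # one class per column

# A length-1 value is recycled to the number of rows
df_ok <- data.frame(f = c("AA", "AA", "BB"), s = 4)
nrow(df_ok)

# Lengths 3 and 2 cannot be reconciled; capture the error message
msg <- tryCatch(
  data.frame(f = c("AA", "AA", "BB"), v = c(43, 5.6)),
  error = function(e) conditionMessage(e))
msg
```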

Concatenate atomic elements

i.e. build a vector

collection of simple things

  • things are the smallest elements: atomic
  • must all be of the same type: otherwise automatic coercion applies
  • indexed, from 1 to length(vector)
  • created with the c() function
c(2, TRUE, "a string")
[1] "2"        "TRUE"     "a string"
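Automatic coercion always moves towards the most general type present: logical, then integer, then double, then character. A few console checks as a sketch:

```r
typeof(c(TRUE, 2L))      # "integer": logical becomes integer
typeof(c(2L, 2.34))      # "double": integer becomes double
typeof(c(2.34, "a"))     # "character": everything becomes a string

c(2, TRUE)               # 2 1: TRUE is coerced to 1
sum(c(TRUE, FALSE, TRUE))  # 2: handy for counting TRUE values

# Explicit coercion; values that cannot be converted become NA
# (as.numeric() would also emit a warning here, silenced for the example)
converted <- suppressWarnings(as.numeric(c("3.5", "x")))
converted
```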

assignment operator, create object

the operator is <-; it associates a name with an object

my_vec <- c(3, 4, 1:3)
my_vec
[1] 3 4 1 2 3

Tip

RStudio has the built-in shortcut Alt+- for <-

hierarchy

source: H. Wickham - R for data science, licence CC

in console

is.vector(c("a", "c"))
[1] TRUE
mode(c("a", "c"))
[1] "character"
is.vector(list(a = 1))
[1] TRUE
is.atomic(list(a = 1))
[1] FALSE
is.data.frame(list(a = 1))
[1] FALSE

Vectors

subsetting

important

Unlike Python or Perl, R vectors use 1-based indexing!

: operator

generate integer sequence

3:10
[1]  3  4  5  6  7  8  9 10

How to extract > 1 element

select elements from position 3 to 10:

LETTERS[3:10]
[1] "C" "D" "E" "F" "G" "H" "I" "J"

break in sequence

LETTERS[c(2:5, 7)]
[1] "B" "C" "D" "E" "G"

negative selection

LETTERS[-c(2:21)]
[1] "A" "V" "W" "X" "Y" "Z"
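Besides positions, a vector can be subset with a logical condition, and indexing outside the vector does not raise an error. A minimal sketch with a hypothetical numeric vector `x`:

```r
x <- c(5, 12, 8, 20)   # hypothetical example vector

x[x > 7]       # logical subsetting: 12 8 20
which(x > 7)   # positions where the condition holds: 2 3 4
x[10]          # NA: index past the end of the vector
x[0]           # numeric(0): index 0 selects nothing
LETTERS[27]    # NA: LETTERS has only 26 elements
```

The silent NA for an out-of-range index is worth remembering when coming from languages that throw an error in that case.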

Vectorized operation

one of the best R features

my_vec <- 10:18
my_vec
[1] 10 11 12 13 14 15 16 17 18
my_vec + 2
[1] 12 13 14 15 16 17 18 19 20

warning

  • R recycles vectors that are too short
  • without any warning when the longer length is a multiple of the shorter one:
1:10 + c(1, 2)
 [1]  2  4  4  6  6  8  8 10 10 12
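Recycling is only silent when the lengths divide evenly. The sketch below contrasts the two situations; `res_silent` and `res_warn` are names introduced for illustration, and the warning is captured with tryCatch() so that the example keeps running.

```r
# 10 is a multiple of 2: recycled silently
res_silent <- 1:10 + c(1, 2)
res_silent

# 10 is not a multiple of 3: R still recycles, but emits a warning.
# Here the warning message is captured instead of the result.
res_warn <- tryCatch(1:10 + 1:3,
                     warning = function(w) conditionMessage(w))
res_warn
```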

avoid writing loops

(if you do need a loop, pre-allocate the result rather than growing a vector)

res <- vector(mode = "numeric", length = length(my_vec))
for (i in seq_along(my_vec)) {
  res[i] <- my_vec[i] + 2
}
res
[1] 12 13 14 15 16 17 18 19 20
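For comparison, the same result can be obtained without an explicit loop. This sketch puts the pre-allocated loop, the vectorised form and a vapply() call side by side; the names `res_loop`, `res_vec` and `res_apply` are introduced here for illustration.

```r
my_vec <- 10:18

# Pre-allocated loop, as above
res_loop <- vector(mode = "numeric", length = length(my_vec))
for (i in seq_along(my_vec)) {
  res_loop[i] <- my_vec[i] + 2
}

# Vectorised: one expression, no loop
res_vec <- my_vec + 2

# Functional style: vapply() also checks the type of each result
res_apply <- vapply(my_vec, function(x) x + 2, numeric(1))

identical(res_loop, res_vec)
identical(res_vec, res_apply)
```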

Tidyverse

packages in processes

Tidyverse criticism

jobs

Personal complaints

  • still young, so it changes quickly and drastically
  • Backward compatibility is not always maintained.
  • tibbles are nice but a lot of non-tidyverse packages require matrices. rownames still an issue.

No need for opposition base / tidyverse

Learning the tidyverse does not prevent you from learning base R; it helps you get things done early in the process

Community complaints

Practical Session

datasauRus & furrr

Wrap up

You learned about:

  • Introduction
    • R
    • RStudio
    • tidyverse rationale
  • data types
    • main categories
    • coercion
  • data structures
    • main categories
    • subsetting
    • vectorization

Acknowledgements