Helpful data wranging functions with tidyR & dplyR

May 11th, 2017

Who am I?

5th year PhD student in Ecology & Evolutionary Biology
- Dissertation work: patterns of microbial diversity in freshwater lakes.
R enthusiast/user since 2014.
- Exactly one year ago today, I taught an RMarkdown workshop.
Graduate Teaching Consultant (with specialty in instructional technology) at the Center for Research on Learning and Teaching (CRLT)
Software Carpentry Instructor
FEMMES Executive board member (Females Excelling More in Math Engineering in Science)
Rock climber
Once I swam from Asia to Europe

Who am I not?

An expert on tidy data
An expert R programmer

What to Expect

Using datasets from the EDAWR package¹…

What's tidy data?²
Helpful tidyR verbs³
Pipes from the magrittr package⁴
- Send piped output to ggplot2⁵
Useful dplyR⁶ verbs

Raise your hand if over the last year you have worked with data …

daily?
A few times a week?
A few times a month?

Raise your hand if you work in the …

humanities?
social sciences?
STEM (science, technology, engineering, math)?

Why use tidyR & dplyR?

50-80% of data analysis is spent on cleaning and preparing data²
- These packages were developed to make data cleaning/preparation/reshaping more efficient and easy
Avoid "metadata hell" and being a culprit on #otherpeoplesdata

Prepare your R environment

library(tidyr)      # Let's organize our data!
library(dplyr)      # Let's manipulate our data!
library(EDAWR)      # Some data examples
library(ggplot2)    # For plots

`tidyR` Efficient data reshaping

Tidy Data

Each variable is saved in its own column.
Each observation is saved in its own row.
Each "type" of observational unit is a single table.

Easy access to variables & Automattically preseve observations

Tidy Data

Easy access to variables & Automattically preseve observations

head(storms, n = 2)

## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27

Characteristics of messy data²

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

Part 1: Is this data tidy?

cases # Tuberculosis cases from 3 countries

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

What variables are in this data?

Part 1: Is this data tidy?

cases # Tuberculosis cases from 3 countries

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

No, column headers are values, not variable names.

3 variable names should be:
1. country
2. year
3. count or n

How to make this data tidy?

How to make the data tidy? With `gather()`

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

gather(data, "year", "cases", 2:3)

How to make the data tidy? With `gather()`

gather() collapses multiple columns into 2 columns:
1. Key: Column with former column names. E.g. "year"
2. Value: Column with values in former column cells. E.g. "n"
3. Indices: Which columns to gather

gather(data, "key", "value", indices)

gather(cases, "year", "n", 2:4)

##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000

Part 2: Is this data tidy?

pollution # particulate load from 3 cities

##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

What variables are in this data?

Part 2: Is this data tidy?

pollution # particulate load from 3 cities

##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

No, multiple variables are stored in one column.

3 variable names should be:
1. city
2. Amount of large particles
3. Amount of small particles

How to make the data tidy? With `spread()`

##       city  size amount
## 1 New York large     23
## 2 New York small     14

spread(data, "key", "value")

How to make the data tidy? With `spread()`

spread() creates multiple columns from 2 columns:
1. Unique values in the key column become a column name. E.g. "size"
2. Each value in the value column becomes a cell in new columns. E.g. "amount"

spread(data, "key", "value")

spread(pollution, "size", "amount")

##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14

Rename columns with `rename()`

head(pollution, n = 3)

##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22

rename(pollution, concentration = amount)

##       city  size concentration
## 1 New York large            23
## 2 New York small            14
## 3   London large            22
## 4   London small            16
## 5  Beijing large           121
## 6  Beijing small            56

2 helpful `tidyR` verbs

gather() vs spread()

the pipe operator: %>%

Pipes: %>%

R's copy of the unix pipe: |
Will directly move the output of one function to another function
Reduces the number of variables needed
Increases code readability

Pipes: %>%

gather(cases, "year", "n", 2:4)

cases %>%                    # Dataframe in use
  gather("year", "n", 2:4)   # Spread the data into columns: size and amount

##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000

Also known as: cases %>% gather("year", "n", 2:4)

Pipe into a plot & build ggplot layers with +

cases %>%                       # Dataframe in use
  gather("year", "n", 2:4) %>%  # Spread the data into columns: size and amount
  ggplot(aes(x = country, y = n, color = country, fill = country)) +     # Plot the data 
    geom_boxplot(alpha = 0.5) + geom_jitter(size = 2.5) +theme_bw()

`dplyR` Efficient data transformation

Helpful `dplyR` functions

Accessing information

Extract existing variables: select()
Extract existing observations: filter()
Calculate/derive new variables: mutate()
summarise() multiple values into one value.
group_by() groups factors together

Joins

left_join(), right_join(), anti_join(), inner_join(), and more

3 ways to Pick columns with `select()`

List the column names you do want

storms %>%  select(storm, wind)   # Pick only storm and wind columns
storms %>% select(storm:wind)     # Pick storm to wind columns

List the column names you don't want

storms %>% select(-pressure)            # Remove one column, pressure 
storms %>% select(-c(pressure, date))   # Remove multiple columns, pressure and date

Select the column indices

storms %>% select(1:2)      # Pick only columns 1 & 2

Helper functions with select

Select columns that:

contains(): Are based on a character string
starts_with() or ends_with(): Contain part of a character string
one_of(): Names are in a group of names
matches(): Name matches a regular expression

storms %>% select(contains("s")) %>% head(n = 3)

## # A tibble: 3 × 2
##     storm pressure
##     <chr>    <int>
## 1 Alberto     1007
## 2    Alex     1009
## 3 Allison     1005

Pick rows with `filter()`

storms %>% filter(wind >= 50) # Wind speeds equal to or greater than 50 mph

## # A tibble: 3 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11

storms %>% filter(storm %in% c("Ana", "Alberto")) # Pick Ana & Alberto

## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2     Ana    40     1013 1997-06-30

Create new variables with `mutate()`

head(storms, n = 2)

## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27

Calculate the ratio of pressure to wind

storms %>% mutate(ratio = pressure/wind) %>% head(n = 2)

## # A tibble: 2 × 5
##     storm  wind pressure       date     ratio
##     <chr> <int>    <int>     <date>     <dbl>
## 1 Alberto   110     1007 2000-08-03  9.154545
## 2    Alex    45     1009 1998-07-27 22.422222

Create new variables with `mutate()`

storms %>% 
  mutate(ratio = pressure/wind) %>%     # Calculate the ratio of pressure to wind
  ggplot(aes(x = date, y = ratio)) + geom_line(size = 1.5, color = "grey") + 
  theme_bw() + geom_point(size = 4, color = "cornflowerblue")

`summarise()` multiple values to one value

head(storms, n = 1)

## # A tibble: 1 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03

storms %>% 
  summarise(mean_wind = mean(wind),           # Mean wind of storms
            max_wind = max(wind),             # Max wind of storms
            median_pres = mean(pressure),     # Mean pressure of storms
            n = n())                          # Number of values measured

## # A tibble: 1 × 4
##   mean_wind max_wind median_pres     n
##       <dbl>    <int>       <dbl> <int>
## 1  59.16667      110        1009     6

Sort values with `arrange()`

Sort in ascending order

storms %>% arrange(wind)

## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2    Alex    45     1009 1998-07-27
## 3  Arthur    45     1010 1996-06-17
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03

Sort values with `arrange()`

Sort in descending order

storms %>% arrange(desc(wind))

## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11
## 4    Alex    45     1009 1998-07-27
## 5  Arthur    45     1010 1996-06-17
## 6     Ana    40     1013 1997-06-30

Sort values with `arrange()`

Sort first by wind, then date

storms %>% arrange(wind, date)

## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2  Arthur    45     1010 1996-06-17
## 3    Alex    45     1009 1998-07-27
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03

`group_by()`

pollution %>% 
  group_by(city)    # Group by city

`group_by()`

Efficient group statistics with group

pollution %>% 
  group_by(city) %>%
  summarise(mean_particles = mean(amount))

## # A tibble: 3 × 2
##       city mean_particles
##      <chr>          <dbl>
## 1  Beijing           88.5
## 2   London           19.0
## 3 New York           18.5

Joins

Joining data sets together

songs

##                  song  name
## 1 Across the Universe  John
## 2       Come Together  John
## 3      Hello, Goodbye  Paul
## 4           Peggy Sue Buddy

artists

##     name  plays
## 1 George  sitar
## 2   John guitar
## 3   Paul   bass
## 4  Ringo  drums

Join 2 datasets together

right_join(a, b, by = "V1"): Join matching rows from a to b
left_join(a, b, by = "V1"): Join matching rows from b to a

left_join(songs, artists, by = "name")

Join 2 datasets together

full_join(a, b, by = "V1"): Join data and retain everything
inner_join(a, b, by = "V1"): Join data but only keep rows in both sets

inner_join(songs, artists, by = "name")

Filtering joins

semi_join(a, b, by = "V1"): Keep all rows that have a match in b
anti_join(a, b, by = "V1"): Keep all rows that do not have a match in b

anti_join(songs, artists, by = "name")

Resources

These slides
Data wrangingling cheatsheet with tidyr & dplyr
Garrett Grolemund Data Wrangling with R slides
- Note: Many of the images from my presentation are from Garrett's presentation above.

Questions?

References

1. Grolemund’. EDAWR: Expert data analysis with r. (2017).

2. Wickham, H. Tidy data. The Journal of Statistical Software 59, (2014).

3. Wickham, H. Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions. (2017).

4. Bache, S. M. & Wickham, H. Magrittr: A forward-pipe operator for r. (2014).

5. Wickham, H. & Chang, W. Ggplot2: Create elegant data visualisations using the grammar of graphics. (2016).

6. Wickham, H. & Francois, R. Dplyr: A grammar of data manipulation. (2016).

Who am I?

Who am I not?

What to Expect

Raise your hand if over the last year you have worked with data …

Raise your hand if you work in the …

Why use tidyR & dplyR?

Prepare your R environment

tidyR Efficient data reshaping

Tidy Data

Easy access to variables & Automattically preseve observations

Tidy Data

Easy access to variables & Automattically preseve observations

Characteristics of messy data2

Part 1: Is this data tidy?

Part 1: Is this data tidy?

How to make this data tidy?

How to make the data tidy? With gather()

How to make the data tidy? With gather()

Part 2: Is this data tidy?

Part 2: Is this data tidy?

How to make the data tidy? With spread()

How to make the data tidy? With spread()

Rename columns with rename()

2 helpful tidyR verbs

the pipe operator: %>%

Pipes: %>%

Pipes: %>%

Pipe into a plot & build ggplot layers with +

dplyR Efficient data transformation

Helpful dplyR functions

Accessing information

Joins

3 ways to Pick columns with select()

Helper functions with select

Pick rows with filter()

Create new variables with mutate()

Create new variables with mutate()

summarise() multiple values to one value

Sort values with arrange()

Sort values with arrange()

Sort values with arrange()

group_by()

group_by()

Joins

Joining data sets together

Join 2 datasets together

Join 2 datasets together

Filtering joins

Resources

Questions?

References

`tidyR` Efficient data reshaping

Characteristics of messy data²

How to make the data tidy? With `gather()`

How to make the data tidy? With `gather()`

How to make the data tidy? With `spread()`

How to make the data tidy? With `spread()`

Rename columns with `rename()`

2 helpful `tidyR` verbs

`dplyR` Efficient data transformation

Helpful `dplyR` functions

3 ways to Pick columns with `select()`

Pick rows with `filter()`

Create new variables with `mutate()`

Create new variables with `mutate()`

`summarise()` multiple values to one value

Sort values with `arrange()`

Sort values with `arrange()`

Sort values with `arrange()`

`group_by()`

`group_by()`