May 11th, 2017

Who am I?

  • 5th year PhD student in Ecology & Evolutionary Biology
    • Dissertation work: patterns of microbial diversity in freshwater lakes.
  • R enthusiast/user since 2014.
  • Graduate Teaching Consultant (with specialty in instructional technology) at the Center for Research on Learning and Teaching (CRLT)
  • Software Carpentry Instructor
  • FEMMES Executive board member (Females Excelling More in Math Engineering in Science)
  • Rock climber
  • Once I swam from Asia to Europe

Who am I not?

  • An expert on tidy data
  • An expert R programmer

What to Expect

Using datasets from the EDAWR package1

  • What's tidy data?2
  • Helpful tidyR verbs3
  • Pipes from the magrittr package4
    - Send piped output to ggplot25
  • Useful dplyR6 verbs

Raise your hand if over the last year you have worked with data …

  • daily?
  • A few times a week?
  • A few times a month?

Raise your hand if you work in the …

  • humanities?
  • social sciences?
  • STEM (science, technology, engineering, math)?

Why use tidyR & dplyR?

  • 50-80% of data analysis is spent on cleaning and preparing data2
    • These packages were developed to make data cleaning/preparation/reshaping more efficient and easy
  • Avoid "metadata hell" and being a culprit on #otherpeoplesdata

Prepare your R environment

library(tidyr)      # Let's organize our data!
library(dplyr)      # Let's manipulate our data!
library(EDAWR)      # Some data examples
library(ggplot2)    # For plots

tidyR Efficient data reshaping

Tidy Data

  1. Each variable is saved in its own column.
  2. Each observation is saved in its own row.
  3. Each "type" of observational unit is a single table.

Easy access to variables & Automattically preseve observations

Tidy Data

Easy access to variables & Automattically preseve observations

head(storms, n = 2)
## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27

Characteristics of messy data2

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Part 1: Is this data tidy?

cases # Tuberculosis cases from 3 countries
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

What variables are in this data?

Part 1: Is this data tidy?

cases # Tuberculosis cases from 3 countries
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

No, column headers are values, not variable names.

3 variable names should be:
1. country
2. year
3. count or n

How to make this data tidy?

How to make the data tidy? With gather()

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

gather(data, "year", "cases", 2:3)

How to make the data tidy? With gather()

gather() collapses multiple columns into 2 columns:
1. Key: Column with former column names. E.g. "year"
2. Value: Column with values in former column cells. E.g. "n"
3. Indices: Which columns to gather

gather(data, "key", "value", indices)

gather(cases, "year", "n", 2:4)
##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000

Part 2: Is this data tidy?

pollution # particulate load from 3 cities
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

What variables are in this data?

Part 2: Is this data tidy?

pollution # particulate load from 3 cities
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

No, multiple variables are stored in one column.

3 variable names should be:
1. city
2. Amount of large particles
3. Amount of small particles

How to make the data tidy? With spread()

##       city  size amount
## 1 New York large     23
## 2 New York small     14

spread(data, "key", "value")

How to make the data tidy? With spread()

spread() creates multiple columns from 2 columns:
1. Unique values in the key column become a column name. E.g. "size"
2. Each value in the value column becomes a cell in new columns. E.g. "amount"

spread(data, "key", "value")

spread(pollution, "size", "amount")
##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14

Rename columns with rename()

head(pollution, n = 3)
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
rename(pollution, concentration = amount) 
##       city  size concentration
## 1 New York large            23
## 2 New York small            14
## 3   London large            22
## 4   London small            16
## 5  Beijing large           121
## 6  Beijing small            56

2 helpful tidyR verbs

gather() vs spread()

the pipe operator: %>%

Pipes: %>%

  • R's copy of the unix pipe: |
  • Will directly move the output of one function to another function
  • Reduces the number of variables needed
  • Increases code readability

Pipes: %>%

gather(cases, "year", "n", 2:4)

cases %>%                    # Dataframe in use
  gather("year", "n", 2:4)   # Spread the data into columns: size and amount
##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000

Also known as: cases %>% gather("year", "n", 2:4)

Pipe into a plot & build ggplot layers with +

cases %>%                       # Dataframe in use
  gather("year", "n", 2:4) %>%  # Spread the data into columns: size and amount
  ggplot(aes(x = country, y = n, color = country, fill = country)) +     # Plot the data 
    geom_boxplot(alpha = 0.5) + geom_jitter(size = 2.5) +theme_bw()

dplyR Efficient data transformation

Helpful dplyR functions

Accessing information

  1. Extract existing variables: select()
  2. Extract existing observations: filter()
  3. Calculate/derive new variables: mutate()
  4. summarise() multiple values into one value.
  5. group_by() groups factors together

Joins

left_join(), right_join(), anti_join(), inner_join(), and more

3 ways to Pick columns with select()

  • List the column names you do want
storms %>%  select(storm, wind)   # Pick only storm and wind columns
storms %>% select(storm:wind)     # Pick storm to wind columns
  • List the column names you don't want
storms %>% select(-pressure)            # Remove one column, pressure 
storms %>% select(-c(pressure, date))   # Remove multiple columns, pressure and date 
  • Select the column indices
storms %>% select(1:2)      # Pick only columns 1 & 2 

Helper functions with select

Select columns that:

  • contains(): Are based on a character string
  • starts_with() or ends_with(): Contain part of a character string
  • one_of(): Names are in a group of names
  • matches(): Name matches a regular expression
storms %>% select(contains("s")) %>% head(n = 3)
## # A tibble: 3 × 2
##     storm pressure
##     <chr>    <int>
## 1 Alberto     1007
## 2    Alex     1009
## 3 Allison     1005

Pick rows with filter()

storms %>% filter(wind >= 50) # Wind speeds equal to or greater than 50 mph
## # A tibble: 3 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11
storms %>% filter(storm %in% c("Ana", "Alberto")) # Pick Ana & Alberto
## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2     Ana    40     1013 1997-06-30

Create new variables with mutate()

head(storms, n = 2)
## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27

Calculate the ratio of pressure to wind

storms %>% mutate(ratio = pressure/wind) %>% head(n = 2)
## # A tibble: 2 × 5
##     storm  wind pressure       date     ratio
##     <chr> <int>    <int>     <date>     <dbl>
## 1 Alberto   110     1007 2000-08-03  9.154545
## 2    Alex    45     1009 1998-07-27 22.422222

Create new variables with mutate()

storms %>% 
  mutate(ratio = pressure/wind) %>%     # Calculate the ratio of pressure to wind
  ggplot(aes(x = date, y = ratio)) + geom_line(size = 1.5, color = "grey") + 
  theme_bw() + geom_point(size = 4, color = "cornflowerblue")

summarise() multiple values to one value

head(storms, n = 1)
## # A tibble: 1 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
storms %>% 
  summarise(mean_wind = mean(wind),           # Mean wind of storms
            max_wind = max(wind),             # Max wind of storms
            median_pres = mean(pressure),     # Mean pressure of storms
            n = n())                          # Number of values measured
## # A tibble: 1 × 4
##   mean_wind max_wind median_pres     n
##       <dbl>    <int>       <dbl> <int>
## 1  59.16667      110        1009     6

Sort values with arrange()

Sort in ascending order

storms %>% arrange(wind)
## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2    Alex    45     1009 1998-07-27
## 3  Arthur    45     1010 1996-06-17
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03

Sort values with arrange()

Sort in descending order

storms %>% arrange(desc(wind))  
## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11
## 4    Alex    45     1009 1998-07-27
## 5  Arthur    45     1010 1996-06-17
## 6     Ana    40     1013 1997-06-30

Sort values with arrange()

Sort first by wind, then date

storms %>% arrange(wind, date)    
## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2  Arthur    45     1010 1996-06-17
## 3    Alex    45     1009 1998-07-27
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03

group_by()

pollution %>% 
  group_by(city)    # Group by city 

group_by()

Efficient group statistics with group

pollution %>% 
  group_by(city) %>%
  summarise(mean_particles = mean(amount))
## # A tibble: 3 × 2
##       city mean_particles
##      <chr>          <dbl>
## 1  Beijing           88.5
## 2   London           19.0
## 3 New York           18.5

Joins

Joining data sets together

songs
##                  song  name
## 1 Across the Universe  John
## 2       Come Together  John
## 3      Hello, Goodbye  Paul
## 4           Peggy Sue Buddy
artists
##     name  plays
## 1 George  sitar
## 2   John guitar
## 3   Paul   bass
## 4  Ringo  drums

Join 2 datasets together

  • right_join(a, b, by = "V1"): Join matching rows from a to b
  • left_join(a, b, by = "V1"): Join matching rows from b to a
left_join(songs, artists, by = "name")

Join 2 datasets together

  • full_join(a, b, by = "V1"): Join data and retain everything
  • inner_join(a, b, by = "V1"): Join data but only keep rows in both sets
inner_join(songs, artists, by = "name")

Filtering joins

  • semi_join(a, b, by = "V1"): Keep all rows that have a match in b
  • anti_join(a, b, by = "V1"): Keep all rows that do not have a match in b
anti_join(songs, artists, by = "name")

Resources

Questions?

References

1. Grolemund’. EDAWR: Expert data analysis with r. (2017).

2. Wickham, H. Tidy data. The Journal of Statistical Software 59, (2014).

3. Wickham, H. Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions. (2017).

4. Bache, S. M. & Wickham, H. Magrittr: A forward-pipe operator for r. (2014).

5. Wickham, H. & Chang, W. Ggplot2: Create elegant data visualisations using the grammar of graphics. (2016).

6. Wickham, H. & Francois, R. Dplyr: A grammar of data manipulation. (2016).