May 11th, 2017

## Who am I?

• 5th year PhD student in Ecology & Evolutionary Biology
• Dissertation work: patterns of microbial diversity in freshwater lakes.
• R enthusiast/user since 2014.
• Graduate Teaching Consultant (with specialty in instructional technology) at the Center for Research on Learning and Teaching (CRLT)
• Software Carpentry Instructor
• FEMMES Executive board member (Females Excelling More in Math Engineering in Science)
• Rock climber
• Once I swam from Asia to Europe

## Who am I not?

• An expert on tidy data
• An expert R programmer

## What to Expect

Using datasets from the EDAWR package1

• What's tidy data?2
• Helpful `tidyR` verbs3
• Pipes from the `magrittr` package4
- Send piped output to `ggplot2`5
• Useful `dplyR`6 verbs

## Raise your hand if over the last year you have worked with data …

• daily?
• A few times a week?
• A few times a month?

## Raise your hand if you work in the …

• humanities?
• social sciences?
• STEM (science, technology, engineering, math)?

## Why use tidyR & dplyR?

• 50-80% of data analysis is spent on cleaning and preparing data2
• These packages were developed to make data cleaning/preparation/reshaping more efficient and easy
• Avoid "metadata hell" and being a culprit on #otherpeoplesdata

```library(tidyr)      # Let's organize our data!
library(dplyr)      # Let's manipulate our data!
library(EDAWR)      # Some data examples
library(ggplot2)    # For plots```

## Tidy Data

1. Each variable is saved in its own column.
2. Each observation is saved in its own row.
3. Each "type" of observational unit is a single table.

## Tidy Data

`head(storms, n = 2)`
```## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27```

## Characteristics of messy data2

1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

## Part 1: Is this data tidy?

`cases # Tuberculosis cases from 3 countries`
```##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000```

What variables are in this data?

## Part 1: Is this data tidy?

`cases # Tuberculosis cases from 3 countries`
```##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000```

No, column headers are values, not variable names.

3 variable names should be:
1. `country`
2. `year`
3. `count` or `n`

## How to make the data tidy? With `gather()`

```##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000```

`gather(data, "year", "cases", 2:3)`

## How to make the data tidy? With `gather()`

`gather()` collapses multiple columns into 2 columns:
1. Key: Column with former column names. E.g. "year"
2. Value: Column with values in former column cells. E.g. "n"
3. Indices: Which columns to gather

`gather(data, "key", "value", indices)`

`gather(cases, "year", "n", 2:4)`
```##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000```

## Part 2: Is this data tidy?

`pollution # particulate load from 3 cities`
```##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56```

What variables are in this data?

## Part 2: Is this data tidy?

`pollution # particulate load from 3 cities`
```##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56```

No, multiple variables are stored in one column.

3 variable names should be:
1. `city`
2. Amount of `large` particles
3. Amount of `small` particles

## How to make the data tidy? With `spread()`

```##       city  size amount
## 1 New York large     23
## 2 New York small     14```

`spread(data, "key", "value")`

## How to make the data tidy? With `spread()`

`spread()` creates multiple columns from 2 columns:
1. Unique values in the key column become a column name. E.g. "size"
2. Each value in the value column becomes a cell in new columns. E.g. "amount"

`spread(data, "key", "value")`

`spread(pollution, "size", "amount")`
```##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14```

## Rename columns with `rename()`

`head(pollution, n = 3)`
```##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22```
`rename(pollution, concentration = amount) `
```##       city  size concentration
## 1 New York large            23
## 2 New York small            14
## 3   London large            22
## 4   London small            16
## 5  Beijing large           121
## 6  Beijing small            56```

## 2 helpful `tidyR` verbs

`gather()` vs `spread()`

## Pipes: %>%

• R's copy of the unix pipe: |
• Will directly move the output of one function to another function
• Reduces the number of variables needed

## Pipes: %>%

`gather(cases, "year", "n", 2:4)`

```cases %>%                    # Dataframe in use
gather("year", "n", 2:4)   # Spread the data into columns: size and amount```
```##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000```

Also known as: `cases %>% gather("year", "n", 2:4)`

## Pipe into a plot & build ggplot layers with +

```cases %>%                       # Dataframe in use
gather("year", "n", 2:4) %>%  # Spread the data into columns: size and amount
ggplot(aes(x = country, y = n, color = country, fill = country)) +     # Plot the data
geom_boxplot(alpha = 0.5) + geom_jitter(size = 2.5) +theme_bw()```

## Helpful `dplyR` functions

### Accessing information

1. Extract existing variables: `select()`
2. Extract existing observations: `filter()`
3. Calculate/derive new variables: `mutate()`
4. `summarise()` multiple values into one value.
5. `group_by()` groups factors together

### Joins

`left_join()`, `right_join()`, `anti_join()`, `inner_join()`, and more

## 3 ways to Pick columns with `select()`

• List the column names you do want
```storms %>%  select(storm, wind)   # Pick only storm and wind columns
storms %>% select(storm:wind)     # Pick storm to wind columns```
• List the column names you don't want
```storms %>% select(-pressure)            # Remove one column, pressure
storms %>% select(-c(pressure, date))   # Remove multiple columns, pressure and date ```
• Select the column indices
`storms %>% select(1:2)      # Pick only columns 1 & 2 `

## Helper functions with select

Select columns that:

• `contains()`: Are based on a character string
• `starts_with()` or `ends_with()`: Contain part of a character string
• `one_of()`: Names are in a group of names
• `matches()`: Name matches a regular expression
`storms %>% select(contains("s")) %>% head(n = 3)`
```## # A tibble: 3 × 2
##     storm pressure
##     <chr>    <int>
## 1 Alberto     1007
## 2    Alex     1009
## 3 Allison     1005```

## Pick rows with `filter()`

`storms %>% filter(wind >= 50) # Wind speeds equal to or greater than 50 mph`
```## # A tibble: 3 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11```
`storms %>% filter(storm %in% c("Ana", "Alberto")) # Pick Ana & Alberto`
```## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2     Ana    40     1013 1997-06-30```

## Create new variables with `mutate()`

`head(storms, n = 2)`
```## # A tibble: 2 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2    Alex    45     1009 1998-07-27```

Calculate the ratio of pressure to wind

`storms %>% mutate(ratio = pressure/wind) %>% head(n = 2)`
```## # A tibble: 2 × 5
##     storm  wind pressure       date     ratio
##     <chr> <int>    <int>     <date>     <dbl>
## 1 Alberto   110     1007 2000-08-03  9.154545
## 2    Alex    45     1009 1998-07-27 22.422222```

## Create new variables with `mutate()`

```storms %>%
mutate(ratio = pressure/wind) %>%     # Calculate the ratio of pressure to wind
ggplot(aes(x = date, y = ratio)) + geom_line(size = 1.5, color = "grey") +
theme_bw() + geom_point(size = 4, color = "cornflowerblue")```

## `summarise()` multiple values to one value

`head(storms, n = 1)`
```## # A tibble: 1 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03```
```storms %>%
summarise(mean_wind = mean(wind),           # Mean wind of storms
max_wind = max(wind),             # Max wind of storms
median_pres = mean(pressure),     # Mean pressure of storms
n = n())                          # Number of values measured```
```## # A tibble: 1 × 4
##   mean_wind max_wind median_pres     n
##       <dbl>    <int>       <dbl> <int>
## 1  59.16667      110        1009     6```

## Sort values with `arrange()`

Sort in ascending order

`storms %>% arrange(wind)`
```## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2    Alex    45     1009 1998-07-27
## 3  Arthur    45     1010 1996-06-17
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03```

## Sort values with `arrange()`

Sort in descending order

`storms %>% arrange(desc(wind))  `
```## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1 Alberto   110     1007 2000-08-03
## 2 Allison    65     1005 1995-06-03
## 3  Arlene    50     1010 1999-06-11
## 4    Alex    45     1009 1998-07-27
## 5  Arthur    45     1010 1996-06-17
## 6     Ana    40     1013 1997-06-30```

## Sort values with `arrange()`

Sort first by wind, then date

`storms %>% arrange(wind, date)    `
```## # A tibble: 6 × 4
##     storm  wind pressure       date
##     <chr> <int>    <int>     <date>
## 1     Ana    40     1013 1997-06-30
## 2  Arthur    45     1010 1996-06-17
## 3    Alex    45     1009 1998-07-27
## 4  Arlene    50     1010 1999-06-11
## 5 Allison    65     1005 1995-06-03
## 6 Alberto   110     1007 2000-08-03```

## `group_by()`

```pollution %>%
group_by(city)    # Group by city ```

## `group_by()`

Efficient group statistics with `group`

```pollution %>%
group_by(city) %>%
summarise(mean_particles = mean(amount))```
```## # A tibble: 3 × 2
##       city mean_particles
##      <chr>          <dbl>
## 1  Beijing           88.5
## 2   London           19.0
## 3 New York           18.5```

## Joining data sets together

`songs`
```##                  song  name
## 1 Across the Universe  John
## 2       Come Together  John
## 3      Hello, Goodbye  Paul
## 4           Peggy Sue Buddy```
`artists`
```##     name  plays
## 1 George  sitar
## 2   John guitar
## 3   Paul   bass
## 4  Ringo  drums```

## Join 2 datasets together

• `right_join(a, b, by = "V1")`: Join matching rows from a to b
• `left_join(a, b, by = "V1")`: Join matching rows from b to a
`left_join(songs, artists, by = "name")`