Homework 1: Transforming data

Topics covered in this homework include:

dplyr and the five verbs
working with factors
third normal form
tidy data

Please calculate 2+2 in the space below.

2+2

## [1] 4

Set-up the soccer data

The code below clears memory and then loads dplyr and the soccer data.

A robustness check

In lab, we calculated the rate of red cards for players of different skin tone. Now, we are going to see how robust our conclusions were to some of the choices that we made in the analysis. In particular, it is important to know that in soccer there are actually two ways to get a red card: a direct red card and getting two yellow cards (which equals one red card).

Create a table like the one were made in lab where the outcome of interest is rate of direct red cards.

soccer.data %>%
  filter(!(is.na(rater1) | is.na(rater2))) %>%
  mutate(skin.color = (rater1 + rater2) / 2) %>%
  group_by(skin.color) %>% 
  summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
  mutate(red.rate=total.reds / total.games)

## Source: local data frame [9 x 4]
## 
##   skin.color total.reds total.games    red.rate
## 1      0.000        381       99636 0.003823919
## 2      0.125        198       54137 0.003657388
## 3      0.250        535      120574 0.004437109
## 4      0.375         83       15936 0.005208333
## 5      0.500        125       25415 0.004918355
## 6      0.625         15        6493 0.002310180
## 7      0.750        100       20800 0.004807692
## 8      0.875         41        8134 0.005040570
## 9      1.000        111       21942 0.005058791

Imagine that you submitted the table above in a paper (Of course, in a real paper you would create a graph, but we have not learned ggplot2 yet.) Create a table like the one above but where the outcome of interest is rate of all forms of red cards (direct red cards + two yellow cards). The column red.cards is direct red cards and the column yellow.reds is the red cards that result from two yellow cards.

soccer.data %>%
  filter(!(is.na(rater1) | is.na(rater2))) %>%
  mutate(skin.color = (rater1 + rater2) / 2) %>%
  mutate(all.reds = red.cards + yellow.reds) %>%
  group_by(skin.color) %>% 
  summarise(total.all.reds = sum(all.reds), total.games = sum(games)) %>%
  mutate(red.rate=total.all.reds / total.games)

## Source: local data frame [9 x 4]
## 
##   skin.color total.all.reds total.games    red.rate
## 1      0.000            761       99636 0.007637802
## 2      0.125            393       54137 0.007259361
## 3      0.250           1039      120574 0.008617115
## 4      0.375            174       15936 0.010918675
## 5      0.500            221       25415 0.008695652
## 6      0.625             38        6493 0.005852456
## 7      0.750            179       20800 0.008605769
## 8      0.875             73        8134 0.008974674
## 9      1.000            214       21942 0.009752985

In words, compare your answers in questions 2 and 3. Do this choice make a difference?

It does not seem to make a difference if you use only direct red cards or all types of red cards.

Looking at subsets of the data, by country

Imagine that you presented these results at ASA, and an audience member speculated the relationship between skin tone and red cards would be different in the different soccer leagues.

Create a table that shows, for each league, the rate of red cards by skin color. In this case, please use direct red cards (red.cards) as you outcome.

soccer.data %>%
  filter(!(is.na(rater1) | is.na(rater2))) %>%
  mutate(skin.color = (rater1 + rater2) / 2) %>%
  group_by(league.country, skin.color) %>%
  summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
  mutate(red.rate=total.reds / total.games)

## Source: local data frame [36 x 5]
## Groups: league.country
## 
##    league.country skin.color total.reds total.games     red.rate
## 1         England      0.000        118       30788 0.0038326621
## 2         England      0.125         53       17736 0.0029882724
## 3         England      0.250        101       29823 0.0033866479
## 4         England      0.375          3        2058 0.0014577259
## 5         England      0.500         53       10849 0.0048852429
## 6         England      0.625          2        2355 0.0008492569
## 7         England      0.750         40        7032 0.0056882821
## 8         England      0.875         16        3188 0.0050188206
## 9         England      1.000         41       10384 0.0039483821
## 10         France      0.000         52        9696 0.0053630363
## ..            ...        ...        ...         ...          ...

# it is really hard to see much in that table, we can also make a graph
library(ggplot2)
soccer.data %>%
  filter(!(is.na(rater1) | is.na(rater2))) %>%
  mutate(skin.color = (rater1 + rater2) / 2) %>%
  group_by(league.country, skin.color) %>%
  summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
  mutate(red.rate=total.reds / total.games) %>%
  ggplot( aes(x=skin.color, y=red.rate, color=league.country, size=total.games)) + geom_point() + stat_smooth(method = "loess", se=FALSE)

In words, what would you conclude from your response to 5?

The overall rate of red cards is different in the different leagues, but the relationship between skin color and red cards seems pretty similar.

Watch how this works with a different dataset: Gapminder

Just to show you that this all works with different data, you will now do some analysis with the Gapminder data, as currated and cleaned by Jenny Bryan.

## Observations: 1704
## Variables:
## $ country   (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      (dbl) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   (dbl) 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap (dbl) 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

## Source: local data frame [6 x 6]
## 
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

## Source: local data frame [6 x 6]
## 
##    country continent year lifeExp      pop gdpPercap
## 1 Zimbabwe    Africa 1982  60.363  7636524  788.8550
## 2 Zimbabwe    Africa 1987  62.351  9216418  706.1573
## 3 Zimbabwe    Africa 1992  60.377 10704340  693.4208
## 4 Zimbabwe    Africa 1997  46.809 11404948  792.4500
## 5 Zimbabwe    Africa 2002  39.989 11926563  672.0386
## 6 Zimbabwe    Africa 2007  43.487 12311143  469.7093

Is this data in third normal form?

No.  Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."

The continent that each country belongs to should only be stored once.  As is, it could get out of sync.

Explain:

Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."

The continent that each country belongs to should only be stored once.  As is, it could get out of sync.

A better way to create a new table that shows which continent each country is in.

Is this an optimal structure for data storage?

No.  The continent that each country belongs to should only be stored once.  As is, it could get out of sync.

Is this a sensible structure for data analysis?

Yes.  It makes it easy to filter by continent. The best way to do this is to store the data in third normal form and then do joins duing the analysis stage.  We'll learn more about joins later in the semester.

For each continent, show the mean GDP in each of the years in the data.

gapminder %>% 
  group_by(continent, year) %>%
  summarise(mean(gdpPercap))

## Source: local data frame [60 x 3]
## Groups: continent
## 
##    continent year mean(gdpPercap)
## 1     Africa 1952        1252.572
## 2     Africa 1957        1385.236
## 3     Africa 1962        1598.079
## 4     Africa 1967        2050.364
## 5     Africa 1972        2339.616
## 6     Africa 1977        2585.939
## 7     Africa 1982        2481.593
## 8     Africa 1987        2282.669
## 9     Africa 1992        2281.810
## 10    Africa 1997        2378.760
## ..       ...  ...             ...

Which country had the highest GDP per captia in Africa in 1952? Note you don’t need to produce a data.frame with a single country to answer this question. A data.frame with the appropriate countries sorted is enough.

gapminder %>%
  filter(continent=="Africa", year==1952) %>%
  arrange(desc(gdpPercap)) %>%
  select(country, gdpPercap)

## Source: local data frame [52 x 2]
## 
##         country gdpPercap
## 1  South Africa  4725.296
## 2         Gabon  4293.476
## 3        Angola  3520.610
## 4       Reunion  2718.885
## 5      Djibouti  2669.529
## 6       Algeria  2449.008
## 7       Namibia  2423.780
## 8         Libya  2387.548
## 9   Congo, Rep.  2125.621
## 10    Mauritius  1967.956
## ..          ...       ...

Which country had the highest GDP (not GDP per captia) in any year in the data? Note you don’t need to produce a single country to answer this question. A data.frame with the appropriate countries sorted is enough.

gapminder %>%
  mutate(gdp = gdpPercap * pop) %>%
  arrange(desc(gdp)) %>%
  select(country, year, gdp)

## Source: local data frame [1,704 x 3]
## 
##          country year          gdp
## 1  United States 2007 1.293446e+13
## 2  United States 2002 1.124728e+13
## 3  United States 1997 9.761353e+12
## 4  United States 1992 8.221624e+12
## 5  United States 1987 7.256026e+12
## 6          China 2007 6.539501e+12
## 7  United States 1982 5.806915e+12
## 8  United States 1977 5.301732e+12
## 9  United States 1972 4.577000e+12
## 10         Japan 2007 4.035135e+12
## ..           ...  ...          ...

Which continent had the most variation in life expectancy in 2007? Note you don’t need to produce a single country to answer this question. A data.frame with the appropriate countries sorted is enough.

gapminder %>%
  filter(year=="2007") %>%
  group_by(continent) %>%
  summarise(life.exp.var = var(lifeExp))

## Source: local data frame [5 x 2]
## 
##   continent life.exp.var
## 1    Africa   92.7519364
## 2  Americas   19.7220159
## 3      Asia   63.4209074
## 4    Europe    8.8792835
## 5   Oceania    0.5314805

Optional challenge: Create a question that will require you to use all 5 dplyr verbs: filter, arrange, select, mutate, and summarise. Then, write a query to answer it.

[your text here, optional]

Challenge problem: Now show the code to answer question 15.

# your code here; optional

More practice with data structures

R comes with the dataset ldeaths, which records the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979. To see the data type ldeaths. For more information type “?ldeaths”

Is this data tidy?

no

Explain

There are three variables: year, month, and count.  It is not the case that each row is an observation and each column is a variable.

R comes with the dataset mtcars, which shows fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models), as taken from the 1974 Motor Trend US magazine. To see the data type mtcars. For more information type “?mtcars”

Is this data tidy?

yes

Explain

Each row is an observation, and each column is a variable.  One way that it could be better would be to have the name of the car as a column, instead of as a rowname.

R comes with the dataset quakes, which shows 1000 seismic events near Fiji. To see the data type quakes. For more information type “?quakes”

Is this data tidy?

yes

Explain

Each row is an observation and each column is a variable.  One way that it could be better would be to have some kind of unique identifier for each earthquake.

The command below is helpful for debugging, please don’t change it

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0 dplyr_0.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.8    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.2       gtable_0.1.2    
##  [9] htmltools_0.2.6  knitr_1.9        labeling_0.3     lazyeval_0.1.10 
## [13] magrittr_1.5     MASS_7.3-40      munsell_0.4.2    parallel_3.1.2  
## [17] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.4      reshape2_1.4.1  
## [21] rmarkdown_0.5.1  scales_0.2.4     stringr_0.6.2    tools_3.1.2     
## [25] yaml_2.1.13