Topics covered in this homework include:
2+2
## [1] 4
The code below clears memory and then loads dplyr and the soccer data.
In lab, we calculated the rate of red cards for players of different skin tone. Now, we are going to see how robust our conclusions were to some of the choices that we made in the analysis. In particular, it is important to know that in soccer there are actually two ways to get a red card: a direct red card and getting two yellow cards (which equals one red card).
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games)
## Source: local data frame [9 x 4]
##
## skin.color total.reds total.games red.rate
## 1 0.000 381 99636 0.003823919
## 2 0.125 198 54137 0.003657388
## 3 0.250 535 120574 0.004437109
## 4 0.375 83 15936 0.005208333
## 5 0.500 125 25415 0.004918355
## 6 0.625 15 6493 0.002310180
## 7 0.750 100 20800 0.004807692
## 8 0.875 41 8134 0.005040570
## 9 1.000 111 21942 0.005058791
ggplot2
yet.) Create a table like the one above but where the outcome of interest is rate of all forms of red cards (direct red cards + two yellow cards). The column red.cards is direct red cards and the column yellow.reds is the red cards that result from two yellow cards.soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
mutate(all.reds = red.cards + yellow.reds) %>%
group_by(skin.color) %>%
summarise(total.all.reds = sum(all.reds), total.games = sum(games)) %>%
mutate(red.rate=total.all.reds / total.games)
## Source: local data frame [9 x 4]
##
## skin.color total.all.reds total.games red.rate
## 1 0.000 761 99636 0.007637802
## 2 0.125 393 54137 0.007259361
## 3 0.250 1039 120574 0.008617115
## 4 0.375 174 15936 0.010918675
## 5 0.500 221 25415 0.008695652
## 6 0.625 38 6493 0.005852456
## 7 0.750 179 20800 0.008605769
## 8 0.875 73 8134 0.008974674
## 9 1.000 214 21942 0.009752985
It does not seem to make a difference if you use only direct red cards or all types of red cards.
Imagine that you presented these results at ASA, and an audience member speculated the relationship between skin tone and red cards would be different in the different soccer leagues.
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(league.country, skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games)
## Source: local data frame [36 x 5]
## Groups: league.country
##
## league.country skin.color total.reds total.games red.rate
## 1 England 0.000 118 30788 0.0038326621
## 2 England 0.125 53 17736 0.0029882724
## 3 England 0.250 101 29823 0.0033866479
## 4 England 0.375 3 2058 0.0014577259
## 5 England 0.500 53 10849 0.0048852429
## 6 England 0.625 2 2355 0.0008492569
## 7 England 0.750 40 7032 0.0056882821
## 8 England 0.875 16 3188 0.0050188206
## 9 England 1.000 41 10384 0.0039483821
## 10 France 0.000 52 9696 0.0053630363
## .. ... ... ... ... ...
# it is really hard to see much in that table, we can also make a graph
library(ggplot2)
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(league.country, skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games) %>%
ggplot( aes(x=skin.color, y=red.rate, color=league.country, size=total.games)) + geom_point() + stat_smooth(method = "loess", se=FALSE)
The overall rate of red cards is different in the different leagues, but the relationship between skin color and red cards seems pretty similar.
Just to show you that this all works with different data, you will now do some analysis with the Gapminder data, as currated and cleaned by Jenny Bryan.
## Observations: 1704
## Variables:
## $ country (fctr) Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent (fctr) Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year (dbl) 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp (dbl) 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop (dbl) 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap (dbl) 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
## Source: local data frame [6 x 6]
##
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## Source: local data frame [6 x 6]
##
## country continent year lifeExp pop gdpPercap
## 1 Zimbabwe Africa 1982 60.363 7636524 788.8550
## 2 Zimbabwe Africa 1987 62.351 9216418 706.1573
## 3 Zimbabwe Africa 1992 60.377 10704340 693.4208
## 4 Zimbabwe Africa 1997 46.809 11404948 792.4500
## 5 Zimbabwe Africa 2002 39.989 11926563 672.0386
## 6 Zimbabwe Africa 2007 43.487 12311143 469.7093
No. Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."
The continent that each country belongs to should only be stored once. As is, it could get out of sync.
Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."
The continent that each country belongs to should only be stored once. As is, it could get out of sync.
A better way to create a new table that shows which continent each country is in.
No. The continent that each country belongs to should only be stored once. As is, it could get out of sync.
Yes. It makes it easy to filter by continent. The best way to do this is to store the data in third normal form and then do joins duing the analysis stage. We'll learn more about joins later in the semester.
gapminder %>%
group_by(continent, year) %>%
summarise(mean(gdpPercap))
## Source: local data frame [60 x 3]
## Groups: continent
##
## continent year mean(gdpPercap)
## 1 Africa 1952 1252.572
## 2 Africa 1957 1385.236
## 3 Africa 1962 1598.079
## 4 Africa 1967 2050.364
## 5 Africa 1972 2339.616
## 6 Africa 1977 2585.939
## 7 Africa 1982 2481.593
## 8 Africa 1987 2282.669
## 9 Africa 1992 2281.810
## 10 Africa 1997 2378.760
## .. ... ... ...
gapminder %>%
filter(continent=="Africa", year==1952) %>%
arrange(desc(gdpPercap)) %>%
select(country, gdpPercap)
## Source: local data frame [52 x 2]
##
## country gdpPercap
## 1 South Africa 4725.296
## 2 Gabon 4293.476
## 3 Angola 3520.610
## 4 Reunion 2718.885
## 5 Djibouti 2669.529
## 6 Algeria 2449.008
## 7 Namibia 2423.780
## 8 Libya 2387.548
## 9 Congo, Rep. 2125.621
## 10 Mauritius 1967.956
## .. ... ...
gapminder %>%
mutate(gdp = gdpPercap * pop) %>%
arrange(desc(gdp)) %>%
select(country, year, gdp)
## Source: local data frame [1,704 x 3]
##
## country year gdp
## 1 United States 2007 1.293446e+13
## 2 United States 2002 1.124728e+13
## 3 United States 1997 9.761353e+12
## 4 United States 1992 8.221624e+12
## 5 United States 1987 7.256026e+12
## 6 China 2007 6.539501e+12
## 7 United States 1982 5.806915e+12
## 8 United States 1977 5.301732e+12
## 9 United States 1972 4.577000e+12
## 10 Japan 2007 4.035135e+12
## .. ... ... ...
gapminder %>%
filter(year=="2007") %>%
group_by(continent) %>%
summarise(life.exp.var = var(lifeExp))
## Source: local data frame [5 x 2]
##
## continent life.exp.var
## 1 Africa 92.7519364
## 2 Americas 19.7220159
## 3 Asia 63.4209074
## 4 Europe 8.8792835
## 5 Oceania 0.5314805
dplyr
verbs: filter
, arrange
, select
, mutate
, and summarise
. Then, write a query to answer it.[your text here, optional]
# your code here; optional
R
comes with the dataset ldeaths
, which records the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979. To see the data type ldeaths
. For more information type “?ldeaths”
no
There are three variables: year, month, and count. It is not the case that each row is an observation and each column is a variable.
R
comes with the dataset mtcars
, which shows fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models), as taken from the 1974 Motor Trend US magazine. To see the data type mtcars
. For more information type “?mtcars”
yes
Each row is an observation, and each column is a variable. One way that it could be better would be to have the name of the car as a column, instead of as a rowname.
R
comes with the dataset quakes
, which shows 1000 seismic events near Fiji. To see the data type quakes
. For more information type “?quakes”
yes
Each row is an observation and each column is a variable. One way that it could be better would be to have some kind of unique identifier for each earthquake.
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 dplyr_0.4.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.8
## [5] evaluate_0.5.5 formatR_1.0 grid_3.1.2 gtable_0.1.2
## [9] htmltools_0.2.6 knitr_1.9 labeling_0.3 lazyeval_0.1.10
## [13] magrittr_1.5 MASS_7.3-40 munsell_0.4.2 parallel_3.1.2
## [17] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.4 reshape2_1.4.1
## [21] rmarkdown_0.5.1 scales_0.2.4 stringr_0.6.2 tools_3.1.2
## [25] yaml_2.1.13