Topics covered in this homework include:

Imagine that you’ve been urged by the teachers’ union to show that higher teacher pay leads to better education outcomes. Of course, you don’t do advocacy research — you are a seeker of truth — but you decide to investigate this question scientifically using data about SAT scores and other educational indicators at the state level. For now we can pretend that this is the only available data (it comes from John Fox’s website).

Read the data documentation and use the code below to load it.
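The loading code itself isn’t reproduced in this rendering. As a stand-in, here is a minimal sketch of a data frame with the same columns; the file path is hypothetical, and the two example rows are taken from the output shown further down (AL, AK):

```r
# Hypothetical stand-in; in the assignment you would read the real file, e.g.:
# data <- read.csv("states.csv")   # placeholder path
data <- data.frame(
  state          = c("AL", "AK"),
  region         = c("ESC", "PAC"),
  population     = c(4273, 607),
  sat.verbal     = c(565, 521),
  sat.math       = c(558, 513),
  percent.taking = c(8, 47),
  percent.no.hs  = c(33.1, 13.4),
  teacher.pay    = c(31.3, 49.6)
)
```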


Before doing anything else, just play with the data to get a sense of it.

  1. Make a scatter plot showing the relationship between average teacher pay and average SAT score at the state level. To do this you might have to create a new variable. Then overlay a regression line on the plot.
library(dplyr)
library(ggplot2)

# Total SAT score is the sum of the verbal and math components.
data <- mutate(data, sat.overall = sat.verbal + sat.math)

p <- ggplot(data, aes(x = teacher.pay, y = sat.overall))
p + geom_point() + stat_smooth(method = "lm")

  1. Fit a simple regression model to predict total SAT score based on teacher pay.
fit <- lm(sat.overall ~ teacher.pay, data = data)
library(broom)
tidy(fit)
##          term    estimate std.error statistic      p.value
## 1 (Intercept) 1234.292190 51.053726  24.17634 6.894231e-29
## 2 teacher.pay   -4.823482  1.401967  -3.44051 1.195738e-03
  1. Explain the substantive conclusion that you would draw from the scatter plot and regression analysis.
It looks like students get lower SAT scores in states with higher teacher pay.  Of course, correlation does not imply causation.

You took these results back to the teachers’ union and they didn’t believe them. You didn’t believe them either so you decided to investigate further.

  1. Use the augment function from the broom package to add the predicted values, residuals, and other diagnostics to the data.
data.aug <- augment(fit, data)
head(data.aug)
##   region population sat.verbal sat.math percent.taking percent.no.hs
## 1    ESC       4273        565      558              8          33.1
## 2    PAC        607        521      513             47          13.4
## 3    MTN       4428        525      521             28          21.3
## 4    WSC       2510        566      550              6          33.7
## 5    PAC      31878        495      511             45          23.8
## 6    MTN       3823        536      538             30          15.6
##   teacher.pay state sat.overall   .fitted   .se.fit    .resid       .hat
## 1        31.3    AL        1123 1083.3172 10.775316  39.68280 0.03047912
## 2        49.6    AK        1034  995.0475 21.074380  38.95252 0.11658751
## 3        32.5    AZ        1046 1077.5290  9.863304 -31.52902 0.02553802
## 4        29.3    AR        1116 1092.9642 12.651397  23.03584 0.04201645
## 5        43.1    CA        1006 1026.4001 13.299027 -20.40011 0.04642823
## 6        35.4    CO        1074 1063.5409  8.669867  10.45908 0.01973182
##     .sigma      .cooksd .std.resid
## 1 62.08811 0.0067020118  0.6529726
## 2 62.07245 0.0297514796  0.6714676
## 3 62.18938 0.0035090605 -0.5174868
## 4 62.26742 0.0031887733  0.3813258
## 5 62.28707 0.0027890278 -0.3384753
## 6 62.34137 0.0002948334  0.1711560
  1. Which states have the highest predicted SAT scores?
data.aug <- arrange(data.aug, desc(.fitted))
head(data.aug)
##   region population sat.verbal sat.math percent.taking percent.no.hs
## 1    WNC        732        574      566              5          22.9
## 2    WSC       4351        559      550              9          31.7
## 3    WNC        644        596      599              5          23.3
## 4    ESC       2716        569      557              4          35.7
## 5    WSC       3301        566      557              8          25.4
## 6    WSC       2510        566      550              6          33.7
##   teacher.pay state sat.overall  .fitted  .se.fit    .resid       .hat
## 1        26.3    SD        1140 1107.435 15.98331 32.565390 0.06706188
## 2        26.8    LA        1109 1105.023 15.39831  3.977131 0.06224269
## 3        27.0    ND        1195 1104.058 15.16706 90.941828 0.06038725
## 4        27.7    MS        1126 1100.682 14.37148 25.318265 0.05421826
## 5        28.4    OK        1123 1097.305 13.60020 25.694703 0.04855491
## 6        29.3    AR        1116 1092.964 12.65140 23.035836 0.04201645
##     .sigma      .cooksd .std.resid
## 1 62.16984 0.0107249442 0.54626210
## 2 62.35720 0.0001469466 0.06654201
## 3 60.87198 0.0742484176 1.52005896
## 4 62.24670 0.0050996927 0.42180302
## 5 62.24400 0.0046479913 0.42679855
## 6 62.26742 0.0031887733 0.38132578
  1. Does this make sense? (If you are not from the US, you might need to ask someone who is.) Explain.
No, it does not make sense.  I would not expect South Dakota, Louisiana, and North Dakota to have the highest predicted SAT scores.
  1. Now fit a regression model that predicts SAT score with teacher pay and percent of students taking the SAT.
fit <- lm(sat.overall ~ teacher.pay + percent.taking, data = data)
tidy(fit)
##             term    estimate  std.error  statistic      p.value
## 1    (Intercept) 1093.680373 29.2980158  37.329503 4.021705e-37
## 2    teacher.pay    1.567450  0.9184123   1.706695 9.434129e-02
## 3 percent.taking   -2.500972  0.2175266 -11.497312 2.165614e-15
  1. Explain how this changes your interpretation of the relationship between teacher pay and SAT scores.
States with the highest SAT scores had the fewest students taking the SAT.  Once percent taking was added to the model, the parameter estimates changed, and teacher pay now appears to be positively related to SAT scores.
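This sign flip is classic omitted-variable bias: a confounder (percent taking) drives both variables. A small simulation on purely synthetic numbers (none of these values come from the course data) shows how adjusting for such a confounder can reverse a coefficient’s sign:

```r
# Synthetic illustration of omitted-variable bias; all numbers are made up.
set.seed(1)
n      <- 500
taking <- runif(n, 5, 80)                      # confounder: % taking the test
pay    <- 25 + 0.2 * taking + rnorm(n)         # pay rises with participation
sat    <- 1100 + 2 * pay - 3 * taking + rnorm(n, sd = 10)

coef(lm(sat ~ pay))["pay"]           # simple slope: negative (confounded)
coef(lm(sat ~ pay + taking))["pay"]  # adjusted slope: positive, near the true +2
```

The same mechanism is at work in the SAT data: in states where few students take the test, the takers are a self-selected group who tend to score well.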

Occupational prestige

Now we are going to examine the relationship between typical education, income, job type, and occupational prestige using the data from Duncan. You can read the documentation here

Here’s some code to read in and clean the data. For the purposes of this assignment we are going to exclude professionals; in other words, we are only concerned with white collar and blue collar occupations. Again, notice that the unit of analysis here is the occupation.
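The read-and-clean code isn’t reproduced in this rendering. The Duncan data ships with the carData package (among other sources), so a sketch of the step might look like this — the carData dependency is an assumption, not something shown in the original:

```r
# Assumes the carData package is installed; the original loading code is not shown.
# Duncan: one row per occupation, with type (prof/wc/bc), income,
# education (% high-school graduates in 1950), and prestige.
data <- carData::Duncan

# Exclude professional occupations: keep only white collar and blue collar.
data <- subset(data, type != "prof")
data$type <- droplevels(data$type)
```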

  1. Run a regression model to predict the prestige of an occupation based on the level of education of people in that occupation (measured by the percentage of people in the field who have graduated from high school in 1950).
fit <- lm(prestige ~ education, data=data)
fit.df <- tidy(fit)
print(fit.df)
##          term  estimate std.error statistic     p.value
## 1 (Intercept) 9.8423134 6.1154709  1.609412 0.120082722
## 2   education 0.4797531 0.1601203  2.996204 0.006093486
  1. Make a plot showing the data and the model that you fit.
inter <- as.numeric(filter(fit.df, term=="(Intercept)") %>%
                    select(estimate))
slope <- as.numeric(filter(fit.df, term=="education") %>%
                    select(estimate))

p <- ggplot(data, aes(x=education, y=prestige))
p + geom_jitter() + 
  geom_abline(intercept = inter, slope = slope)

  1. Now run a regression model to predict the prestige of an occupation based on the level of education of people in that occupation (measured by the percentage of people in the field who have graduated from high school in 1950) and the occupation type (blue collar/white collar).
data <- mutate(data, bc = as.numeric(type=="bc"))
fit <- lm(prestige ~ education + bc, data=data)
fit.df <- tidy(fit)
print(fit.df)
##          term   estimate  std.error  statistic    p.value
## 1 (Intercept) -3.9676605 18.0368718 -0.2199750 0.82775099
## 2   education  0.6607208  0.2744876  2.4071063 0.02413129
## 3          bc  9.9913059 12.2655651  0.8145818 0.42332139
  1. Make a plot showing the data and the model that you fit.
inter.wc <- as.numeric(filter(fit.df, term=="(Intercept)") %>%
                    select(estimate))
inter.bc <- inter.wc + as.numeric(filter(fit.df, term=="bc") %>%
                                  select(estimate))

slope <- as.numeric(filter(fit.df, term=="education") %>%
                    select(estimate))

p <- ggplot(data, aes(x=education, y=prestige, colour=type))
p + geom_jitter() + 
  scale_color_manual(values = c("blue", "white")) +
  geom_abline(intercept = inter.bc, slope = slope, col="blue") + 
  geom_abline(intercept = inter.wc, slope = slope, col="white")

  1. Now run a regression model to predict occupational prestige based on the level of education and occupation type where the relationship between education and occupational prestige is allowed to vary by occupation type.
fit.2 <- lm(prestige ~ education + bc + bc:education, data=data)
tidy(fit.2)
##           term    estimate  std.error  statistic   p.value
## 1  (Intercept)  13.3646740 23.3880880  0.5714308 0.5732500
## 2    education   0.3788942  0.3661921  1.0346869 0.3115724
## 3           bc -16.2190248 25.7952166 -0.6287609 0.5357007
## 4 education:bc   0.6322738  0.5484924  1.1527485 0.2608540
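As an aside, R’s formula operator `*` expands `education * bc` into the same three terms (`education + bc + education:bc`), so either spelling fits the identical model. A quick check on synthetic data (made up here purely to demonstrate the equivalence):

```r
# Synthetic data just to demonstrate the formula equivalence.
set.seed(2)
d <- data.frame(education = runif(40, 0, 100), bc = rep(0:1, 20))
d$prestige <- 10 + 0.5 * d$education + 5 * d$bc +
  0.3 * d$education * d$bc + rnorm(40)

m1 <- lm(prestige ~ education + bc + education:bc, data = d)
m2 <- lm(prestige ~ education * bc, data = d)
all.equal(coef(m1), coef(m2))  # TRUE: same model, same coefficients
```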
  1. Make a plot showing the data and the model that you fit.
# Reuse the interaction model fit above.
fit.df <- tidy(fit.2)

inter.wc <- as.numeric(filter(fit.df, term=="(Intercept)") %>%
                         select(estimate))
inter.bc <- inter.wc + as.numeric(filter(fit.df, term=="bc") %>%
                                        select(estimate))
slope.wc <- as.numeric(filter(fit.df, term=="education") %>%
                      select(estimate))
slope.bc <- slope.wc + as.numeric(filter(fit.df, term=="education:bc") %>% 
                                          select(estimate))

p <- ggplot(data, aes(x=education, y=prestige, colour=type))
p + geom_jitter() + 
  scale_color_manual(values = c("blue", "white")) +
  geom_abline(intercept = inter.bc, slope = slope.bc, col="blue") + 
  geom_abline(intercept = inter.wc, slope = slope.wc, col="white")

  1. In words, how would you summarize the conclusions from the three models above?
In the simple model we see that occupations with higher levels of education are predicted to have higher prestige.  When we add a dummy variable for occupation type, blue collar jobs are predicted to have higher prestige than white collar jobs at a given level of education.  Finally, from the model with a dummy variable and an interaction, we estimate that predicted prestige is more sensitive to the level of education for blue collar occupations than for white collar occupations.  In other words, even this simple example is quite complicated.

The command below is helpful for debugging; please don’t change it.

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] broom_0.3.5   ggplot2_1.0.0 dplyr_0.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.8    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.2       gtable_0.1.2    
##  [9] htmltools_0.2.6  knitr_1.9        labeling_0.3     lazyeval_0.1.10 
## [13] magrittr_1.5     MASS_7.3-35      mnormt_1.5-1     munsell_0.4.2   
## [17] parallel_3.1.2   plyr_1.8.1       proto_0.3-10     psych_1.5.1     
## [21] Rcpp_0.11.4      reshape2_1.4.1   rmarkdown_0.5.1  scales_0.2.4    
## [25] stringr_0.6.2    tidyr_0.2.0      tools_3.1.2      yaml_2.1.13