Assignment 3
Review
Intro to the general syntax for statistical modelling in R.
Specific examples using:
Normal linear regression
Logistic regression
Panel data
18 November 2016
Purposes: Pose an interesting research question and try to answer it using data analysis and standard academic practices. Effectively communicate your results to a variety of audiences in a variety of formats.
Deadline:
Presentation: In-class 2 December
Website/Paper: 16 December 2016
The project can be thought of as a 'dry run' for your thesis with multiple presentation outputs.
Presentation: 10 minutes maximum. Engagingly present your research question and key findings to a general academic audience (fellow students).
Paper: 5,000 words maximum. A standard academic paper, properly cited, laying out your research question, literature review, data, methods, and findings.
Website: An engaging website designed to convey your research to a general audience.
As always, you should submit one GitHub repository with all of the materials needed to completely reproduce your data gathering, analysis, and presentation documents.
Note: Because you've had two assignments already to work on parts of the project, I expect high quality work.
Find one other group to be a discussant for your presentation.
The discussants will provide a quick (maximum 2 minute) critique of your presentation, with ideas for things you can improve in your paper.
I will have normal office hours every week for the rest of the term except 9 December.
Please take advantage of this opportunity to improve your final project.
What is web scraping? What are some of the tools R has for web scraping?
What are regular expressions (give at least two examples)? Why are they useful?
What dplyr function can you use to create a new variable in a data frame by running a command on values from groups in that data frame?
What is the data-ink ratio? Why is it important for effective plotting?
Why should you avoid using the size of circles to convey the values of continuous variables?
Why not use red-green colour contrasts to indicate contrasting data?
How many decimal places should you report in a table and why?
Caveat: We are definitely not going to cover anywhere near R's full capabilities for statistical modelling.
We are also not going to cover all of the modelling concerns/diagnostics you need to consider when using a given model.
You will need to rely on your other stats courses and texts.
Discuss the basic syntax and capabilities in R for estimating normal linear and logistic regressions.
Basic model checking in R.
Discuss basic ways of interpreting and presenting results.
Most statistical models you will estimate are from a general class (the Generalised Linear Model) that has two parts:
The Stochastic Component (i.e. randomly determined) assumes the dependent variable \(Y_{i}\) is generated as a random draw from the probability density function:
\[ Y_{i} \sim f(\theta_{i},\: \alpha) \]
\(\theta_{i}\): parameter vector of the part of the function that varies between observations.
\(\alpha\): matrix of non-varying parameters.
Sometimes referred to as the 'error structure'.
The Systematic Component indicates how \(\theta_{i}\) varies across observations, depending on the values of the explanatory variables and (often) some constant:
\[ \theta_{i} = g(X_{i},\: \beta) \]
\(X_{i}\): a \(1\: \mathrm{x}\: k\) vector of explanatory variables.
\(\beta\): a \(k\: \mathrm{x}\: 1\) vector of parameters (i.e. coefficients).
\(g(.,.)\): the link function, specifying how the explanatory variables and parameters are translated into \(\theta_{i}\).
Today we will cover two variations of this general model:
linear-normal regression (i.e. ordinary least squares)
logit model
For continuous dependent variables, assume that \(Y_{i}\) is from the normal distribution (\(N(.,.)\)).
Set the main parameter vector \(\theta_{i}\) to the scalar mean: \(\theta_{i} = E(Y_{i}) = \mu_{i}\).
Assume the ancillary parameter matrix is the scalar homoskedastic variance: \(\alpha = V(Y_{i}) = \sigma^2\).
Set the systematic component to the linear form: \(g(X_{i},\: \beta) = X_{i}\beta = \beta_{0} + X_{i1}\beta_{1} + \ldots\).
So:
\[ Y_{i} \sim N(\mu_{i},\: \sigma^2), \:\:\: \mu_{i} = X_{i}\beta \]
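To make the two components concrete, here is a minimal sketch that simulates data from this model (the variable names and parameter values are made up purely for illustration):

# Simulate one draw from a linear-normal data generating process
set.seed(100)
n <- 100
x <- runif(n)                    # An explanatory variable
mu <- 2 + 3 * x                  # Systematic component: mu_i = X_i beta
y <- rnorm(n, mean = mu, sd = 1) # Stochastic component: Y_i ~ N(mu_i, sigma^2)

Running lm(y ~ x) on these simulated values should recover estimates close to the 'true' intercept (2) and slope (3).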
For binary data (e.g. 0, 1) we can assume that the stochastic component has a Bernoulli distribution.
The main parameter is \(\pi_{i} = \mathrm{Pr}(Y_{i} = 1)\).
The systematic component is set to a logistic form: \(\pi_{i} = \frac{1}{1 + e^{-X_{i}\beta}}\).
So:
\[ Y_{i} \sim \mathrm{Bernoulli}(\pi_{i}), \:\:\: \pi_{i} = \frac{1}{1 + e^{-X_{i}\beta}} \]
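A matching sketch for the logit model (again, names and parameter values are invented for illustration; plogis is R's logistic function \(1/(1 + e^{-x})\)):

# Simulate one draw from a Bernoulli/logit data generating process
set.seed(100)
n <- 100
x <- runif(n)
p <- plogis(-1 + 2 * x)             # Systematic component: pi_i via the logistic link
y <- rbinom(n, size = 1, prob = p)  # Stochastic component: Y_i ~ Bernoulli(pi_i)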
Error Family | Canonical link
------------ | --------------
Normal       | identity
Binomial     | logit
Poisson      | log
The general syntax for estimating statistical models in R is:
response variable ~ explanatory variable(s)
Where '~' reads 'is modelled as a function of'.
In the Generalised Linear Model context, either explicitly or implicitly:
response variable ~ explanatory variable(s), family = error family
We use model functions to specify the model structure.
Basic model functions include:
lm: fits a linear model where \(Y\) is assumed to be normally distributed with homoskedastic variance.
glm: allows the fitting of many Generalised Linear Models; lets you specify the error family (see the sketch after this list).
plm (package and function): panel data linear models.
pglm (package and function): panel data Generalised Linear Models.
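To illustrate how the formula syntax and the error family come together in a call to glm, here is a hedged sketch with a made-up data frame df (all data and object names are placeholders):

# A small fake data frame purely to illustrate the syntax
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y_cont <- rnorm(100)                       # Continuous response
df$y_bin <- rbinom(100, size = 1, prob = 0.5) # Binary response
df$y_count <- rpois(100, lambda = 2)          # Count response

M_norm  <- glm(y_cont ~ x1 + x2, data = df, family = gaussian)  # identity link
M_logit <- glm(y_bin ~ x1 + x2, data = df, family = binomial)   # logit link
M_pois  <- glm(y_count ~ x1 + x2, data = df, family = poisson)  # log link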
lm example: the Prestige data (based on http://www.princeton.edu/~otorres/Regression101R.pdf).
The observations are occupations and the dependent variable is a score of each occupation's prestige.
library(car)
data(Prestige)
car::scatterplotMatrix(Prestige)
Estimate simple model (education is in years):
M1 <- lm(prestige ~ education, data = Prestige)
summary(M1)
## 
## Call:
## lm(formula = prestige ~ education, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0397  -6.5228   0.6611   6.7430  18.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -10.732      3.677  -2.919  0.00434 ** 
## education      5.361      0.332  16.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 100 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:  0.72 
## F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16
Note: Always prefer interval estimates over point estimates.
Deal with your uncertainty!
About 95% of the time, the population parameter will be within about 2 standard errors of the point estimate.
Using the Central Limit Theorem (valid with roughly 50 or more observations, provided the data is not extremely skewed):
\[ \mathrm{CI}_{95} = \mathrm{point\: estimate} \pm 1.96 \times SE \]
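You can compute this by hand from the model output. A quick sketch using M1 from above (confint, shown next, uses the exact t distribution, so its bounds will differ slightly):

# Manual approximate 95% confidence interval for the education coefficient
est <- summary(M1)$coefficients['education', 'Estimate']
se <- summary(M1)$coefficients['education', 'Std. Error']
est + c(-1.96, 1.96) * se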
confint(M1)
##                  2.5 %    97.5 %
## (Intercept) -18.027220 -3.436744
## education     4.702223  6.019533
Estimate model with categorical (factor) variable:
M2 <- lm(prestige ~ education + type, data = Prestige)
summary(M2)
## 
## Call:
## lm(formula = prestige ~ education + type, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.410  -5.508   1.360   5.694  17.171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.6982     5.7361  -0.470   0.6392    
## education     4.5728     0.6716   6.809 9.16e-10 ***
## typeprof      6.1424     4.2590   1.442   0.1526    
## typewc       -5.4585     2.6907  -2.029   0.0453 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.814 on 94 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.791 
## F-statistic: 123.4 on 3 and 94 DF,  p-value: < 2.2e-16
Use the cut function to create a categorical (factor) variable from a continuous variable.
Prestige$income_cat <- cut(Prestige$income,
                           breaks = c(0, 4999, 9999, 14999, 30000),
                           labels = c('< 5,000', '< 10,000', '< 15,000',
                                      '>= 15,000'))

summary(Prestige$income_cat)
##   < 5,000  < 10,000  < 15,000 >= 15,000 
##        38        51         9         4
Note: cut excludes the left value and includes the right value, e.g. \((0,\: 4999]\).
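A quick illustration of this boundary behaviour with toy break points (use the right = FALSE argument to flip which side is included):

cut(c(5, 6), breaks = c(0, 5, 10))

## [1] (0,5]  (5,10]
## Levels: (0,5] (5,10]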
M3 <- lm(prestige ~ education + income_cat, data = Prestige)

confint(M3)
##                          2.5 %    97.5 %
## (Intercept)         -10.914031  3.182500
## education             3.350201  4.778444
## income_cat< 10,000    6.085375 13.030546
## income_cat< 15,000    9.584532 23.391854
## income_cat>= 15,000  12.040936 29.902733
Estimate models with polynomial transformations:
# Quadratic (second-order) polynomial transformation
M4 <- lm(prestige ~ education + poly(income, 2), data = Prestige)

confint(M4)
##                       2.5 %    97.5 %
## (Intercept)       -1.470988  13.33552
## education          3.132827   4.48515
## poly(income, 2)1  45.174019  81.39445
## poly(income, 2)2 -43.150740 -12.87994
Estimate models with (natural) logarithmic transformations:
# Natural log transformation
M5 <- lm(prestige ~ education + log(income), data = Prestige)
summary(M5)
## 
## Call:
## lm(formula = prestige ~ education + log(income), data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0346  -4.5657  -0.1857   4.0577  18.1270 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -95.1940    10.9979  -8.656 9.27e-14 ***
## education     4.0020     0.3115  12.846  < 2e-16 ***
## log(income)  11.4375     1.4371   7.959 2.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared:  0.831,  Adjusted R-squared:  0.8275 
## F-statistic: 243.3 on 2 and 99 DF,  p-value: < 2.2e-16
Estimate model with interactions:
M6 <- lm(prestige ~ education * type, data = Prestige)
summary(M6)
## 
## Call:
## lm(formula = prestige ~ education * type, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7095  -5.3938   0.8125   5.3968  16.1411 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.2936     8.6470  -0.497    0.621    
## education            4.7637     1.0247   4.649 1.11e-05 ***
## typeprof            18.8637    16.8881   1.117    0.267    
## typewc             -24.3833    21.7777  -1.120    0.266    
## education:typeprof  -0.9808     1.4495  -0.677    0.500    
## education:typewc     1.6709     2.0777   0.804    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.827 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8012, Adjusted R-squared:  0.7904 
## F-statistic: 74.14 on 5 and 92 DF,  p-value: < 2.2e-16
Use plot on a model object to run visual diagnostics.
plot(M2, which = 1)
Use plot with which = 2 (a normal Q-Q plot) to see if a model's errors are normally distributed.
plot(M2, which = 2)
glm example: simulated data on admission to grad school, from UCLA IDRE.
# Load data
URL <- 'http://www.ats.ucla.edu/stat/data/binary.csv'
Admission <- read.csv(URL)
car::scatterplotMatrix(Admission)
admit_table <- xtabs(~admit + rank, data = Admission)

admit_table
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
# Row proportions prop.table(admit_table, margin = 1)
##      rank
## admit          1          2          3          4
##     0 0.10256410 0.35531136 0.34065934 0.20146520
##     1 0.25984252 0.42519685 0.22047244 0.09448819
# Column proportions prop.table(admit_table, margin = 2)
##      rank
## admit         1         2         3         4
##     0 0.4590164 0.6423841 0.7685950 0.8208955
##     1 0.5409836 0.3576159 0.2314050 0.1791045
summary(admit_table)
## Call: xtabs(formula = ~admit + rank, data = Admission)
## Number of cases in table: 400 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 25.242, df = 3, p-value = 1.374e-05
Logit1 <- glm(admit ~ gre + gpa + as.factor(rank),
              data = Admission, family = 'binomial')
Note: the link function is assumed to be logit when family = 'binomial'.
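If you prefer to be explicit, or want a different link such as probit, you can pass the link to binomial directly. A minimal sketch (Logit1_explicit is an illustrative name):

Logit1_explicit <- glm(admit ~ gre + gpa + as.factor(rank),
                       data = Admission,
                       family = binomial(link = 'logit'))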
confint(Logit1)
##                          2.5 %       97.5 %
## (Intercept)      -6.2716202334 -1.792547080
## gre               0.0001375921  0.004435874
## gpa               0.1602959439  1.464142727
## as.factor(rank)2 -1.3008888002 -0.056745722
## as.factor(rank)3 -2.0276713127 -0.670372346
## as.factor(rank)4 -2.4000265384 -0.753542605
\(\beta\)'s in logistic regression are interpretable as log odds. These are difficult to interpret directly.
If we exponentiate log odds we get odds ratios.
exp(cbind(OddsRatio = coef(Logit1), confint(Logit1)))
## OddsRatio 2.5 % 97.5 % ## (Intercept) 0.0185001 0.001889165 0.1665354 ## gre 1.0022670 1.000137602 1.0044457 ## gpa 2.2345448 1.173858216 4.3238349 ## as.factor(rank)2 0.5089310 0.272289674 0.9448343 ## as.factor(rank)3 0.2617923 0.131641717 0.5115181 ## as.factor(rank)4 0.2119375 0.090715546 0.4706961
These are also not intuitive for most audiences.
What we really want are predicted probabilities.
First create a data frame of fitted values:
fitted <- with(Admission,
               data.frame(gre = mean(gre), gpa = mean(gpa),
                          rank = factor(1:4)))

fitted
##     gre    gpa rank
## 1 587.7 3.3899    1
## 2 587.7 3.3899    2
## 3 587.7 3.3899    3
## 4 587.7 3.3899    4
Second, predict probability point estimates for each fitted value:
fitted$predicted <- predict(Logit1, newdata = fitted, type = 'response')

fitted
##     gre    gpa rank predicted
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684
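Note: one quick way to attach uncertainty to these point estimates is to ask predict for standard errors on the link (log odds) scale and transform the bounds back with plogis. A sketch (the lower/upper column names are illustrative):

# Approximate 95% intervals for the predicted probabilities
pred_link <- predict(Logit1, newdata = fitted, type = 'link', se.fit = TRUE)
fitted$lower <- plogis(pred_link$fit - 1.96 * pred_link$se.fit)
fitted$upper <- plogis(pred_link$fit + 1.96 * pred_link$se.fit)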
King, Tomz, and Wittenberg (2000) argue that post-estimation simulations can be used to effectively communicate results from regression models.
Estimate the model to find point estimates \(\hat{\beta}_{1\ldots k}\) for the parameters.
Draw \(n\) values of the parameters from a multivariate normal distribution with means \(\hat{\beta}_{1\ldots k}\) and covariance given by the parameters' estimated covariance matrix.
Use the simulated values to calculate quantities of interest (e.g. predicted probabilities).
Plot the simulated distribution using visual weighting.
Post-estimation simulations allow us to effectively communicate our estimates and the uncertainty around them.
This method is broadly similar to a fully Bayesian approach with Markov chain Monte Carlo or to bootstrapping; they differ mainly in how the parameter draws are generated.
Find the coefficient estimates from an estimated model with coef.
Find the covariance matrix with vcov.
Draw point estimates from the multivariate normal distribution with mvrnorm (in the MASS package).
Calculate the quantity of interest with the draws and fitted values, then plot the results.
First estimate your model as normal and create fitted values:
library(car)   # Contains the Prestige data
library(dplyr) # Piping functions

M_prest <- lm(prestige ~ education + type, data = Prestige)

# Find the range of education values
range(Prestige$education)
## [1] 6.38 15.97
edu_range <- 6:16
Extract the point estimates (coefficients) and covariance matrix:
mp_coef <- matrix(coef(M_prest))
mp_vcov <- vcov(M_prest)
Now draw 1,000 simulations of your point estimates:
library(MASS)  # Contains the mvrnorm function

drawn <- mvrnorm(n = 1000, mu = mp_coef, Sigma = mp_vcov) %>% data.frame

head(drawn)[1:3, ]
##   X.Intercept. education typeprof    typewc
## 1    -1.963977  4.378190 8.449935 -2.748082
## 2     1.867675  4.157990 9.968692 -5.090633
## 3    -4.213891  4.840518 4.676530 -6.004704
Now we can add in our fitted values to the simulation data frame:
drawn_sim <- merge(drawn, edu_range)

# Rename the fitted value variable
drawn_sim <- dplyr::rename(drawn_sim, fitted_edu = y)

nrow(drawn)
## [1] 1000
nrow(drawn_sim)
## [1] 11000
Using the normal linear regression formula (\(\hat{y_{i}} = \hat{\alpha} + X_{i1}\hat{\beta_{1}} + \ldots\)) we can find the quantity of interest for white collar workers:
names(drawn_sim)
## [1] "X.Intercept." "education" "typeprof" "typewc" ## [5] "fitted_edu"
# QoI for white collar: intercept + education effect + white collar effect
drawn_sim$sim_wc <- drawn_sim[, 'X.Intercept.'] +
    drawn_sim[, 'education'] * drawn_sim[, 'fitted_edu'] +
    drawn_sim[, 'typewc']
library(ggplot2)

ggplot(drawn_sim, aes(fitted_edu, sim_wc)) +
    geom_point(alpha = 0.2) +
    stat_smooth(se = FALSE) +
    theme_bw()
# Find the 95% central interval and median at each fitted value of education
central <- drawn_sim %>% group_by(fitted_edu) %>%
    summarise(median_sim = median(sim_wc),
              lower_95 = quantile(sim_wc, probs = 0.025),
              upper_95 = quantile(sim_wc, probs = 0.975))
ggplot(central, aes(fitted_edu, median_sim)) +
    geom_ribbon(aes(ymin = lower_95, ymax = upper_95), alpha = 0.3) +
    geom_line() +
    theme_bw()
Use the same steps for simulating predicted outcomes from logistic regression models. The only difference is that the equation for the quantity of interest is:
\[ P(y_{i} = 1) = \frac{\mathrm{exp}(\hat{\alpha} + X_{i1}\hat{\beta_{1}} + \ldots)}{1 + \mathrm{exp}(\hat{\alpha} + X_{i1}\hat{\beta_{1}} + \ldots)} \]
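A minimal sketch of these steps applied to the Logit1 model from earlier (the fitted profile here, a rank 1 applicant with mean gre and gpa, and the object names are illustrative):

library(MASS)

# Draw 1,000 simulated coefficient vectors
logit_drawn <- mvrnorm(n = 1000, mu = coef(Logit1), Sigma = vcov(Logit1))

# Fitted profile: intercept, mean gre, mean gpa, rank dummies set to rank 1
x_profile <- c(1, mean(Admission$gre), mean(Admission$gpa), 0, 0, 0)

# plogis() applies the logistic function to each simulated linear predictor
sim_probs <- plogis(logit_drawn %*% x_profile)

# 95% central interval and median of the simulated predicted probabilities
quantile(sim_probs, probs = c(0.025, 0.5, 0.975))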
The Zelig package streamlines the simulation process.
Note: There appears to be a bug in Zelig version 5.0-13 when dealing with factor-level explanatory variables. So, use version 5.0-11:
devtools::install_version('Zelig', version = '5.0-11')
First estimate your regression model using zelig.
library(Zelig)

# Have to explicitly declare rank as a factor
Admission$rank <- as.factor(Admission$rank)

Z1 <- zelig(admit ~ gre + gpa + rank, cite = FALSE,
            data = Admission, model = 'logit')
Then set the fitted values with setx.
setZ1 <- setx(Z1, gre = 220:800)
And run the simulations (1,000 by default) with sim.
simZ1 <- sim(Z1, x = setZ1)
Plot:
ci.plot(simZ1)
Begin working on the statistical models for your project.
and/or
Out-of-lecture challenge: Estimate a normal linear regression model and plot predicted values across a range of fitted values. Bonus: do so with a measure of uncertainty.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 347–61.