14 March 2016

Objectives for the class

  • Assignment 3

  • Review

  • Intro to the general syntax for statistical modelling in R.

  • Specific examples using:

    • Normal linear regression

    • Logistic regression

    • Panel data

Assignment 3

Purpose: Gather, clean, and analyse data

Deadline: TBD

You will submit a GitHub repo that:

  • Gathers web-based data from at least two sources. Cleans and merges the data so that it is ready for statistical analyses.

  • Conducts basic descriptive and inferential statistics with the data to address a relevant research question.

  • Briefly describes the results, including dynamically generated tables and figures.

  • Has a write-up of 1,500 words maximum that describes the data gathering and analysis and uses literate programming.

Review

  • What is web scraping? What are some of the tools R has for web scraping?

  • What are regular expressions (give at least two examples)? Why are they useful?

  • What dplyr function can you use to create a new variable in a data frame by running a command on values from groups in that data frame?

Statistical Modelling in R

Caveat: We are definitely not going to cover anywhere near R's full capabilities for statistical modelling.

We are also not going to cover all of the modelling concerns/diagnostics you need to consider when using a given model.

You will need to rely on your other stats courses and texts.

What are we going to do?

  • Discuss the basic syntax and capabilities in R for estimating normal linear and logistic regressions.

  • Cover basic model checking in R.

  • Discuss basic ways of interpreting results (we'll do more on this next week).

The basic model

Most statistical models you will estimate are from a general class (Generalised Linear Model) that has two parts:

Stochastic Component (i.e. randomly determined): assumes the dependent variable \(Y_{i}\) is generated as a random draw from the probability density function:

\[ Y_{i} \sim f(\theta_{i},\: \alpha) \]

  • \(\theta_{i}\): parameter vector of the part of the function that varies between observations.

  • \(\alpha\): matrix of non-varying parameters.

Sometimes referred to as the 'error structure'.

The basic model

The Systematic Component indicates how \(\theta_{i}\) varies across observations, depending on the values of the explanatory variables and (often) some constant:

\[ \theta_{i} = g(X_{i},\: \beta) \]

  • \(X_{i}\): a \(1 \times k\) vector of explanatory variables.

  • \(\beta\): a \(k \times 1\) vector of parameters (i.e. coefficients).

  • \(g(.,.)\): the link function, specifying how the explanatory variables and parameters are translated into \(\theta_{i}\).

Today

Today we will cover two variations of this general model:

  • linear-normal regression (i.e. ordinary least squares)

  • logit model

Linear-normal regression

For continuous dependent variables, assume that \(Y_{i}\) is from the normal distribution (\(N(.,.)\)).

Set the main parameter \(\theta_{i}\) to the scalar mean: \(\theta_{i} = E(Y_{i}) = \mu_{i}\).

  • Scalar: a single real number (in R terms: a vector of length 1)

Assume the ancillary parameter matrix is the scalar homoskedastic variance: \(\alpha = V(Y_{i}) = \sigma^2\).

  • Homoskedastic variance: variance does not depend on the value of \(x\). The standard deviation of the error terms is constant across values of \(x\).

Set the systematic component to the linear form: \(g(X_{i},\: \beta) = X_{i}\beta = \beta_{0} + X_{i1}\beta_{1} + \ldots\).

Linear-normal regression

So:

\[ Y_{i} \sim N(\mu_{i},\: \sigma^2), \:\:\: \mu_{i} = X_{i}\beta \]
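To make the two components concrete, here is a minimal simulation sketch (all variable names and parameter values are made up for illustration):

set.seed(1234)
n <- 100
x <- runif(n, 0, 10)              # one explanatory variable
mu <- 2 + 3 * x                   # systematic component: mu_i = X_i * beta
y <- rnorm(n, mean = mu, sd = 4)  # stochastic component: Y_i ~ N(mu_i, sigma^2)

coef(lm(y ~ x))                   # roughly recovers beta_0 = 2, beta_1 = 3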

Logit regression

For binary data (i.e. 0/1 outcomes) we can assume that the stochastic component has a Bernoulli distribution.

The main parameter is \(\pi_{i} = \mathrm{Pr}(Y_{i} = 1)\).

The systematic component is set to a logistic form: \(\pi_{i} = \frac{1}{1 + e^{-X_{i}\beta}}\).

So:

\[ Y_{i} \sim \mathrm{Bernoulli}(\pi_{i}), \:\:\: \pi_{i} = \frac{1}{1 + e^{-X_{i}\beta}} \]
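A matching simulation sketch for the logit model (again with made-up parameter values); plogis is R's inverse logit:

set.seed(1234)
n <- 1000
x <- rnorm(n)
pi_i <- plogis(-1 + 2 * x)             # systematic component: inverse logit of X_i * beta
y <- rbinom(n, size = 1, prob = pi_i)  # stochastic component: Bernoulli draws

coef(glm(y ~ x, family = 'binomial'))  # roughly recovers -1 and 2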

Example error structure families and link functions
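A few common pairings, matching the default link functions of R's glm (not an exhaustive list):

  • gaussian: identity link

  • binomial: logit link

  • poisson: log link

  • Gamma: inverse link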

R syntax

The general syntax for estimating statistical models in R is:

response variable ~ explanatory variable(s)

Where '~' reads 'is modelled as a function of'.

In the Generalised Linear Model context, either explicitly or implicitly:

response variable ~ explanatory variable(s), family = error family
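For example, lm is shorthand for glm with a gaussian error family and identity link. A quick equivalence check (assuming the Prestige data used in the examples below is loaded):

lm(prestige ~ education, data = Prestige)

glm(prestige ~ education, data = Prestige,
    family = gaussian(link = 'identity'))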

Model functions

We use model functions to specify the model structure.

Basic model functions include:

  • lm: fits a linear model where \(Y\) is assumed to be normally distributed with homoskedastic variance.

  • glm: fits many Generalised Linear Models; lets you specify the family and the link function.

  • plm (package and function): linear models for panel data.

  • pglm (package and function): Generalised Linear Models for panel data (see the sketch below).
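A minimal plm sketch, using the Grunfeld investment data that ships with the plm package (the variable choice here is purely illustrative):

library(plm)

data('Grunfeld', package = 'plm')

# model = 'within' requests the fixed effects estimator
M_panel <- plm(inv ~ value + capital, data = Grunfeld,
               index = c('firm', 'year'), model = 'within')

summary(M_panel)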

Example of lm

Examine a scatterplot matrix of the data:

# The Prestige occupation data ships with the car package
library(car)
data(Prestige)

car::scatterplotMatrix(Prestige)

Example of lm

Estimate simple model (education is in years):

M1 <- lm(prestige ~ education, data = Prestige)

summary(M1)
## 
## Call:
## lm(formula = prestige ~ education, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0397  -6.5228   0.6611   6.7430  18.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -10.732      3.677  -2.919  0.00434 ** 
## education      5.361      0.332  16.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 100 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:   0.72 
## F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16

Confidence intervals of parameter point estimates

Note: Always prefer interval estimates over point estimates.

Deal with your uncertainty!

About 95% of the time the population parameter will be within about 2 standard errors of the point estimate.

Using the Central Limit Theorem (with at least about 50 observations and data that are not extremely skewed):

\[ \mathrm{CI}_{95} = \mathrm{point\: estimate} \pm 1.96 \times SE \]
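A sketch of this computation by hand for the education coefficient in M1 (confint, used next, relies on the t distribution, so its answer will differ slightly):

est <- coef(M1)['education']
se <- sqrt(diag(vcov(M1)))['education']
est + c(-1.96, 1.96) * se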

Confidence intervals of parameter point estimates

confint(M1)
##                  2.5 %    97.5 %
## (Intercept) -18.027220 -3.436744
## education     4.702223  6.019533

Example of lm

Estimate model with categorical (factor) variable:

M2 <- lm(prestige ~ education + type, 
         data = Prestige)

summary(M2)
## 
## Call:
## lm(formula = prestige ~ education + type, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.410  -5.508   1.360   5.694  17.171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.6982     5.7361  -0.470   0.6392    
## education     4.5728     0.6716   6.809 9.16e-10 ***
## typeprof      6.1424     4.2590   1.442   0.1526    
## typewc       -5.4585     2.6907  -2.029   0.0453 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.814 on 94 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.791 
## F-statistic: 123.4 on 3 and 94 DF,  p-value: < 2.2e-16

Create categorical variable from continuous variable

Use the cut function to create a categorical (factor) variable from a continuous variable.

Prestige$income_cat <- cut(Prestige$income,
                breaks = c(0, 4999, 9999, 14999, 30000),
                labels = c('< 5,000', '< 10,000', '< 15,000',
                            '>= 15,000'))
summary(Prestige$income_cat)
##   < 5,000  < 10,000  < 15,000 >= 15,000 
##        38        51         9         4

Note: cut excludes the left value and includes the right value, e.g. \((0,\: 4999]\).
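A quick check of that boundary behaviour:

# 4999 falls in the first interval, 5000 in the second
cut(c(4999, 5000), breaks = c(0, 4999, 9999),
    labels = c('< 5,000', '< 10,000'))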

Example of lm

M3 <- lm(prestige ~ education + income_cat, 
         data = Prestige)
confint(M3)
##                          2.5 %    97.5 %
## (Intercept)         -10.914031  3.182500
## education             3.350201  4.778444
## income_cat< 10,000    6.085375 13.030546
## income_cat< 15,000    9.584532 23.391854
## income_cat>= 15,000  12.040936 29.902733

Example of lm

Estimate models with polynomial transformations:

# Second order (quadratic) polynomial transformation
M4  <- lm(prestige ~ education + poly(income, 2), 
          data = Prestige)
confint(M4)
##                       2.5 %    97.5 %
## (Intercept)       -1.470988  13.33552
## education          3.132827   4.48515
## poly(income, 2)1  45.174019  81.39445
## poly(income, 2)2 -43.150740 -12.87994

Example of lm

Estimate models with (natural) logarithmic transformations:

# Natural logarithmic transformation
M5  <- lm(prestige ~ education + log(income), 
          data = Prestige)

summary(M5)
## 
## Call:
## lm(formula = prestige ~ education + log(income), data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0346  -4.5657  -0.1857   4.0577  18.1270 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -95.1940    10.9979  -8.656 9.27e-14 ***
## education     4.0020     0.3115  12.846  < 2e-16 ***
## log(income)  11.4375     1.4371   7.959 2.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared:  0.831,  Adjusted R-squared:  0.8275 
## F-statistic: 243.3 on 2 and 99 DF,  p-value: < 2.2e-16

Example of lm

Estimate model with interactions:

M6 <- lm(prestige ~ education * type, 
         data = Prestige)

summary(M6)
## 
## Call:
## lm(formula = prestige ~ education * type, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7095  -5.3938   0.8125   5.3968  16.1411 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.2936     8.6470  -0.497    0.621    
## education            4.7637     1.0247   4.649 1.11e-05 ***
## typeprof            18.8637    16.8881   1.117    0.267    
## typewc             -24.3833    21.7777  -1.120    0.266    
## education:typeprof  -0.9808     1.4495  -0.677    0.500    
## education:typewc     1.6709     2.0777   0.804    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.827 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8012, Adjusted R-squared:  0.7904 
## F-statistic: 74.14 on 5 and 92 DF,  p-value: < 2.2e-16

Diagnose heteroskedasticity

Use plot on a model object to run visual diagnostics. which = 1 plots residuals against fitted values; a funnel shape suggests heteroskedastic errors.

plot(M2, which = 1)

Diagnose non-normality of errors

Use plot with which = 2 to draw a normal Q-Q plot of the residuals; points far from the diagonal line suggest non-normally distributed errors.

plot(M2, which = 2)

Example of logistic regression with glm

Example from UCLA IDRE.

Simulated data on admission to graduate school.

# Load data
URL <- 'http://www.ats.ucla.edu/stat/data/binary.csv'
Admission <- read.csv(URL)

Example of logistic regression with glm

car::scatterplotMatrix(Admission)

Contingency table for school rank and admission

admit_table <- xtabs(~admit + rank, data = Admission)
admit_table
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12

Row and column proportions

# Row proportions
prop.table(admit_table, margin = 1)
##      rank
## admit          1          2          3          4
##     0 0.10256410 0.35531136 0.34065934 0.20146520
##     1 0.25984252 0.42519685 0.22047244 0.09448819
# Column proportions
prop.table(admit_table, margin = 2)
##      rank
## admit         1         2         3         4
##     0 0.4590164 0.6423841 0.7685950 0.8208955
##     1 0.5409836 0.3576159 0.2314050 0.1791045

Summary of contingency table for school rank and admission

summary(admit_table)
## Call: xtabs(formula = ~admit + rank, data = Admission)
## Number of cases in table: 400 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 25.242, df = 3, p-value = 1.374e-05

Example of logistic regression with glm

Logit1 <- glm(admit ~ gre + gpa + as.factor(rank),
              data = Admission, family = 'binomial')

Note: Link function is assumed to be logit if family = 'binomial'.
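The same model with the link function written out explicitly (equivalent to the call above):

Logit1_explicit <- glm(admit ~ gre + gpa + as.factor(rank),
                       data = Admission,
                       family = binomial(link = 'logit'))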

Example of logistic regression with glm

confint(Logit1)
##                          2.5 %       97.5 %
## (Intercept)      -6.2716202334 -1.792547080
## gre               0.0001375921  0.004435874
## gpa               0.1602959439  1.464142727
## as.factor(rank)2 -1.3008888002 -0.056745722
## as.factor(rank)3 -2.0276713127 -0.670372346
## as.factor(rank)4 -2.4000265384 -0.753542605

Interpreting logistic regression results

\(\beta\)'s in logistic regression are interpretable as changes in the log odds of \(Y_{i} = 1\). These are weird (hard to interpret substantively).

If we exponentiate log odds we get odds ratios.

exp(cbind(OddsRatio = coef(Logit1), confint(Logit1)))
##                  OddsRatio       2.5 %    97.5 %
## (Intercept)      0.0185001 0.001889165 0.1665354
## gre              1.0022670 1.000137602 1.0044457
## gpa              2.2345448 1.173858216 4.3238349
## as.factor(rank)2 0.5089310 0.272289674 0.9448343
## as.factor(rank)3 0.2617923 0.131641717 0.5115181
## as.factor(rank)4 0.2119375 0.090715546 0.4706961

These are also weird.

Interpreting logistic regression results

What we really want are predicted probabilities.

First create a data frame of fitted values:

fitted <- with(Admission, 
               data.frame(gre = mean(gre), 
                          gpa = mean(gpa),
                          rank = factor(1:4)))
fitted
##     gre    gpa rank
## 1 587.7 3.3899    1
## 2 587.7 3.3899    2
## 3 587.7 3.3899    3
## 4 587.7 3.3899    4

Interpreting logistic regression results

Second, predict probability point estimates for each fitted value:

fitted$predicted <- predict(Logit1, newdata = fitted,
                            type = 'response')

fitted
##     gre    gpa rank predicted
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684
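Point estimates alone hide estimation uncertainty. One common sketch for intervals: predict on the link (log odds) scale with standard errors, build the interval there, then back-transform with plogis (the inverse logit):

pred_link <- predict(Logit1, newdata = fitted,
                     type = 'link', se.fit = TRUE)

fitted$lower <- plogis(pred_link$fit - 1.96 * pred_link$se.fit)
fitted$upper <- plogis(pred_link$fit + 1.96 * pred_link$se.fit)

fitted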

More interpretation

Next class we will explore other methods of interpreting results from regression models.

Seminar: modelling

Begin working on the statistical models for your project.

and/or

Out of Lecture Challenge: Estimate a normal regression model and plot predicted values with uncertainty across a range of fitted values.

Functions you might find useful (a sketch follows below):

  • coef and confint
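One possible sketch, not the only solution: for a linear model such as M1 above, predict over a range of fitted values with interval = 'confidence' and plot the result. The education range below is chosen purely for illustration.

# Predicted prestige across a range of education values, with 95% intervals
fitted_range <- data.frame(education = seq(6, 16, by = 0.1))
predictions <- data.frame(fitted_range,
                          predict(M1, newdata = fitted_range,
                                  interval = 'confidence'))

library(ggplot2)
ggplot(predictions, aes(x = education, y = fit)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
    geom_line() +
    theme_bw()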