We start by loading the following packages: HistData, car, and stargazer.
#load up necessary packages
library(HistData)
library(car)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.1. https://CRAN.R-project.org/package=stargazer
We want to determine a straight line that describes the relationship between x, some independent (predictor) variable, and y, an outcome or dependent variable.
You will recall from high school algebra that the equation for a straight line is:
\[\begin{equation} y = \alpha + \beta x \end{equation}\]where \(\alpha\) is the y-intercept (the value of y when x=0) and \(\beta\) is the slope of the line (the amount that y will increase by with every unit increase in x).
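As a quick numerical illustration (with made-up values, not the Galton data below): if \(\alpha = 2\) and \(\beta = 3\), then at x = 1 we get y = 2 + 3(1) = 5, and each additional unit of x adds 3 to y. In R we could sketch such a line with base R's curve function:
# draw the illustrative line y = 2 + 3x over the range 0 to 5
curve(2 + 3 * x, from = 0, to = 5, xlab = "x", ylab = "y")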
A classic example is the height data from Sir Francis Galton, a 19th-century statistician. He collected height data on 905 children born to 205 parents. The dataset is part of the R package HistData, which contains Data Sets from the History of Statistics and Data Visualization. The variables are the height of the child (as an adult) and the mid-parent height, that is, the average of the parents’ heights.
If you go to the Environment tab, you will see the Global Environment. Click on “package:HistData” and you will see the Galton dataset.
Let’s see what is in there:
# see what is in the Galton dataset
summary(Galton)
## parent child
## Min. :64.00 Min. :61.70
## 1st Qu.:67.50 1st Qu.:66.20
## Median :68.50 Median :68.20
## Mean :68.31 Mean :68.09
## 3rd Qu.:69.50 3rd Qu.:70.20
## Max. :73.00 Max. :73.70
Let’s develop the model on ~90% of the dataset, then test it on the remaining ~10% of the data. In the data science world, we call that first step “training” the model.
# divide the dataset into a training and a testing set based on a random uniform number with a fixed seed
set.seed(20170208)
Galton$group <- runif(length(Galton$parent), min = 0, max = 1)
summary(Galton)
## parent child group
## Min. :64.00 Min. :61.70 Min. :0.002463
## 1st Qu.:67.50 1st Qu.:66.20 1st Qu.:0.245599
## Median :68.50 Median :68.20 Median :0.518798
## Mean :68.31 Mean :68.09 Mean :0.501614
## 3rd Qu.:69.50 3rd Qu.:70.20 3rd Qu.:0.743939
## Max. :73.00 Max. :73.70 Max. :0.999420
Galton.train <- subset(Galton, group <= 0.90)
Galton.test <- subset(Galton, group > 0.90)
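As a quick sanity check (not part of the original output), we can confirm that the split really is roughly 90/10 by counting rows:
# check the sizes of the training and testing sets
nrow(Galton.train)
nrow(Galton.test)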
Now let’s graph our training set data.
#graph child on parent for the training dataset
plot(child ~ parent, data = Galton.train)
Let’s do the regression now on the training set. Linear regression is performed with the R function “lm” and takes the form
object.name <- lm(y ~ x, data = data_set_name)
Let’s do that now with the Galton data:
# linear regression of child height on mid-parent height in the training dataset
reg1 <- lm(child ~ parent, data = Galton.train)
summary(reg1)
##
## Call:
## lm(formula = child ~ parent, data = Galton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7923 -1.3502 0.0604 1.6498 5.9445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.85366 2.99886 7.954 5.9e-15 ***
## parent 0.64736 0.04389 14.751 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.2069
## F-statistic: 217.6 on 1 and 829 DF, p-value: < 2.2e-16
Interpretation: for each 1-inch increase in mid-parent height, the predicted child height increases by about 0.65 inches.
The way this works is to estimate values for \(\alpha\) and \(\beta\) such that when you plug in your given independent variable values, you get predicted dependent values that are close to the observed values. In statistics we optimize this closeness by minimizing the sum of squared residuals, that is
\[\begin{equation} \sum\limits_{i=1}^n (Y_{obs.i} - Y_{pred.i})^2 \end{equation}\]So let’s look at how we did. First let’s calculate the observed and predicted values in the training and testing datasets.
# get predicted values in the training and testing dataset
Galton.train$pred.child <- predict(reg1, newdata = Galton.train)
Galton.test$pred.child <- predict(reg1, newdata=Galton.test)
# calculate residuals in the training and testing dataset
Galton.train$resid <- Galton.train$child - Galton.train$pred.child
Galton.test$resid <- Galton.test$child - Galton.test$pred.child
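As an aside, we can connect the formula above to the code: the fitted coefficients minimize the sum of squared residuals on the training data, so the hand-computed sum below should match what R's deviance function reports for the model (a quick check, not part of the original analysis).
# sum of squared residuals in the training set, computed by hand
sum(Galton.train$resid^2)
# the same quantity as stored in the fitted model object
deviance(reg1)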
Now that we have calculated these values, let’s look at some simple plots. The Companion to Applied Regression (aka car) package has some good functionality for this.
library(car)
#get the residual plots
residualPlots(reg1)
## Test stat Pr(>|t|)
## parent 1.459 0.145
## Tukey test 1.459 0.144
Let’s look at how well we do in the testing dataset. First let’s plot the data.
#plot test dataset
plot(child ~ parent, data = Galton.test)
#overlay the regression line
abline(a=23.85, b=0.65)
Now let’s look at the residual plots:
#plot residual versus X
plot(resid~parent, data = Galton.test)
#overlay the zero line
abline(0,0)
#plot residual v predicted
plot(resid~pred.child, data = Galton.test)
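To put a single number on how well the model carries over from the training set to the testing set, one option (a supplement to the plots, not part of the original write-up) is the root mean squared error of the residuals in each set; similar values suggest the model generalizes reasonably well.
# root mean squared error of the residuals, training vs. testing
sqrt(mean(Galton.train$resid^2))
sqrt(mean(Galton.test$resid^2))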
A key part of the statistical analysis of a linear regression is hypothesis testing. We usually want to test a null hypothesis that the slope is 0, i.e., that there is no relationship between y and x. We express this mathematically as
\[\begin{equation} H_0: \beta = 0 \end{equation}\] We test the null hypothesis against an alternative hypothesis, i.e., that the slope is not equal to 0, meaning there is some (positive or negative) relationship between y and x. We express this mathematically as \[\begin{equation} H_A: \beta \neq 0 \end{equation}\]As you have seen in these data, the y-values do not line up as a perfect linear function of x, i.e., child height is not a perfect linear function of parent height. There is an error term for each observation. So we are really fitting the statistical model
\[\begin{equation} Y = \alpha + \beta x + \epsilon \end{equation}\]Recall from above that we estimate \(\alpha\) and \(\beta\) to minimize the sum of squared residuals.
To do a statistical test on the slope we have to make some assumptions. One of the assumptions is that \(\epsilon \sim N(0, \sigma^2)\), that is, that the errors between the observed and predicted values of Y follow a normal distribution with mean 0 and variance \(\sigma^2\).
We can formally test this assumption, but we can also do a qq-plot which will give us a visual representation of the same. We get this plot as follows:
qqPlot(reg1)
In this plot, each residual has been divided by its estimated standard deviation to create a studentized residual. These are then ranked, the percentile of each studentized residual is calculated, and the result is plotted against the corresponding theoretical quantiles. Of course, the car package is doing all of this under the hood, so to speak.
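If you want to peek under that hood, R exposes these quantities directly; a rough sketch (the exact flavour of residual that qqPlot uses can vary by car version, so treat this as illustrative):
# studentized residuals, extracted directly from the model
head(rstudent(reg1))
# a by-hand version of the (internally) standardized residuals:
# each residual divided by sigma-hat times sqrt(1 - leverage)
head(resid(reg1) / (summary(reg1)$sigma * sqrt(1 - hatvalues(reg1))))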
To do the significance test, recall the summary of the linear regression model we had above.
##
## Call:
## lm(formula = child ~ parent, data = Galton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7923 -1.3502 0.0604 1.6498 5.9445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.85366 2.99886 7.954 5.9e-15 ***
## parent 0.64736 0.04389 14.751 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.2069
## F-statistic: 217.6 on 1 and 829 DF, p-value: < 2.2e-16
Note that there are two estimated coefficients, so our estimated line is
\[\begin{equation} child = 23.85 + 0.65*parent \end{equation}\]You will also notice that next to each estimate there is a column for the standard error, a column for the t-value, and a column for the p-value. We divide the estimate by its standard error to compute the t-value, which here has 829 degrees of freedom. For the slope estimate, the p-value is < 2e-16, i.e., the slope is highly significantly different from 0.
At the bottom of the summary you will notice a line for the F-statistic, which is the ratio of the mean square for the model to the mean square of the residuals. It has an F-distribution with 1 and 829 degrees of freedom and takes on the value 217.6, which has a p-value < 2.2e-16. Since this is a univariate regression, if you take the value of the t-statistic for the slope and square it, you will get the value of the F-statistic.
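We can verify that claim directly from the fitted model object (a quick check, not in the original write-up):
# square of the t-statistic for the slope...
coef(summary(reg1))["parent", "t value"]^2
# ...equals the F-statistic reported by the summary
summary(reg1)$fstatistic["value"]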
Isn’t math fun?
Suppose you have more than one independent variable that you think explains the dependent variable. That is, instead of the simple univariate case of
\[\begin{equation} y = \alpha + \beta x \end{equation}\]You have the multivariate case of
\[\begin{equation} y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \end{equation}\]As an example, consider the Prestige dataset (which comes with the car package). It consists of 102 observations on 6 variables, as follows:
education: The average number of years of education for occupational incumbents.
income: The average income of occupational incumbents, in dollars.
women: The percentage of women in the occupation.
prestige: The average prestige rating for the occupation.
census: The code of the occupation used in the survey.
type: Professional and managerial (prof), white collar (wc), blue collar (bc), or missing (NA).
We want to determine how the prestige of an occupation is related to average income, education (average number of years for incumbents), and the percentage of women in the occupation.
Let’s explore first:
#explore the 5 m's for the Prestige dataset
summary(Prestige)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
Now let’s see some plots:
#produce a scatterplot matrix
scatterplotMatrix(~ prestige + income + education + women, span = 0.7, data = Prestige)
Hmmm…the variable income looks kind of wonky, for lack of a better term (it is strongly right-skewed). So let’s see if a transformation will help:
scatterplotMatrix(~ prestige + log2(income) + education + women, span = 0.7, data = Prestige)
Well, that looks a little better.
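To see more directly why the transformation helps, we could compare the distribution of income on the raw and logged scales; a quick illustration using base R histograms (not part of the original analysis):
# income is strongly right-skewed on the raw scale...
hist(Prestige$income, main = "income", xlab = "income")
# ...and much more symmetric after the log2 transformation
hist(log2(Prestige$income), main = "log2(income)", xlab = "log2(income)")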
Now let’s see what the regression of prestige on income (logged), education, and women looks like:
# let the regression rip
prestige.mod1 <- lm(prestige ~ education + log2(income) + women, data= Prestige)
#and see what we have
summary(prestige.mod1)
##
## Call:
## lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.364 -4.429 -0.101 4.316 19.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -110.9658 14.8429 -7.476 3.27e-11 ***
## education 3.7305 0.3544 10.527 < 2e-16 ***
## log2(income) 9.3147 1.3265 7.022 2.90e-10 ***
## women 0.0469 0.0299 1.568 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.093 on 98 degrees of freedom
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
From this output we see that if education increases by 1 year, then the average prestige rating will increase by 3.73 units, with log2(income) and women held constant.
For education we note that the p-value is < 2e-16, so we reject the null hypothesis \(H_0: \beta_1 = 0\).
How do we interpret the contributions of log2(income) and women?
Another point to remark on: the multiple R-squared is 0.84, indicating that 84% of the variability in average prestige rating is explained by these three independent variables.
Finally, the F-statistic of 165.4 is for testing the null hypothesis \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\). It is highly significant. (And that “squaring the t to get the F” trick only works for the univariate regression case…)
Since the p-value for women does not reach the traditional significance level of 0.05, we might consider removing it from our model. Let’s see what happens then…
#run the model with only two predictors
prestige.mod2 <- lm(prestige ~ education + log2(income), data= Prestige)
# look at the results
summary(prestige.mod2)
##
## Call:
## lm(formula = prestige ~ education + log2(income), data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95.1940 10.9979 -8.656 9.27e-14 ***
## education 4.0020 0.3115 12.846 < 2e-16 ***
## log2(income) 7.9278 0.9961 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
Hmmm…education and log2(income) remain highly significant, there is little reduction in R-squared, and the overall model is still significant.
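One way to make that comparison formal (a supplement to the eyeball test above, not part of the original sequence of steps) is a partial F-test on the nested models using R's anova function; a large p-value for the comparison supports dropping women.
# partial F-test: does adding women improve the two-predictor model?
anova(prestige.mod2, prestige.mod1)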
We can compare the results of these two analyses a little more cleanly using the stargazer package as follows:
# compare the results of the two regression models
stargazer(prestige.mod1, prestige.mod2, title="Comparison of 2 Regression outputs",
type = "html", align=TRUE)
Comparison of 2 Regression outputs (dependent variable: prestige)

|                     | (1)                     | (2)                     |
|---------------------|-------------------------|-------------------------|
| education           | 3.731*** (0.354)        | 4.002*** (0.312)        |
| log2(income)        | 9.315*** (1.327)        | 7.928*** (0.996)        |
| women               | 0.047 (0.030)           |                         |
| Constant            | -110.966*** (14.843)    | -95.194*** (10.998)     |
| Observations        | 102                     | 102                     |
| R2                  | 0.835                   | 0.831                   |
| Adjusted R2         | 0.830                   | 0.828                   |
| Residual Std. Error | 7.093 (df = 98)         | 7.145 (df = 99)         |
| F Statistic         | 165.428*** (df = 3; 98) | 243.323*** (df = 2; 99) |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |                     |
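As an aside, if you are working interactively at the console rather than knitting to HTML, stargazer can also print a plain-text version of the same table by setting type = "text":
# plain-text version of the comparison table, handy at the console
stargazer(prestige.mod1, prestige.mod2,
          title = "Comparison of 2 Regression outputs", type = "text")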
We conclude that we lose little by eliminating a predictor variable. This is also consistent with a principle of parsimony, also known as the KISS principle, or “Keep It Simple, Silly”.
Let’s finish with some diagnostics. First, the plots for the first model with 3 independent variables:
# diagnostics for the first model with 3 independent variables
residualPlots(prestige.mod1)
## Test stat Pr(>|t|)
## education 1.756 0.082
## log2(income) 0.585 0.560
## women 2.277 0.025
## Tukey test 0.763 0.445
Notice that there is a non-zero trend for the variable women (the curvature test is significant, with p = 0.025).
And now the plots for the second model with 2 independent variables:
# diagnostics for the second model with 2 independent variables
residualPlots(prestige.mod2)
## Test stat Pr(>|t|)
## education 1.615 0.109
## log2(income) 0.221 0.825
## Tukey test 0.653 0.514
Now the plots look really good, much better.
Another diagnostic tool is the added-variable plot, which shows the additional contribution of each variable given that all of the other variables are already in the model. In these plots we can also identify the most influential observations.
#added variable plots
avPlots(prestige.mod1, id.n=2, id.cex=0.7)
#id.n - identify n most influential observations
#id.cex - controls the size of the point labels
Let’s run the qq-plot:
# run the qq-plot
qqPlot(prestige.mod1, id.n=3)
## collectors ministers farmers
## 1 101 102
# here, id.n identifies the n observations with the largest residuals in absolute value
Are there any outliers?
#run Bonferroni test for outliers
outlierTest(prestige.mod1)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## farmers 2.857462 0.0052259 0.53305
Are there any points that are of high influence?
#identify highly influential points
influenceIndexPlot(prestige.mod1, id.n=3)
NB. If there are points that are a) outliers AND b) highly influential, these have the potential to change the inference. You should consider removing them.
How do we make heads or tails out of the plots above? One way is with an influence plot.
#make influence plot
influencePlot(prestige.mod1, id.n=3)
## StudRes Hat CookD
## ministers 2.4211475 0.10072159 0.15638036
## collectors -2.5320314 0.01352116 0.02081914
## newsboys -0.3534964 0.28595843 0.01262364
## babysitters 1.7227767 0.19703063 0.17848347
## farmers 2.8574618 0.03896096 0.07711591
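If a point did show up as both an outlier and highly influential, one way to gauge its impact is to refit the model without it and compare the coefficients. A sketch (not part of the original analysis), using ministers purely as an example row name taken from the output above and car's compareCoefs to line up the two fits:
# refit the model excluding one flagged occupation and compare coefficients
prestige.mod1.drop <- update(prestige.mod1,
                             data = Prestige[rownames(Prestige) != "ministers", ])
compareCoefs(prestige.mod1, prestige.mod1.drop)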
Another diagnostic is to test for heteroskedasticity (i.e., the variance of the error term is not constant).
#test for heteroskedasticity
ncvTest(prestige.mod1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.02841389 Df = 1 p = 0.8661394
The large p-value (0.87) gives us no evidence of non-constant error variance here. We also want to look for multicollinearity, that is, whether some of our independent variables are highly correlated with each other. We do this by looking at the Variance Inflation Factor (VIF); a VIF greater than 4 suggests problematic collinearity.
vif(prestige.mod1)
## education log2(income) women
## 1.877097 2.572283 1.806431
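All three VIFs are well below 4, so multicollinearity does not appear to be a concern for this model.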