Blake Nicholson, Seelio

March 12, 2015

- Dimension reduction technique
- Example use case: regression in the presence of multicollinearity
- \(\textbf{PC} = \mathbf{\alpha x}\)
- Iteratively select principal component that accounts for the maximum variance

- Latent variable identification
- Example use case: indirectly measure concepts such as intelligence that cannot be measured directly
- \(\mathbf{x} = \mathbf{\Lambda f + u}\)
- Select \(\mathbf{\Lambda}\) to best fit the sample covariance matrix

- Primary purpose is as a dimension reduction technique
- Goal is to transform a set of correlated variables into a much smaller set of uncorrelated variables
- Each principal component is a linear combination of the input variables:

\[ \text{PC}_1 = \alpha_{11}x_1 + \alpha_{12}x_2 + \cdot\cdot\cdot + \alpha_{1n}x_n \] \[\vdots\] \[ \text{PC}_k = \alpha_{k1}x_1 + \alpha_{k2}x_2 + \cdot\cdot\cdot + \alpha_{kn}x_n, \]

where \(k << n\)

- The principal components are determined iteratively, starting with \(\text{PC}_1\), then \(\text{PC}_2\), etc. The coefficients above are chosen to maximize the variance explained while satisfying the requirement that the components be uncorrelated.

- Run
`prcomp`

(or`princomp`

) - Examine output, summary, and plot of (1)
- Determine how many components to keep
- Based upon topical knowledge/experience
- Account for some threshold cumulative proportion of variance
- Locate the “elbow” of the scree plot
- Keep all principal components with above average variance (variance > \(1\) when working with scaled data)

```
library(ggplot2)
qplot(data=cars, x=speed, y=dist)
```

```
pca <- prcomp(cars)
print(pca)
```

```
## Standard deviations:
## [1] 26.12524 3.08084
##
## Rotation:
## PC1 PC2
## speed -0.1656479 -0.9861850
## dist -0.9861850 0.1656479
```

- “Standard deviations” shows the square roots of the eigenvalues
- “Rotation” shows the eigvenvectors (these provide the \(\alpha\) coefficients shown earlier in the presentation)
- You can confirm by computing the inner product that the two principal components shown in the “Rotation” output are indeed orthogonal

`print(summary(pca))`

```
## Importance of components:
## PC1 PC2
## Standard deviation 26.1252 3.08084
## Proportion of Variance 0.9863 0.01372
## Cumulative Proportion 0.9863 1.00000
```

Principal component 1 accounts for nearly all of the variance due to the fact that the variances of the two columns are quite different. In general, one should scale their data to have unit variance by passing “scale.=TRUE” to prcomp.

`plot(pca)`

Normally, you would be considering more than two components and the scree plot would look more like:

```
ev.slopes <- pca$rotation[2, ]/pca$rotation[1, ]
cars.centered <- transform(cars, speed=speed-mean(speed),
dist=dist-mean(dist))
qplot(data=cars.centered, x=speed, y=dist) +
geom_abline(intercept=0, slope=ev.slopes[1])
```

- Scale input data
- Reduce impact of dramatically different variances amongst input variables
- Pass
`scale.=TRUE`

to`prcomp`

or`cor=TRUE`

to`princomp`

- Rescale principal components
- Enforce that loadings are correlation between variables and components

- Rotate components
- Eases interpretation, but components no longer “principal” components
- Orthogonal: varimax, quartimax
- Oblique: oblimin, promax
- See package
`GPArotation`

- Project original data onto components
- Referred to as “scores”
- See
`princomp$scores`

or`prcomp$x`

Check out the `psych`

package

Provides:

`principal()`

`fa.parallel()`

`fa()`

- …and more!

- Collection of methods designed to uncover the latent structure in a given set of variables
- Factors are assumed to underlie the observed variables: \[ x_1 = \lambda_{11} f_1 + \lambda_{12} f_2 + \cdot\cdot\cdot + \lambda_{1n} f_n + u_1 \] \[ \vdots \] \[ x_k = \lambda_{k1} f_1 + \lambda_{k2} f_2 + \cdot\cdot\cdot + \lambda_{kn} f_n + u_k \]

(Recall that in PCA the assumption was that principal components were a linear combination of observed variables)

- Factor loadings and errors aren’t directly observable, but are inferred from the correlations among the variables
- Determine factor loadings as those that most accurately reproduce the sample covariance matrix

- Choose number of factors
- Run
`factanal`

`head(mtcars)`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

```
sapply(1:3, function(f)
factanal(mtcars, factors=f, method="mle")$PVAL)
```

```
## objective objective objective
## 1.496220e-17 4.047510e-04 2.051923e-01
```

Find the first number of factors where the p-value is not significant. The results above suggest that we should use 3 factors.

`factanal`

`factanal(mtcars, factors=3, method="mle")`

```
##
## Call:
## factanal(x = mtcars, factors = 3, method = "mle")
##
## Uniquenesses:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0.135 0.055 0.090 0.127 0.290 0.060 0.051 0.223 0.208 0.125 0.158
##
## Loadings:
## Factor1 Factor2 Factor3
## mpg 0.643 -0.478 -0.473
## cyl -0.618 0.703 0.261
## disp -0.719 0.537 0.323
## hp -0.291 0.725 0.513
## drat 0.804 -0.241
## wt -0.778 0.248 0.524
## qsec -0.177 -0.946 -0.151
## vs 0.295 -0.805 -0.204
## am 0.880
## gear 0.908 0.224
## carb 0.114 0.559 0.719
##
## Factor1 Factor2 Factor3
## SS loadings 4.380 3.520 1.578
## Proportion Var 0.398 0.320 0.143
## Cumulative Var 0.398 0.718 0.862
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 30.53 on 25 degrees of freedom.
## The p-value is 0.205
```