Classification trees

Classification trees are a method commonly used in data mining. They are popular becase they are easy to interpret and reflect how (some) decisions are made.

image here

Classification trees

The basic idea is to (recursively) divide the feature space into regions and fit a simple ‘model’ for each region.

Details:

Grow in top-down fasion (divisive) by choosing the split that optimizies the criterion (greedy).
Grow a large tree and then prune it at the end.

We can estimate the probability that the class label is \(k\) given that the feature vector lies in region \(M\).

Some things to consider:

How do we choose the best splits?

We want splits that best separate our class labels.

How large do we grow the tree (when do we stop splitting?)

We want the number of observations in the terminal node to be reasonably large otherwise we are just modeling noise.

Stopping rules:

If all observations belong to the same class.
Too few observations to continue splitting.

Split criterions

There are several different methods for choosing the best split.

Misclassification error
Gini
Cross-entropy (deviance)

Misclassification error

The proportion of the observations in that region that do not belong to the most common class.

\[\text{Error} = 1 - \text{max}(\hat{p}_{mk})\]

Gini

Measures the total variance across the \(K\) classes.

Small values indicates that a node contains predominantly observations from a single class (i.e. small variation).

Attempts to separate classes by focusing on one class at a time. It will always favor working on the largest or, if you use costs or weights, the most “important” class in a node.

\[\text{Gini} = \overset{K}{\underset{k = 1}{\Sigma}} \hat{p}_{mk} (1 - \hat{p}_{mk})\]

Cross-entropy (deviance)

Similar measure of node purity where small values also indicate node purity (like Gini).

Instead of separating one class at a time entropy tends to separate groups of classes, each accounting for roughly 50% of the observations.

Results in more evenly balanced trees.

\[\text{Deviance} = -\overset{K}{\underset{k = 1}{\Sigma}} \hat{p}_{mk} \text{log} \hat{p}_{mk}\]

So which split criterion should you use?

It doesn’t really matter. Different split criterions usually give similar performance.

The tree package uses defaults to deviance; random forests uses Gini.

Two-dimensional example

# Summary of dataset
summary(train_data)

##     Group           X1                X2          
##  Orange:297   Min.   :-0.9804   Min.   :-0.96966  
##  Blue  :203   1st Qu.:-0.3941   1st Qu.:-0.34460  
##               Median : 0.1148   Median : 0.09573  
##               Mean   : 0.1261   Mean   : 0.07507  
##               3rd Qu.: 0.6584   3rd Qu.: 0.52230  
##               Max.   : 1.1442   Max.   : 1.11792

Two-dimensional example

# Print random 5 rows
head(train_data[sample(nrow(train_data)),])

##      Group          X1         X2
## 110   Blue  0.35868128 -0.1638366
## 45  Orange -0.34901851  0.1246819
## 66  Orange  0.90250625  0.5254130
## 438 Orange  0.05070714  0.6575863
## 324 Orange  0.76363161  0.1595582
## 14  Orange  0.81636073  0.2001557

Fitting the tree

# Load the `tree` package
library(tree)
# Fit the classification tree
tree_fit <- tree(Group ~ X1 + X2, data = train_data)
# Model summary
summary(tree_fit)

## 
## Classification tree:
## tree(formula = Group ~ X1 + X2, data = train_data)
## Number of terminal nodes:  17 
## Residual mean deviance:  0.242 = 116.9 / 483 
## Misclassification error rate: 0.042 = 21 / 500

# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels
text(tree_fit)

Dendrogram

# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels
text(tree_fit, all = TRUE)

Two-dimensional example: No splits

Two-dimensional example: 1 split

Two-dimensional example: 2 splits

Two-dimensional example: 3 splits

Two-dimensional example: 4 splits

Two-dimensional example: 8 splits

Performance

# Get the predicted class labels for the test data
Yhat_test <- predict(tree_fit, newdata = test_data, type = "class")
# Confusion matrix
(tab <- table('Predicted label' = Yhat_test, 'True label' = test_data$Group))

##                True label
## Predicted label Orange Blue
##          Orange    284   14
##          Blue       31  171

# Classification error
1 - (sum(diag(tab)) / sum(tab))

## [1] 0.09

Classification tree recap

Divide the feature space up into regions by making successive splits on the data and then within each region predict the most commonly occuring class.

Advantages

Easy to interpret.
Similar to human decision making.
Do well with non-linear problems.
Fast and scalable.

Disadvantages

Trees have high variance and are unstable.

Single trees may overfit the training data.

Small changes in the data will have big changes on the tree.

Random forests

Differences

Instead of growing a single tree, grow many trees.

For each tree withhold approx. 1/3 of the data as a test set (bagging).

For each split consider only some of the predictors (approx. \(\sqrt{p}\)) which decorrelates the trees by introducing variance.

Predictions are made using a majority vote strategy (or sometimes averaged probabilities).

Advantages

Built-in test set.

Built-in measures of variable importance.

More stable predictions.

Out-of-bag (OOB) error

For each tree we have dataset of N randomly selected observations. On average approx. 1/3 of the data is unused.

The out-of-bag, or out-of-sample, error is the misclassification rate when we fit the data to the trees for which it was withheld.

This error is robust, there is no need for cross-validation.

OOB error can be calculated for each tree as you go, so you can find when adding additional trees no longer improves error.

Variable importance

For a single classification tree, any predictor that is used may be important.

Mean decrease in accuracy

Since random forests contains many trees we can measure the the classification error for the OOB data and after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.

Mean decrease in Gini

The total decrease in node impurities (Gini) from splitting on the variable, averaged over all trees.

`randomForest` package

# install.packages("randomForest")
library(randomForest)

rf_fit <- randomForest(Group ~ X1 + X2,
                       data = train_data,
                       xtest = test_data[,c("X1", "X2")],
                       ytest = test_data$Group,
                       ntree = 1000,
                       importance = TRUE)

Two-dimensional example

## 
## Call:
##  randomForest(formula = Group ~ X1 + X2, data = train_data, xtest = test_data[,      c("X1", "X2")], ytest = test_data$Group, ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 5.8%
## Confusion matrix:
##        Orange Blue class.error
## Orange    283   14  0.04713805
## Blue       15  188  0.07389163
##                 Test set error rate: 4.6%
## Confusion matrix:
##        Orange Blue class.error
## Orange    302   13  0.04126984
## Blue       10  175  0.05405405

Misclassification rates

Variable importance

##      Orange      Blue MeanDecreaseAccuracy MeanDecreaseGini
## X1 158.7368 170.96289             200.9958        159.77946
## X2 101.4301  96.22541             130.1410         80.59503

Additional topics

Extremely randomized forests - Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42. doi:10.1007/s10994-006-6226-1

available in the extraTrees package

Gradient boosting machines - Friedman, J. H. (1999). Greedy Function Approximation: A Gradient Boosting Machine.

available in the gbm package

Next month

Thursday, May 7, 2015 7:00 PM

Barracuda Networks 317 Maynard St, Ann Arbor, MI

R markdown: A tool for code sharing, reproducibility, and more! by Marian Schmidt

and

Support Vector Machines and their implementation in R by Michelle Berry

Classification: trees and random forests

Classification trees

Classification trees

Some things to consider:

Split criterions

Misclassification error

Gini

Cross-entropy (deviance)

So which split criterion should you use?

Two-dimensional example

Two-dimensional example

Fitting the tree

Dendrogram

Two-dimensional example: No splits

Two-dimensional example: 1 split

Two-dimensional example: 2 splits

Two-dimensional example: 3 splits

Two-dimensional example: 4 splits

Two-dimensional example: 8 splits

Two-dimensional example: 8 splits

Performance

Classification tree recap

Random forests

Out-of-bag (OOB) error

Variable importance

randomForest package

Two-dimensional example

Misclassification rates

Variable importance

Summary

Additional topics

Next month

`randomForest` package