Ellis Valentiner
April 9, 2015
Classification trees are a method commonly used in data mining. They are popular because they are easy to interpret and reflect how (some) decisions are made.
The basic idea is to (recursively) divide the feature space into regions and fit a simple ‘model’ for each region.
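For example (an illustrative sketch using the train_data introduced below, with an arbitrary threshold), a single split on X1 defines two regions, and within each region the 'model' is just the most common class:
# Split the feature space at X1 = 0 (arbitrary threshold, for illustration only)
region <- ifelse(train_data$X1 < 0, "X1 < 0", "X1 >= 0")
# Within each region, predict the most commonly occurring class
tapply(train_data$Group, region, function(g) names(which.max(table(g))))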
Details:
We can estimate the probability that the class label is \(k\) given that the feature vector lies in region \(m\).
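Concretely, if the region (call it \(R_m\)) contains \(N_m\) training observations, the estimate is simply the class proportion within that region:
\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)\]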
How do we choose the best splits?
We want splits that best separate our class labels.
How large do we grow the tree (when do we stop splitting?)
We want the number of observations in each terminal node to be reasonably large; otherwise we are just modeling noise.
Stopping rules:
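As a concrete example (a sketch using the tree package that appears later; the values shown are its documented defaults), the minimum node sizes and the minimum relative deviance of a node that may still be split are set through tree.control():
library(tree)
# Stopping parameters (shown at their default values):
#   mincut  - minimum number of observations allowed in either child node
#   minsize - smallest node size that may still be split
#   mindev  - a node is split only if its deviance is at least mindev * root deviance
ctrl <- tree.control(nobs = nrow(train_data), mincut = 5, minsize = 10, mindev = 0.01)
tree(Group ~ X1 + X2, data = train_data, control = ctrl)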
There are several different methods for choosing the best split.
The misclassification error is the proportion of the observations in that region that do not belong to the most common class.
\[\text{Error} = 1 - \max_{k}(\hat{p}_{mk})\]
The Gini index measures the total variance across the \(K\) classes.
Small values indicate that a node contains predominantly observations from a single class (i.e. small variation).
Attempts to separate classes by focusing on one class at a time. It will always favor working on the largest or, if you use costs or weights, the most “important” class in a node.
\[\text{Gini} = \sum_{k = 1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})\]
The deviance (or cross-entropy) is a similar measure where, like the Gini index, small values indicate node purity.
Instead of separating one class at a time, entropy tends to separate groups of classes, each accounting for roughly 50% of the observations.
Results in more evenly balanced trees.
\[\text{Deviance} = -\sum_{k = 1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}\]
It doesn’t really matter: different split criteria usually give similar performance.
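A quick illustration (not part of the original analysis) of how the three measures behave for a single node, given its vector of class proportions:
# Compute the three split criteria for a node's class proportions
node_impurity <- function(p) {
  p <- p[p > 0]                        # drop empty classes to avoid log(0)
  c(error    = 1 - max(p),             # misclassification error
    gini     = sum(p * (1 - p)),       # Gini index
    deviance = -sum(p * log(p)))       # deviance / cross-entropy
}
node_impurity(c(0.5, 0.5))    # maximally impure two-class node
node_impurity(c(0.95, 0.05))  # nearly pure node: all three values are small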
The tree package defaults to deviance; random forests use the Gini index.
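If you want to compare criteria yourself, tree() exposes the choice through its split argument (a sketch; "deviance" is the default):
library(tree)
# Refit the same model using the Gini index as the split criterion
gini_fit <- tree(Group ~ X1 + X2, data = train_data, split = "gini")
summary(gini_fit)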
# Summary of dataset
summary(train_data)
##     Group          X1                  X2
##  Orange:297   Min.   :-0.9804   Min.   :-0.96966
##  Blue  :203   1st Qu.:-0.3941   1st Qu.:-0.34460
##               Median : 0.1148   Median : 0.09573
##               Mean   : 0.1261   Mean   : 0.07507
##               3rd Qu.: 0.6584   3rd Qu.: 0.52230
##               Max.   : 1.1442   Max.   : 1.11792
# Print 6 rows in random order
head(train_data[sample(nrow(train_data)),])
## Group X1 X2
## 110 Blue 0.35868128 -0.1638366
## 45 Orange -0.34901851 0.1246819
## 66 Orange 0.90250625 0.5254130
## 438 Orange 0.05070714 0.6575863
## 324 Orange 0.76363161 0.1595582
## 14 Orange 0.81636073 0.2001557
# Load the `tree` package
library(tree)
# Fit the classification tree
tree_fit <- tree(Group ~ X1 + X2, data = train_data)
# Model summary
summary(tree_fit)
##
## Classification tree:
## tree(formula = Group ~ X1 + X2, data = train_data)
## Number of terminal nodes: 17
## Residual mean deviance: 0.242 = 116.9 / 483
## Misclassification error rate: 0.042 = 21 / 500
# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels
text(tree_fit)
# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels to all nodes, not just the terminal nodes
text(tree_fit, all = TRUE)
# Get the predicted class labels for the test data
Yhat_test <- predict(tree_fit, newdata = test_data, type = "class")
# Confusion matrix
(tab <- table('Predicted label' = Yhat_test, 'True label' = test_data$Group))
## True label
## Predicted label Orange Blue
## Orange 284 14
## Blue 31 171
# Classification error
1 - (sum(diag(tab)) / sum(tab))
## [1] 0.09
Divide the feature space up into regions by making successive splits on the data and then, within each region, predict the most commonly occurring class.
Advantages
Disadvantages
Trees have high variance and are unstable.
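One way to see this (an illustrative sketch, not part of the original analysis) is to refit the tree on bootstrap resamples of the training data; the fitted trees, and especially the later splits, typically change noticeably:
# Fit the same classification tree to two bootstrap resamples of train_data
set.seed(1)
boot1 <- train_data[sample(nrow(train_data), replace = TRUE), ]
boot2 <- train_data[sample(nrow(train_data), replace = TRUE), ]
# The two summaries typically report different sizes and different splits
summary(tree(Group ~ X1 + X2, data = boot1))
summary(tree(Group ~ X1 + X2, data = boot2))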
Differences
Advantages
For each tree we have a dataset of N observations selected randomly with replacement (a bootstrap sample). On average, approximately 1/3 of the observations are left out of a given sample.
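The 1/3 figure follows from sampling with replacement: each observation is left out of a particular bootstrap sample with probability \((1 - 1/N)^N \approx e^{-1} \approx 0.37\). A quick check (illustrative only):
# Fraction of observations not drawn into a single bootstrap sample of size 500
set.seed(1)
n <- 500
mean(!(1:n %in% sample(n, replace = TRUE)))  # roughly 0.37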
The out-of-bag (OOB), or out-of-sample, error is the misclassification rate when each observation is predicted using only the trees for which it was withheld.
This error estimate is robust, so there is no need for separate cross-validation.
The OOB error can be calculated as trees are added, so you can tell when additional trees no longer improve the error.
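With the randomForest package (a sketch assuming a fitted object like the rf_fit created below), the running OOB error is stored in the err.rate component, and the plot method draws it against the number of trees:
# OOB and per-class error rates after 1, 2, ... trees
head(rf_fit$err.rate)
# Error rate vs. number of trees; flattening curves suggest enough trees
plot(rf_fit)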
For a single classification tree, any predictor that is used may be important.
Mean decrease in accuracy
Since a random forest contains many trees, we can measure the classification error on the OOB data before and after permuting each predictor variable. The differences between the two errors are then averaged over all trees and normalized by the standard deviation of the differences.
Mean decrease in Gini
The total decrease in node impurities (Gini) from splitting on the variable, averaged over all trees.
randomForest package
# install.packages("randomForest")
library(randomForest)
rf_fit <- randomForest(Group ~ X1 + X2,
                       data = train_data,
                       xtest = test_data[, c("X1", "X2")],
                       ytest = test_data$Group,
                       ntree = 1000,
                       importance = TRUE)
##
## Call:
## randomForest(formula = Group ~ X1 + X2, data = train_data, xtest = test_data[, c("X1", "X2")], ytest = test_data$Group, ntree = 1000, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 5.8%
## Confusion matrix:
## Orange Blue class.error
## Orange 283 14 0.04713805
## Blue 15 188 0.07389163
## Test set error rate: 4.6%
## Confusion matrix:
## Orange Blue class.error
## Orange 302 13 0.04126984
## Blue 10 175 0.05405405
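The importance table below is consistent with calling importance() on the fit above (a sketch; the exact call is not shown in the source, and it requires importance = TRUE when fitting):
# Per-class and overall permutation importance plus mean decrease in Gini
importance(rf_fit)
# varImpPlot(rf_fit) would plot the two overall measures as a dot chart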
##       Orange      Blue MeanDecreaseAccuracy MeanDecreaseGini
## X1  158.7368 170.96289             200.9958        159.77946
## X2  101.4301  96.22541             130.1410         80.59503
Extremely randomized forests - Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42. doi:10.1007/s10994-006-6226-1
extraTrees package
Gradient boosting machines - Friedman, J. H. (1999). Greedy Function Approximation: A Gradient Boosting Machine.
gbm package
Thursday, May 7, 2015 7:00 PM
Barracuda Networks 317 Maynard St, Ann Arbor, MI
R markdown: A tool for code sharing, reproducibility, and more! by Marian Schmidt
and
Support Vector Machines and their implementation in R by Michelle Berry