May 7, 2015

Acknowledgments

Overview of supervised learning

Predictive models are calibrated on a set of training data and validated on a separate set of test data

What are Support Vector Machines?

SVM is typically used as a classification algorithm

  • It is an umbrella term for:
    • Maximal margin classifier
    • Support vector classifier
    • Support vector machine
  • It is a popular algorithm because:
    • It works well in a variety of settings
    • It sounds cool

Separating Hyperplanes

  • SVM classifies data with a separating hyperplane

  • A hyperplane in p-dimensional space is a flat affine subspace of dimension p-1 (a line when p = 2, a plane when p = 3)

  • If the data are linearly separable, there will be an infinite number of separating hyperplanes
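
A hyperplane can be written as beta0 + beta1*X1 + ... + betap*Xp = 0, and a point is classified by which side of the hyperplane it falls on. A minimal sketch of that idea, with made-up coefficients and simulated points:

# Classify simulated 2-D points by the sign of beta0 + beta1*X1 + beta2*X2;
# the hyperplane is the set of points where this expression equals 0.
# The coefficients below are arbitrary choices for illustration.
set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)   # 20 points in p = 2 dimensions
beta0 <- 0.5
beta  <- c(1, -2)
side  <- sign(beta0 + x %*% beta)      # +1 on one side of the hyperplane, -1 on the other
table(side)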

Maximal Margin Classifier

  • The goal of the maximal margin classifier is to find a hyperplane with the widest margin between the two classes

  • We hope that a large margin on the training data will translate to a large margin on the test data

Support Vectors

  • Observations on the edge of the margin are called
    support vectors

  • If these points were moved slightly, the maximal margin hyperplane would move as well.
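
A minimal sketch of both ideas (maximal margin and support vectors), assuming the e1071 package used later in these slides and simulated, essentially separable data: with a linear kernel and a very large cost the soft margin effectively becomes the maximal margin, and the fitted object records which training observations are the support vectors.

library(e1071)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)            # 20 made-up observations in 2 dimensions
y <- factor(rep(c(-1, 1), each = 10))
x[y == 1, ] <- x[y == 1, ] + 3              # shift one class away so the classes separate
sep.data <- data.frame(x = x, y = y)
mmc.fit <- svm(y ~ ., data = sep.data, kernel = "linear",
               cost = 1e5, scale = FALSE)   # huge cost ~ (almost) a hard margin
mmc.fit$index                               # row indices of the support vectors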

Maximal Margin is not ideal!

  • If a separating hyperplane exists, it will necessarily perfectly classify all training points
    • The classifier will likely overfit the training data and
      perform poorly on new, unseen data
  • Also, what if your classes are not linearly separable?

Solution: use a soft margin

Support Vector Classifier

  • We allow some training points to violate the margin, or even fall on the wrong side of the hyperplane

  • Good solution if your classes are close to linearly separable

  • More robust than the maximal margin classifier because it is less sensitive to individual observations
    • model has lower variance

How to construct a Support Vector Classifier

Introduce a cost parameter that determines how expensive it is for training data points to violate the margin

  • Cost is chosen through cross-validation, such as k-fold cross-validation

  • High Cost = smaller margin; lower bias, higher variance
  • Low Cost = wider margin; higher bias, lower variance
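
A minimal sketch of this trade-off, assuming simulated overlapping classes and the e1071 package: a small cost permits many margin violations (wide margin, many support vectors), while a large cost permits few (narrow margin, fewer support vectors).

library(e1071)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(rep(c(-1, 1), each = 10))
x[y == 1, ] <- x[y == 1, ] + 1              # classes overlap: not linearly separable
overlap.data <- data.frame(x = x, y = y)
low.cost  <- svm(y ~ ., data = overlap.data, kernel = "linear", cost = 0.1)
high.cost <- svm(y ~ ., data = overlap.data, kernel = "linear", cost = 100)
# typically fewer support vectors (a narrower margin) at the higher cost
c(low.cost = length(low.cost$index), high.cost = length(high.cost$index))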

Examples of different Cost values

What if your decision boundary isn't linear at all?

  • Use a support vector machine, which enlarges the feature space so that the decision boundary becomes linear in that enlarged space

Support Vector Machine

One solution for enlarging the feature space is to use transformations of the features

For example, rather than fitting a support vector classifier with p features:

X1,X2, . . . Xp

Fit it with 2p features:

X1, (X1)^2, X2, (X2)^2, . . . Xp, (Xp)^2

  • In the expanded feature space the decision boundary is linear, even though it is non-linear in the original feature space
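
A minimal sketch of this quadratic expansion, assuming simulated data with a circular class boundary and the e1071 package; the data and variable names are made up for illustration. A support vector classifier (linear kernel) is fit on the original features and again after adding their squares.

library(e1071)
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 2, 1, 2))    # circular, non-linear boundary
linear.dat   <- data.frame(x = x, y = y)
expanded.dat <- data.frame(x = x, xsq = x^2, y = y)   # add (X1)^2 and (X2)^2
linear.fit   <- svm(y ~ ., data = linear.dat,   kernel = "linear", cost = 1)
expanded.fit <- svm(y ~ ., data = expanded.dat, kernel = "linear", cost = 1)
mean(linear.fit$fitted   == y)   # poor: the boundary is not linear in X1, X2
mean(expanded.fit$fitted == y)   # much better: the boundary is linear in the expanded space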

Kernels

  • The previous method will quickly generate a very large feature space and may become computationally intractable

  • Instead, we can apply a kernel, which computes the similarities (inner products) needed in the enlarged feature space without ever constructing that space explicitly
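
As a concrete sketch with made-up vectors, the radial kernel used in Example 1 below scores the similarity of two observations as exp(-gamma * squared distance between them):

# Radial (RBF) kernel value for two observations x1 and x2:
#   K(x1, x2) = exp(-gamma * sum((x1 - x2)^2))
radial.kernel <- function(x1, x2, gamma = 1) exp(-gamma * sum((x1 - x2)^2))
radial.kernel(c(0, 0), c(1, 2))   # small for distant points
radial.kernel(c(0, 0), c(0, 0))   # exactly 1 for identical points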

SVM with more than 2 classes

SVM does not translate elegantly to a multiclass situation

  • One vs One
    • Construct classifiers for all possible pairwise combinations of classes
    • Pick the class that was selected the most times
  • One vs All
    • Compare each of the K classes to the remaining K-1 classes
    • Pick the class whose classifier places the observation furthest on the correct side of the margin
    • The logic is that this is the class we are most confident the data point actually belongs to
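
A rough sketch of the one-vs-one approach on made-up three-class data, assuming the e1071 package; svm() performs this voting internally for multiclass problems (see Example 2), so the loop below only spells out the logic.

library(e1071)
set.seed(1)
x <- matrix(rnorm(150 * 2), ncol = 2)
y <- factor(rep(1:3, each = 50))
x[y == 2, 1] <- x[y == 2, 1] + 3          # shift the classes apart so they are distinguishable
x[y == 3, 2] <- x[y == 3, 2] + 3
multi.dat <- data.frame(x = x, y = y)

class.pairs <- combn(levels(y), 2, simplify = FALSE)   # all K*(K-1)/2 = 3 pairs of classes
votes <- sapply(class.pairs, function(pair) {
  sub <- droplevels(multi.dat[multi.dat$y %in% pair, ])
  fit <- svm(y ~ ., data = sub, kernel = "linear", cost = 1)
  as.character(predict(fit, newdata = multi.dat))      # each pairwise classifier votes
})
one.vs.one <- apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote
table(one.vs.one, multi.dat$y)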

Example 1:

Support Vector Machine with radial kernel

Train model

library(e1071)
set.seed(1)
# radial.data: a simulated two-class data set assumed to have been created
# earlier (200 observations); half of the rows are used for training
train <- sample(200, 100)
svmfit <- svm(y ~ ., data = radial.data[train, ],
              kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, radial.data[train, ])

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = radial.data[train, ], kernel = "radial", 
##     gamma = 1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  35
## 
##  ( 20 15 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

Tune model

set.seed(1)
# 10-fold cross-validation over a grid of cost and gamma values
tune.out <- tune(svm, y ~ ., data = radial.data[train, ], kernel = "radial",
                 ranges = list(cost = c(.1, 1, 10, 100, 1000),
                               gamma = c(.5, 1, 2, 3, 4)))

summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   100   0.5
## 
## - best performance: 0.1 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.22 0.13984118
## 2  1e+00   0.5  0.11 0.12866839
## 3  1e+01   0.5  0.11 0.12866839
## 4  1e+02   0.5  0.10 0.12472191
## 5  1e+03   0.5  0.13 0.14181365
## 6  1e-01   1.0  0.22 0.13984118
## 7  1e+00   1.0  0.10 0.10540926
## 8  1e+01   1.0  0.11 0.12866839
## 9  1e+02   1.0  0.12 0.13165612
## 10 1e+03   1.0  0.16 0.10749677
## 11 1e-01   2.0  0.22 0.13984118
## 12 1e+00   2.0  0.10 0.12472191
## 13 1e+01   2.0  0.12 0.13165612
## 14 1e+02   2.0  0.17 0.10593499
## 15 1e+03   2.0  0.13 0.09486833
## 16 1e-01   3.0  0.22 0.13984118
## 17 1e+00   3.0  0.12 0.12292726
## 18 1e+01   3.0  0.13 0.10593499
## 19 1e+02   3.0  0.13 0.08232726
## 20 1e+03   3.0  0.18 0.11352924
## 21 1e-01   4.0  0.22 0.13984118
## 22 1e+00   4.0  0.13 0.12516656
## 23 1e+01   4.0  0.14 0.10749677
## 24 1e+02   4.0  0.14 0.08432740
## 25 1e+03   4.0  0.18 0.11352924

Test model

test <- -train
# predict on the held-out observations using the best model from tuning
svm.pred <- predict(tune.out$best.model,
                    newdata = radial.data[test, ])
confusion.table <- table(radial.data[test, "y"], svm.pred)
confusion.table
confusion.table
##    svm.pred
##      1  2
##   1 62 10
##   2 19  9
sum(diag(confusion.table))/sum(confusion.table)
## [1] 0.71

Example 2:

Support Vector Classifier with gene expression data

  • Khan dataset (from the ISLR package):
    • 83 tissue samples (63 training, 20 test) with expression measurements for 2308 genes
    • Goal is to classify tissue samples into 4 tumor types based on gene expression

Train model

library(ISLR)
khan.train <- data.frame(x = Khan$xtrain, y = as.factor(Khan$ytrain))
# khan.tune <- tune(svm, y ~ ., data = khan.train, kernel = "linear",
#                   ranges = list(cost = c(.1, 1, 10, 100, 1000)))
khan.svm <- svm(y ~ ., data = khan.train, kernel = "linear", cost = 10)

  • gamma is not required for a linear kernel
    • it defaults to 1 / (number of features)
  • svm() uses the one-vs-one approach for multiclass problems

summary(khan.svm)
## 
## Call:
## svm(formula = y ~ ., data = khan.train, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.0004332756 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4

Training data results

table(khan.svm$fitted, khan.train$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20
  • model perfectly classifies all the training data

Test model

khan.test <- data.frame(x=Khan$xtest, y = as.factor(Khan$ytest))
pred <- predict(khan.svm, newdata=khan.test)
khan.table <- table(pred,khan.test$y)
khan.table
##     
## pred 1 2 3 4
##    1 3 0 0 0
##    2 0 6 2 0
##    3 0 0 4 0
##    4 0 0 0 5
sum(diag(khan.table))/sum(khan.table)
## [1] 0.9
  • model performs very well on test data too!

SVM vs Logistic Regression

  • SVM tends to do better when the classes are (nearly) separable
  • When the classes are not separable, logistic regression (with a ridge penalty) and SVM behave very similarly
  • You can estimate class probabilities with logistic regression, but not directly with SVM
  • You can use kernels with logistic regression too, but it is more computationally costly
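
A minimal sketch of the probability point, assuming simulated two-class data: logistic regression (via glm) returns estimated class probabilities, while svm() returns predicted class labels.

library(e1071)
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(100, sd = 0.5) > 0, 1, 0))
lr.dat <- data.frame(x = x, y = y)

lr.fit  <- glm(y ~ ., data = lr.dat, family = binomial)
svm.fit <- svm(y ~ ., data = lr.dat, kernel = "linear", cost = 1)

head(predict(lr.fit,  newdata = lr.dat, type = "response"))   # estimated probabilities of class "1"
head(predict(svm.fit, newdata = lr.dat))                      # class labels only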

Resources