Abstract

Background

Human Activity Recognition (HAR) data has become ubiquitous with the advent of devices like the Jawbone Up, Nike FuelBand, Fitbit, and even smartphones. Although users of these devices tend to quantify how much of an activity they do, they rarely consider “how well” they perform it.

Results

Using RStudio, we import HAR data for weight-lifting exercises provided by a Groupware@LES study to see whether we can predict how well exercises were performed by users of HAR devices. After some initial data wrangling, we analyze the data, visualize the results, and develop a machine learning model. This deep-dive analysis shows how to perform the typical steps for this type of machine learning project.

Conclusion

RStudio is an excellent platform for this type of analytical project and helps analysts tell a story with data. In combination with RMarkdown, RStudio readily facilitates analysis, sharing, and reproducibility. Analysts may weave together narrative text and code to seamlessly produce elegantly formatted documents in various output formats (e.g., HTML, PDF, MS Word).


Implementation

With devices like the Jawbone Up, Nike FuelBand, and Fitbit, it is feasible to inexpensively collect and analyze a large quantity of HAR data. With the Groupware@LES HAR data, we analyze six study participants who were asked to perform dumbbell exercises with correct form (Class A) and with the most common mistakes: throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Using this HAR class data, we leverage machine learning algorithms to develop a model that distinguishes participants who exercised correctly from those who didn’t, and identifies what their mistakes may have been. We will then use this model to assess how accurately we can classify the exercises performed by other individuals solely based on their HAR data.

Technical Notes: This analysis is statically generated with R 3.3.2 and RMarkdown 1.3 for reproducibility. The complete codebase is available on my GitHub page. Many code blocks are cached for performant iterative analysis. When updating code blocks, pay special attention to subsequent cached code blocks that may need to be updated as a result. To learn more, have a peek at the RStudio/RMarkdown documentation and cheatsheets.
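For example, caching is controlled per chunk (a minimal sketch with hypothetical chunk names; the dependson option tells knitr to invalidate a cached chunk whenever the chunk it depends on changes):

```{r load-data, cache=TRUE}
trn.dat <- read.csv("data/pml-har-trn.csv")
```

```{r clean-data, cache=TRUE, dependson="load-data"}
# Re-runs whenever the cached load-data chunk changes.
trn.dat <- trn.dat[, colSums(is.na(trn.dat)) == 0]
```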

Data Processing

Getting the Data

Download the training and test data, if not already in the project data folder.

setwd("/home/rstudio/code")

trnFile = "data/pml-har-trn.csv"
tstFile = "data/pml-har-tst.csv"
if (!file.exists(trnFile)) {
    trnfileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    download.file(trnfileUrl, destfile = trnFile, method = "curl")
}
if (!file.exists(tstFile)) {
    tstfileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
    download.file(tstfileUrl, destfile = tstFile, method = "curl")
}

Data Wrangling and Preparation

The goal of data preparation for machine learning is to identify the variables that explain as much of the class variance as possible. In other words, we need to find the smallest set of variables that can explain everything that is going on. To improve model performance and simplify our analysis, we discard the variables not related to exercise performance.

# Load libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(knitr)
library(corrplot)
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Define number of parallel processes
registerDoMC(cores = 8)

# Load the training file
trn.dat = read.csv(trnFile, na.strings=c("NA","NaN", "", " "), stringsAsFactors=FALSE); rm("trnFile")

# Quick exploratory analysis
dim.init <- dim(trn.dat); dim.init
## [1] 19622   160
str(trn.dat)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : chr  NA NA NA NA ...
##  $ kurtosis_picth_belt     : chr  NA NA NA NA ...
##  $ kurtosis_yaw_belt       : chr  NA NA NA NA ...
##  $ skewness_roll_belt      : chr  NA NA NA NA ...
##  $ skewness_roll_belt.1    : chr  NA NA NA NA ...
##  $ skewness_yaw_belt       : chr  NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : chr  NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : chr  NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : chr  NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : chr  NA NA NA NA ...
##  $ kurtosis_picth_arm      : chr  NA NA NA NA ...
##  $ kurtosis_yaw_arm        : chr  NA NA NA NA ...
##  $ skewness_roll_arm       : chr  NA NA NA NA ...
##  $ skewness_pitch_arm      : chr  NA NA NA NA ...
##  $ skewness_yaw_arm        : chr  NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : chr  NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : chr  NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : chr  NA NA NA NA ...
##  $ skewness_roll_dumbbell  : chr  NA NA NA NA ...
##  $ skewness_pitch_dumbbell : chr  NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : chr  NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : chr  NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : chr  NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

## Removing unnecessary covariates
# Columns not related to exercise performance.
col.nrel <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", 
              "cvtd_timestamp", "new_window", "num_window")
# Columns with near zero variance
nsv <- nearZeroVar(trn.dat, saveMetrics = TRUE)
col.nzvs <- rownames(nsv[nsv$nzv == TRUE, ])
# Update the training set with the remaining fields
col.drops <- c(col.nrel, col.nzvs)
trn.dat.all <- trn.dat[,!(names(trn.dat) %in% col.drops)]

# Identify fields with large quantity of missing values.
col.names = colnames(trn.dat.all)
cnt.min = sum(complete.cases(trn.dat.all))
trn.cnt.df <- data.frame(Field=character(), FieldCnt=integer(), stringsAsFactors=FALSE)
# Create a data frame of field names and their non-missing value counts.
for (fld in seq_along(col.names)) {
    trn.cnt.df[fld,] <- list(col.names[fld], sum(!is.na(trn.dat.all[[col.names[fld]]])))
}
# Keep the fields that are more complete than the complete-case count,
# i.e. drop the sparsely populated summary fields.
col.keep <- as.vector(subset(trn.cnt.df, FieldCnt > cnt.min)$Field)
# Update the training set with the remaining fields
trn.dat.all <- trn.dat.all[col.keep]

## Cleanup and default model settings
# Set the classe variable as a factor.
trn.dat.all$classe <- as.factor(trn.dat.all$classe)
dim.finl <- dim(trn.dat.all); dim.finl
## [1] 19622    53

# house-cleaning
rm("cnt.min", "col.drops", "col.keep", "col.names", "col.nrel", "col.nzvs", "fld", "trn.cnt.df", "nsv","trn.dat")

# Cutoff parameters
corrRt = 0.80 # Correlation Cutoff 
dSplit = 0.60 # Training Cutoff

## Set model parameters for model tuning and training
cntFld = 10    # Number of cross-validation folds
cntRpt = 8    # Increase the Repeat count for improved cross-validation accuracy
cntTun = 5    # Parameter for tuning accuracy vs resource consumption

Our exploratory analysis shows that the Groupware@LES file is just shy of twenty thousand records and that some of the variables have near-zero variance or are mostly missing. After discarding them, our variable count drops from 160 down to 53. The class variable in the provided data is relatively balanced, which simplifies this analysis somewhat.

kable(t(table(trn.dat.all$classe)), caption = "Class Variable Frequency:")
Class Variable Frequency:
|    A|    B|    C|    D|    E|
|----:|----:|----:|----:|----:|
| 5580| 3797| 3422| 3216| 3607|

Handling Multicollinearity

Multicollinearity occurs when two or more predictor variables are highly correlated with each other. This inflates the standard errors of the coefficient estimates, which can make variables appear statistically insignificant when they should be significant. By computing the Variance Inflation Factor (VIF), we can quantify the severity of this impact. We implemented a stepwise VIF function to identify multicollinear variables within a specified threshold. A VIF greater than 10 is generally considered highly collinear; the threshold may be set to the specific tolerance required for your analysis.
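As a quick illustration of the underlying calculation (a sketch using the fmsb package, which the stepwise function below also relies on; roll_belt is just an example predictor), the VIF for a predictor is obtained by regressing it on all of the other predictors, where VIF = 1 / (1 - R²):

library(fmsb)
# Regress one predictor on all remaining predictors;
# the VIF is 1 / (1 - R^2) of that regression.
predictors <- trn.dat.all[, setdiff(names(trn.dat.all), "classe")]
VIF(lm(roll_belt ~ ., data = predictors))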

We can reduce the number of predictors to a smaller set of uncorrelated components using Partial Least Squares Regression (PLS) or Principal Components Analysis (PCA), or by simply removing them from the model. Removing predictors may introduce some bias (the difference between the expected prediction and the truth), but may also reduce the prediction variance (increasing accuracy). The goal is to optimize the trade-off between bias and variance. These options reduce our variable count and speed up our model calculations.

The goal of PCA is to explain the maximum amount of variance with the smallest number of principal components. PCA creates these components by transforming the variables into a smaller sub-space of variables (dimensionality reduction) that are uncorrelated with each other. A challenge with the components is that it can be difficult to explain what is driving model performance. Using the Caret R package, we apply a Box-Cox transformation to correct for skewness, center and scale each variable, and then apply PCA in one call to pre-process the variables.
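As a standalone illustration of that one call (a sketch using R’s built-in iris measurements rather than our HAR data), preProcess builds the transformation and predict() applies it:

# Box-Cox, center, scale, then PCA in a single preprocessing object;
# thresh (default 0.95) sets the fraction of variance the components must retain.
pp <- preProcess(iris[, 1:4], method = c("BoxCox", "center", "scale", "pca"), thresh = 0.95)
pp$numComp                    # number of principal components retained
head(predict(pp, iris[, 1:4]))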

We are going to approach this three different ways to get a sense of effort and impact:

  1. remove the multicollinear predictors,
  2. apply PCA to the multicollinear predictors, and
  3. apply PCA to all the predictors.
# compile stepwise VIF selection function for reducing collinearity among explanatory variables
vif_func<-function(in_frame, thresh=10, trace=T, ...){
  require(fmsb)
  
  if(!is.data.frame(in_frame)) in_frame<-data.frame(in_frame)
  
  #get initial vif value for all comparisons of variables
  vif_init<-NULL
  var_names <- names(in_frame)
  for(val in var_names){
      regressors <- var_names[-which(var_names == val)]
      form <- paste(regressors, collapse = '+')
      form_in <- formula(paste(val, '~', form))
      vif_init<-rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, ...))))
  }
  vif_max<-max(as.numeric(vif_init[,2]))

  if(vif_max < thresh){
    if(trace==T){ #print output of each iteration
        prmatrix(vif_init,collab=c('var','vif'), rowlab=rep('', nrow(vif_init)), quote=F)
        cat('\n')
        cat(paste('All variables have VIF < ', thresh,', max VIF ',round(vif_max,2), sep=''),'\n\n')
    }
    return(var_names)
  }
  else {
    in_dat<-in_frame

    #backwards selection of explanatory variables, stops when all VIF values are below 'thresh'
    while(vif_max >= thresh) {
      
      vif_vals<-NULL
      var_names <- names(in_dat)
        
      for(val in var_names){
        regressors <- var_names[-which(var_names == val)]
        form <- paste(regressors, collapse = '+')
        form_in <- formula(paste(val, '~', form))
        vif_add<-VIF(lm(form_in, data = in_dat, ...))
        vif_vals<-rbind(vif_vals,c(val,vif_add))
      }
      max_row<-which(vif_vals[,2] == max(as.numeric(vif_vals[,2])))[1]

      vif_max<-as.numeric(vif_vals[max_row,2])

      if(vif_max<thresh) break
      
      if(trace==T){ #print output of each iteration
        prmatrix(vif_vals,collab=c('var','vif'),rowlab=rep('',nrow(vif_vals)),quote=F)
        cat('\n')
        cat('removed: ',vif_vals[max_row,1],vif_max,'\n\n')
        flush.console()
      }

      in_dat<-in_dat[,!names(in_dat) %in% vif_vals[max_row,1]]
    }
    return(names(in_dat))
  }
}

The plots below highlight the multicollinearity across variables for each approach. Review the summary below each plot to see the change in variable count and to understand the overall impact on multicollinearity.

Baseline Correlation Matrix

This plot is difficult to read, but the key take-away is that the baseline variables are problematic for modeling due to significant multicollinearity, which is emphasized by the darker colored cross-sections.

# Build a correlation matrix on the initial variables, excluding the "classe" variable
corrMatx <- cor(trn.dat.all[, -dim(trn.dat.all)[2]])
corrSumm <- summary(corrMatx[upper.tri(corrMatx)])[1:6]
corrTabl <- t(as.data.frame(as.vector(corrSumm)))

colnames(corrTabl) <- attr(corrSumm, 'names') # c("Minimum", "1st Quartile", "Median", "Mean", "3rd Quartile", "Maximum") #
rownames(corrTabl) <- "Baseline Summary"

# plot the correlation matrix
corrplot(corrMatx, 
         order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.75, tl.col = "black")


# print the correlation summary
kable(dim(trn.dat.all)[2], col.names = c("Variables"), format='html')
| Variables|
|---------:|
|        53|
kable(corrTabl, digits = 4, align = 'r', format='markdown')
|                 |   Min.| 1st Qu.| Median|   Mean| 3rd Qu.|   Max.|
|:----------------|------:|-------:|------:|------:|-------:|------:|
|Baseline Summary | -0.992| -0.1101| 0.0021| 0.0018|  0.0925| 0.9809|

Drop Correlation Matrix

Using the stepwise VIF function, we dropped the variables with a VIF greater than two. Notice the fainter colors compared to the initial matrix. Although we significantly reduced the collinearity, we may lose some of the predictive variance carried by the dropped variables.

# use the stepwise VIF function to remove the highly collinear variables from the training set
var.keep <- vif_func(in_frame=trn.dat.all[, -dim(trn.dat.all)[2]], thresh=2, trace=F)

trn.dat.drp  <- trn.dat.all[,c(var.keep, "classe")]

# calculate the final VIF value for the training set with the dropped variables
var.keep <- vif_func(in_frame=trn.dat.drp[, -dim(trn.dat.drp)[2]], thresh=2, trace=T)
##  var                  vif             
##  gyros_belt_x         1.77619828389346
##  gyros_belt_z         1.34523435553331
##  magnet_belt_y        1.42090571996023
##  roll_arm             1.57141620051462
##  pitch_arm            1.58544980918429
##  yaw_arm              1.30081927466778
##  total_accel_arm      1.19690560612217
##  gyros_arm_x          1.22344553137577
##  gyros_arm_z          1.68628616597968
##  magnet_arm_x         1.75525855461027
##  roll_dumbbell        1.77193973254286
##  pitch_dumbbell       1.86197087067641
##  total_accel_dumbbell 1.95658711484206
##  gyros_dumbbell_y     1.44480121316314
##  roll_forearm         1.31411910609692
##  yaw_forearm          1.64142799639081
##  total_accel_forearm  1.31980243865119
##  gyros_forearm_x      1.70178568584006
##  gyros_forearm_y      1.73220667730309
##  accel_forearm_z      1.60105464239125
##  magnet_forearm_x     1.32156481583226
##  magnet_forearm_y     1.49600102813836
## 
## All variables have VIF < 2, max VIF 1.96
# Build a correlation matrix on the remaining variables, excluding the "classe" variable
corrMatx <- cor(trn.dat.drp[, -dim(trn.dat.drp)[2]])
corrSumm <- summary(corrMatx[upper.tri(corrMatx)])[1:6]
corrTabl <- t(as.data.frame(as.vector(corrSumm)))

colnames(corrTabl) <- attr(corrSumm, 'names')
rownames(corrTabl) <- "Drop Summary"

# plot the correlation matrix
corrplot(corrMatx, 
         order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.75, tl.col = "black")


# print the correlation summary
kable(dim(trn.dat.drp)[2], col.names = c("Variables"), format='html')
| Variables|
|---------:|
|        23|
kable(corrTabl, digits = 4, align = 'r', format='markdown')
|             |    Min.| 1st Qu.| Median|   Mean| 3rd Qu.|   Max.|
|:------------|-------:|-------:|------:|------:|-------:|------:|
|Drop Summary | -0.5319| -0.0574|   0.01| 0.0091|  0.0778| 0.5011|

Hybrid Correlation Matrix

Rather than dropping all of the variables that exceed our threshold, this step extracts their principal components using PCA. The surviving variables and principal components are then combined to create our training set. Most of the significant collinearity is removed while maintaining the predictive variance. Depending on your needs, the remaining collinearity may still be too high.

# Reassign to the PCA list the variables that push our VIF above 6
var.pck <- setdiff(var.keep, 
                   c("magnet_arm_x", "magnet_forearm_x", "gyros_arm_x", 
                     "magnet_forearm_y", "gyros_forearm_y", "total_accel_dumbbell"))

# Identify the collinear variables
var.pca <- setdiff(colnames(trn.dat.all), c(var.pck, "classe"))

# Preprocess the collinear variables
trn.ppv = preProcess(trn.dat.all[,c(var.pca)], method=c("BoxCox", "center", "scale", "pca"))

# Model the PCA components
trn.pca = (predict(trn.ppv, trn.dat.all[,c(var.pca)]))
## Warning in is.na(lam): is.na() applied to non-(list or vector) of type
## 'NULL'

# Create hybrid dataframe with noncollinear and pca variables.
trn.dat.hyb  <- cbind(trn.dat.all[,c(var.pck)], trn.pca, classe=trn.dat.all$classe)

# check for any remaining multicollinearity
var.pck <- vif_func(in_frame=trn.dat.hyb[, -dim(trn.dat.hyb)[2]], thresh=9, trace=T)
##  var                 vif             
##  gyros_belt_x        3.31789201675371
##  gyros_belt_z        2.42849164325859
##  magnet_belt_y       3.50222035177036
##  roll_arm            2.00971443486715
##  pitch_arm           1.79503632379843
##  yaw_arm             1.34711160187722
##  total_accel_arm     2.28352087026064
##  gyros_arm_z         2.41099423077938
##  roll_dumbbell       2.52451436108404
##  pitch_dumbbell      3.90244415241768
##  gyros_dumbbell_y    1.99746736714151
##  roll_forearm        1.53016707390642
##  yaw_forearm         1.70219092561984
##  total_accel_forearm 1.49518955648635
##  gyros_forearm_x     2.99888704127112
##  accel_forearm_z     3.64910457658203
##  PC1                 3.89238900830164
##  PC2                 5.33765975875524
##  PC3                 2.3092285807121 
##  PC4                 2.07114739930967
##  PC5                 4.05071781388632
##  PC6                 3.12727507721498
##  PC7                 1.66528105065898
##  PC8                 1.49857627121578
##  PC9                 1.41945553681272
##  PC10                2.43447020550641
##  PC11                2.03570673916657
##  PC12                1.62852301913156
##  PC13                1.45250217630237
##  PC14                1.57873278234641
##  PC15                1.65837047683967
## 
## All variables have VIF < 9, max VIF 5.34
# Build a correlation matrix on the hybrid variables, excluding the "classe" variable
corrMatx <- cor(trn.dat.hyb[, -dim(trn.dat.hyb)[2]])
corrSumm <- summary(corrMatx[upper.tri(corrMatx)])[1:6]
corrTabl <- t(as.data.frame(as.vector(corrSumm)))

colnames(corrTabl) <- attr(corrSumm, 'names')
rownames(corrTabl) <- "Hybrid Summary"

# plot the correlation matrix
corrplot(corrMatx, 
         order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.75, tl.col = "black")


# print the correlation summary
kable(dim(trn.dat.hyb)[2], col.names = c("Variables"), format='html')
| Variables|
|---------:|
|        32|
kable(corrTabl, digits = 4, align = 'r', format='markdown')
|               |    Min.| 1st Qu.| Median|    Mean| 3rd Qu.|   Max.|
|:--------------|-------:|-------:|------:|-------:|-------:|------:|
|Hybrid Summary | -0.6064| -0.0724|      0| -0.0027|  0.0576| 0.6142|

PCA Correlation Matrix

For maximum reduction of multicollinearity, this approach transforms all the variables into principal components. Notice how we have eliminated all multicollinearity. The only real challenge with this approach is that it is less intuitive to understand and describe what drives the model performance.

# Identify all the variables
var.pca  <- setdiff(colnames(trn.dat.all), c("classe"))

# Preprocess all the variables
trn.ppv = preProcess(trn.dat.all[,c(var.pca)], method=c("BoxCox", "center", "scale", "pca"))

# Model the PCA components
trn.pca = (predict(trn.ppv, trn.dat.all[,c(var.pca)]))

# Create pca dataframe.
trn.dat.pca  <- cbind(trn.pca, classe=trn.dat.all$classe)

# check for any remaining multicollinearity
var.pca <- vif_func(in_frame=trn.dat.pca[, -dim(trn.dat.pca)[2]], thresh=9, trace=T)
##  var  vif
##  PC1  1  
##  PC2  1  
##  PC3  1  
##  PC4  1  
##  PC5  1  
##  PC6  1  
##  PC7  1  
##  PC8  1  
##  PC9  1  
##  PC10 1  
##  PC11 1  
##  PC12 1  
##  PC13 1  
##  PC14 1  
##  PC15 1  
##  PC16 1  
##  PC17 1  
##  PC18 1  
##  PC19 1  
##  PC20 1  
##  PC21 1  
##  PC22 1  
##  PC23 1  
##  PC24 1  
##  PC25 1  
## 
## All variables have VIF < 9, max VIF 1
# Build a correlation matrix on the principal components, excluding the "classe" variable
corrMatx <- cor(trn.dat.pca[, -dim(trn.dat.pca)[2]])
corrSumm <- summary(corrMatx[upper.tri(corrMatx)])[1:6]
corrTabl <- t(as.data.frame(as.vector(corrSumm)))

colnames(corrTabl) <- attr(corrSumm, 'names')
rownames(corrTabl) <- "PCA Summary"

# plot the correlation matrix
corrplot(corrMatx, 
         order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.75, tl.col = "black")


# print the correlation summary
kable(dim(trn.dat.pca)[2], col.names = c("Variables"), format='html')
| Variables|
|---------:|
|        26|
kable(corrTabl, digits = 4, align = 'r', format='markdown')
|            | Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.|
|:-----------|----:|-------:|------:|----:|-------:|----:|
|PCA Summary |    0|       0|      0|    0|       0|    0|

Machine Learning

Tuning Preparation

We are using a 60% data split to define our training set. To provide a meaningful estimation of performance, the remaining 40% will be divided equally to define our validation and test sets. The individual models are tuned with the training set, and assessed against the validation set. The test set is for estimating the ensemble model performance.

With the Caret R package, we use repeated k-fold cross-validation with 10 folds across 8 repetitions to improve our accuracy estimates. We preprocess the data by centering and scaling it to better expose the underlying structure and relationships in the predictors. We pre-calculate a vector of seeds for reproducibility across multiple parallel model runs.

# Select the PCA dataset for training.
trn.dat <- trn.dat.pca

# Partition the data with training, validation, and test sets.
set.seed(1732)
inTrain <- createDataPartition(y=trn.dat$classe, p=dSplit, list=FALSE)
trn.trn <- trn.dat[ inTrain,]
trn.vld <- trn.dat[-inTrain,]
inValid <- createDataPartition(y=trn.vld$classe, p=0.50, list=FALSE)
trn.tst <- trn.vld[-inValid,]
trn.vld <- trn.vld[ inValid,]

# Setup seeds for running fully reproducible model in parallel mode
ivect = cntRpt*cntFld                                        # quantity of integer vectors
seeds <- vector(mode = "list", length = ivect+1)             # length equals the integer vectors plus one
for(i in 1:ivect) seeds[[i]] <- sample.int(n=1000, cntTun*3) # seeds for tuning attempts (*3 for Radial SVM)
seeds[[length(seeds)]] <- sample.int(1000, 1)                # seed for the final model

# Create cross-validation folds to use for model tuning
myMetr <- "Kappa"     # ROC, Kappa, Accuracy
myFlds <- createMultiFolds(y=trn.trn$classe, k = cntFld, times = cntRpt)
myCtrl <- trainControl(method = "repeatedCV", seeds = seeds, allowParallel = TRUE, 
                       #savePredictions=TRUE, classProbs=TRUE, summaryFunction=multiClassSummary,
                       number = cntFld, repeats = cntRpt)

# Standardize the training subset.
myPrep = c("center", "scale")   # pre-process with center and scaling

# house-cleaning
rm("i", "cntFld", "cntRpt", "inTrain", "corrMatx", "vif_func")

Grid Searching

There are two ways to tune an algorithm using the Caret R package. The first method is to let the system do it automatically by setting the tuneLength parameter, which indicates the number of different values to try for each algorithm parameter. This makes a crude guess at which values to try, but can often get you within a reasonable range. This is the approach we’ve taken on our first tuning pass. The second approach involves manually setting the tuning grid to search. Our second tuning pass uses this approach: we start with the best performing parameters from the first tuning pass and create a more focused tuning grid to further refine our parameters.
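A minimal sketch of the two tuning styles (hypothetical mtry values; the real passes appear in the model sections below):

# Automatic: caret picks 5 candidate values per tuning parameter.
train(classe ~ ., data = trn.trn, method = "rf",
      tuneLength = 5, metric = myMetr, trControl = myCtrl)

# Manual: search an explicit, focused grid of candidate values.
train(classe ~ ., data = trn.trn, method = "rf",
      tuneGrid = expand.grid(mtry = c(2, 3, 4)),
      metric = myMetr, trControl = myCtrl)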

We have selected three models to tune:

  1. Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel,
  2. Random Forest, and
  3. Stochastic Gradient Boosting Machine (GBM).

To improve SVM performance we doubled its grid-search area during its first pass. Also, since the GBM algorithm is computationally expensive, we did significant manual tuning during its second pass to attain reasonable performance. For good measure, we created an Ensemble (Stack) model using the three separate algorithms to improve overall performance.
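Conceptually, the ensemble is a simple stack (a sketch, assuming the yhat.* validation-set predictions computed later in this analysis; fit.stk is a hypothetical name):

# Stack the validation-set predictions from the three tuned models
# and train a combiner model on top of them.
stack.df <- data.frame(svr = yhat.svr, rft = yhat.rft, gbm = yhat.gbm,
                       classe = trn.vld$classe)
fit.stk  <- train(classe ~ ., data = stack.df, method = "rf", trControl = myCtrl)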

Technical Notes: Since this process can be very resource intensive, we use OpenBLAS, an optimized library that replaces the standard BLAS library used by R. In addition, we use the “doMC” library to specify the number of cores, which corresponds to the maximum number of R jobs to run in parallel. Depending on your tuning parameters and training set size, this process can easily consume all of your computer’s memory and may take hours to run. Running machine learning on an H2O.ai cluster helps to alleviate these challenges. For more details, see the “Distributed Random Forest” section below.

SVM Passes

First-pass Logic:

# First Model: Radial Support Vector Machine
# First Pass: Tunelength
system.time(
    # Fit model to the training set.
    tune.svr1 <- train(classe ~ ., data = trn.trn, method = "svmRadial",
                      tuneLength = cntTun*2, preProcess = myPrep, 
                      metric = myMetr, trControl = myCtrl)
)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
##     user   system  elapsed 
## 24074.66 16153.02 18107.58
tune.svr1;
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10599, 10599, 10599, 10601, 10599, 10598, ... 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.8324342  0.7878624
##     0.50  0.8668969  0.8314050
##     1.00  0.8990850  0.8721231
##     2.00  0.9255224  0.9056554
##     4.00  0.9464138  0.9321350
##     8.00  0.9616367  0.9514390
##    16.00  0.9699997  0.9620335
##    32.00  0.9752714  0.9687119
##    64.00  0.9792541  0.9737544
##   128.00  0.9812661  0.9763018
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02546125
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.02546125 and C = 128.

Second-pass Logic:

# First Model: Radial Support Vector Machine
# Second Pass: Grid search based on the parameter values selected in the first pass
mySig = round(tune.svr1$bestTune$sigma, 3)
myCst = tune.svr1$bestTune$C
myGrd = expand.grid(sigma = c(mySig-.005, mySig, mySig+.005),
                    C = c(myCst-5, myCst-2, myCst, myCst+2, myCst+5))

system.time(
    # Fit model to the training set.
    tune.svr2 <- train(classe ~ ., data = trn.trn, method = "svmRadial",
                       tuneGrid = myGrd, preProcess = myPrep, 
                       metric = myMetr, trControl = myCtrl)
)
##      user    system   elapsed 
##  6536.600 11759.313  2692.237
tune.svr2;
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10597, 10598, 10599, 10598, 10598, 10599, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C    Accuracy   Kappa    
##   0.020  123  0.9786695  0.9730145
##   0.020  126  0.9784493  0.9727355
##   0.020  128  0.9787402  0.9731037
##   0.020  130  0.9782583  0.9724951
##   0.020  133  0.9782591  0.9724958
##   0.025  123  0.9804082  0.9752153
##   0.025  126  0.9797802  0.9744204
##   0.025  128  0.9806034  0.9754615
##   0.025  130  0.9807813  0.9756867
##   0.025  133  0.9806902  0.9755724
##   0.030  123  0.9823909  0.9777244
##   0.030  126  0.9823292  0.9776469
##   0.030  128  0.9821991  0.9774824
##   0.030  130  0.9822425  0.9775371
##   0.030  133  0.9822996  0.9776091
## 
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.03 and C = 123.

Random Forest

First-pass Logic:

# Second Model: Random Forest (an extension of bagging where each tree is grown
#               on a bootstrap sample and each split considers a random subset of predictors)
# First Pass: Tune Length
system.time(
    # Fit a random forest model to the training set (importance and proximity scores optional).
    tune.rft1 <- train(classe ~ ., data = trn.trn, method="rf",
                       #proximity=TRUE, importance=TRUE,  # produce extra info 
                       tuneLength = cntTun, preProcess = myPrep, 
                       metric = myMetr, trControl = myCtrl)
)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
##     user   system  elapsed 
## 8286.293 1893.681 1071.296
tune.rft1
## Random Forest 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10597, 10598, 10599, 10598, 10598, 10599, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9675717  0.9589737
##    7    0.9650024  0.9557260
##   13    0.9610645  0.9507485
##   19    0.9572961  0.9459845
##   25    0.9500997  0.9368821
## 
## Kappa was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

Second-pass Logic:

# Second Model: Random Forest (an extension of bagging where each tree is grown
#               on a bootstrap sample and each split considers a random subset of predictors)
# Second Pass: Grid search based on the parameter values selected in the first pass
myMtry = max(2, tune.rft1$bestTune$mtry)
myGrd = expand.grid(mtry = c(myMtry-1, myMtry, myMtry+1))

system.time(
    # Fit model to the training set.
    tune.rft2 <- train(classe ~ ., data = trn.trn, method = "rf",
                       tuneGrid = myGrd, preProcess = myPrep, 
                       metric = myMetr, trControl = myCtrl)
)
##     user   system  elapsed 
## 4376.450   12.110  499.645
tune.rft2;
## Random Forest 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10599, 10599, 10597, 10598, 10598, 10598, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.9670300  0.9582867
##   2     0.9675927  0.9590027
##   3     0.9674332  0.9588003
## 
## Kappa was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

Gradient Boost

First-pass Logic:

# Third Model: Stochastic Gradient Boost Machine
# First Pass: Tune Length
system.time(
    # Fit a stochastic gradient boosting model to the training set.
    tune.gbm1 <- train(classe ~ ., data = trn.trn, method="gbm", verbose=FALSE,
                       tuneLength = cntTun, preProcess = myPrep, 
                       metric = myMetr, trControl = myCtrl)
)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
##     user   system  elapsed 
## 7251.765   12.320 1005.324
tune.gbm1
## Stochastic Gradient Boosting 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10599, 10599, 10597, 10598, 10598, 10598, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.5655798  0.4413891
##   1                  100      0.6223259  0.5181829
##   1                  150      0.6529919  0.5581476
##   1                  200      0.6721623  0.5829542
##   1                  250      0.6851750  0.5997820
##   2                   50      0.6553595  0.5609974
##   2                  100      0.7258409  0.6518923
##   2                  150      0.7620582  0.6981533
##   2                  200      0.7862183  0.7289950
##   2                  250      0.8060144  0.7542075
##   3                   50      0.7137819  0.6363846
##   3                  100      0.7786710  0.7193591
##   3                  150      0.8154823  0.7662459
##   3                  200      0.8405764  0.7981354
##   3                  250      0.8589720  0.8214626
##   4                   50      0.7511771  0.6841993
##   4                  100      0.8148239  0.7653929
##   4                  150      0.8498854  0.8099350
##   4                  200      0.8742566  0.8408339
##   4                  250      0.8919943  0.8633073
##   5                   50      0.7780858  0.7185874
##   5                  100      0.8390576  0.7961497
##   5                  150      0.8743628  0.8409646
##   5                  200      0.8958578  0.8681977
##   5                  250      0.9102515  0.8864271
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 250,
##  interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.

Second-pass Logic:

# Third Model: Stochastic Gradient Boost Machine
# Second Pass: Grid search based on the parameter values selected in the first pass
myTrees = tune.gbm1$bestTune$n.trees; myTrees
## [1] 250
myDepth = tune.gbm1$bestTune$interaction.depth; myDepth
## [1] 5
myMinob = tune.gbm1$bestTune$n.minobsinnode; myMinob
## [1] 10

# Lower learning rates are preferred; however, they require a larger number of
# trees to model all the relationships, which is computationally expensive.
# Note: higher values of the following parameters may lead to overfitting:
# * Depth (5-8) should be chosen based on the number of observations and predictors. 
# * Shrinkage (0.05-0.2) controls the learning rate.
myGrd = expand.grid(n.trees = c(myTrees+1199),
                    interaction.depth = c(myDepth+1),
                    shrinkage = c(0.08),
                    n.minobsinnode = c(15:17))

system.time(
    # Fit model to the training set.
    tune.gbm2 <- train(classe ~ ., data = trn.trn, method="gbm", verbose=FALSE,
                       tuneGrid = myGrd, preProcess = myPrep, 
                       metric = myMetr, trControl = myCtrl)
)
##      user    system   elapsed 
## 39628.030    53.503  5821.597
tune.gbm2;
## Stochastic Gradient Boosting 
## 
## 11776 samples
##    25 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold, repeated 8 times) 
## Summary of sample sizes: 10599, 10598, 10599, 10599, 10598, 10599, ... 
## Resampling results across tuning parameters:
## 
##   n.minobsinnode  Accuracy   Kappa    
##   15              0.9649826  0.9557060
##   16              0.9643474  0.9549029
##   17              0.9658317  0.9567802
## 
## Tuning parameter 'n.trees' was held constant at a value of 1449
## 
## Tuning parameter 'interaction.depth' was held constant at a value of
##  6
## Tuning parameter 'shrinkage' was held constant at a value of 0.08
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 1449,
##  interaction.depth = 6, shrinkage = 0.08 and n.minobsinnode = 17.

Distributed Random Forest

This machine learning algorithm is distributed across an H2O cluster. Notice how the h2o library is used to initialize the H2O “cloud” instance, with a maximum number of threads and memory specified. This example leverages H2O’s “intelligent” defaults. To learn how to complete repeated k-fold cross-validation using H2O, check out this H2O blog post.
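For reference, H2O can also perform plain (non-repeated) k-fold cross-validation directly via the nfolds argument (a minimal sketch; it assumes the trn.hex frame created below, and drf.cv is a hypothetical name):

# 10-fold cross-validated distributed random forest; the cross-validation
# metrics are stored on the returned model object.
drf.cv <- h2o.randomForest(training_frame = trn.hex, x = 1:25, y = 26,
                           ntrees = 200, nfolds = 10, seed = 1732)
h2o.performance(drf.cv, xval = TRUE)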

First-pass Logic: The first pass runs with standard parameters and stopping rounds. Note that stopping rounds will stop model training when accuracy doesn’t improve for a specified number of training rounds, based on a simple moving average.

# Load H2O R package
library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following object is masked from 'package:pROC':
## 
##     var
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     ||, &&, %*%, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, %in%, is.character, is.factor, is.numeric,
##     log, log10, log1p, log2, round, signif, trunc

# Create an H2O cloud instance
h2o.conn = h2o.init(ip='localhost',      ## IP address of the H2O node
                    port=54321,          ## H2O port at the node IP
                    nthreads=-1,         ## -1: use all available threads
                    max_mem_size='56G')  ## specify the memory size for the H2O cloud
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpvtS3hM/h2o_roobyz_started_from_r.out
##     /tmp/RtmpvtS3hM/h2o_roobyz_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: ...... Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 seconds 216 milliseconds 
##     H2O cluster version:        3.10.2.2 
##     H2O cluster version age:    19 days  
##     H2O cluster name:           H2O_started_from_R_roobyz_dkz899 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   49.78 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)

# Clean slate - just in case the cluster was already running
h2o.removeAll()
## [1] 0

## Load dataframe into H2O cloud instance
trn.hex <- as.h2o(trn.trn, destination_frame="trn.hex")
##   |=================================================================| 100%
vld.hex <- as.h2o(trn.vld, destination_frame="vld.hex")
##   |=================================================================| 100%
tst.hex <- as.h2o(trn.tst, destination_frame="tst.hex")
##   |=================================================================| 100%

###############################################################################
## run our first predictive model
drf1 <- h2o.randomForest(        ## h2o.randomForest function
  training_frame = trn.hex,      ## the H2O frame for training
  validation_frame = vld.hex,    ## the H2O frame for validation (not required)
  x=1:25,                        ## the predictor columns, by column index
  y=26,                          ## the target index (what we are predicting)
  model_id = "rf_covType_v1",    ## name the model in H2O not required, but helps use Flow
  ntrees = 400,                  ## use a maximum of 400 trees to create the random forest model. The default is 50.
                                 ##  I have increased it because I will let the early stopping criteria decide when
                                 ##  the random forest is sufficiently accurate
  stopping_rounds = 2,           ## Stop fitting new trees when the 2-tree average is within 0.001 (default) of 
                                 ##  the prior two 2-tree averages. Can be thought of as a convergence setting
  score_each_iteration = T,      ## Predict against training and validation for each tree. Default will skip several.
  seed = 1732)                   ## Set the random seed so that this can be reproduced.
##   |=================================================================| 100%

###############################################################################
#summary(drf1)                    ## View information about the model.
                                 ## Keys to look for are validation performance
                                 ##  and variable importance

#drf1@model$validation_metrics    ## A more direct way to access the validation 
                                 ##  metrics. Performance metrics depend on 
                                 ##  the type of model being built. With a
                                 ##  multinomial classification, we will primarily
                                 ##  look at the confusion matrix, and overall
                                 ##  accuracy via hit_ratio @ k=1.

h2o.confusionMatrix(drf1)
## Confusion Matrix: vertical: actual; across: predicted
##           A    B    C    D    E  Error           Rate
## A      3259   31   24   27    7 0.0266 =   89 / 3,348
## B        63 2114   60   14   28 0.0724 =  165 / 2,279
## C        14   43 1933   42   22 0.0589 =  121 / 2,054
## D        12    7  106 1789   16 0.0731 =  141 / 1,930
## E         9   35   42   26 2053 0.0517 =  112 / 2,165
## Totals 3357 2230 2165 1898 2126 0.0533 = 628 / 11,776
h2o.hit_ratio_table(drf1, valid = T)[1,2] ## Even more directly, the hit_ratio @ k=1
## [1] 0.9620188
###############################################################################

Second-pass Logic: For the second pass, we increased the maximum depth and the number of stopping rounds, and tightened the stopping tolerance.

###############################################################################
## run our second predictive model
drf2 <- h2o.randomForest(        ##
  training_frame = trn.hex,      ## the H2O frame for training
  validation_frame = vld.hex,    ## the H2O frame for validation (not required)
  x=1:25,                        ## the predictor columns, by column index
  y=26,                          ## the target index (what we are predicting)
  model_id = "rf_covType_v2",    ## name the model in H2O not required, but helps use Flow
  ntrees = 400,                  ## use a maximum of 400 trees to create the random forest model. The default is 50.
  max_depth = 30,                ## Increase depth, from 20
  stopping_rounds = 4,           ##
  stopping_tolerance = 2e-4,     ## switch from .001 to .0002
  score_each_iteration = T,      ##
  seed=1732)                     ##
##   |=================================================================| 100%

###############################################################################
#summary(drf2)
h2o.confusionMatrix(drf1)
## Confusion Matrix: vertical: actual; across: predicted
##           A    B    C    D    E  Error           Rate
## A      3259   31   24   27    7 0.0266 =   89 / 3,348
## B        63 2114   60   14   28 0.0724 =  165 / 2,279
## C        14   43 1933   42   22 0.0589 =  121 / 2,054
## D        12    7  106 1789   16 0.0731 =  141 / 1,930
## E         9   35   42   26 2053 0.0517 =  112 / 2,165
## Totals 3357 2230 2165 1898 2126 0.0533 = 628 / 11,776
h2o.confusionMatrix(drf2)
## Confusion Matrix: vertical: actual; across: predicted
##           A    B    C    D    E  Error           Rate
## A      3264   32   23   24    5 0.0251 =   84 / 3,348
## B        47 2140   61   15   16 0.0610 =  139 / 2,279
## C         9   41 1953   32   19 0.0492 =  101 / 2,054
## D         9   18   87 1807    9 0.0637 =  123 / 1,930
## E         4   25   20   22 2094 0.0328 =   71 / 2,165
## Totals 3333 2256 2144 1900 2143 0.0440 = 518 / 11,776

h2o.hit_ratio_table(drf1, valid=T)[1,2]     ## original random forest accuracy
## [1] 0.9620188
h2o.hit_ratio_table(drf2, valid=T)[1,2]     ## newest random forest accuracy
## [1] 0.9666072


###############################################################################
drf.pred <- h2o.predict(object=drf2, newdata=tst.hex)
##   |=================================================================| 100%

## Glance at what that prediction set looks like
## We see a final prediction in the "predict" column,
##  and then the predicted probabilities per class.
drf.pred
##   predict A B C D E
## 1       A 1 0 0 0 0
## 2       A 1 0 0 0 0
## 3       A 1 0 0 0 0
## 4       A 1 0 0 0 0
## 5       A 1 0 0 0 0
## 6       A 1 0 0 0 0
## 
## [3923 rows x 6 columns]

## Compare these predictions to the accuracy we got from our experimentation
h2o.hit_ratio_table(drf2, valid=T)[1,2]     ## newest random forest accuracy
## [1] 0.9666072
mean(drf.pred$predict==tst.hex$classe)      ## test set accuracy
## [1] 0.9694112

print(h2o.auc(drf2, valid = TRUE))
## Warning in h2o.auc(drf2, valid = TRUE): No AUC for H2OMultinomialModel
## NULL

#h2o.download_pojo(drf2)

### All done, shut down H2O
h2o.shutdown(prompt=FALSE)
## [1] TRUE

Tuning Model Accuracy

After tuning these models we can identify the parameters that generate the best performance per model and compare how well the models perform against each other.

Summary

mcomp = rbind(getTrainPerf(tune.svr2),
              getTrainPerf(tune.rft2),
              getTrainPerf(tune.gbm2))

kable(mcomp)
| TrainAccuracy| TrainKappa|method    |
|-------------:|----------:|:---------|
|     0.9823909|  0.9777244|svmRadial |
|     0.9675927|  0.9590027|rf        |
|     0.9658317|  0.9567802|gbm       |

Support Vector

plot(tune.svr1, metric = "Kappa", main = "First Pass")

plot(tune.svr2, metric = "Kappa", main = "Second Pass")

Random Forest

plot(tune.rft1, metric = "Kappa", main = "First Pass")

plot(tune.rft2, metric = "Kappa", main = "Second Pass")

Gradient Boost

plot(tune.gbm1, metric = "Kappa", main = "First Pass")

plot(tune.gbm2, metric = "Kappa", main = "Second Pass")

Density Plot

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Compare Models
resamps <- resamples(list(
             '1.SVM.Radial'     = tune.svr2,
             '2.Random.Forest'  = tune.rft2,
             '3.Gradient.Boost' = tune.gbm2
           ))
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: 1.SVM.Radial, 2.Random.Forest, 3.Gradient.Boost 
## Number of resamples: 38 
## 
## Accuracy 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## 1.SVM.Radial     0.9686  0.9796 0.9834 0.9824  0.9862 0.9890    0
## 2.Random.Forest  0.9559  0.9643 0.9677 0.9675  0.9718 0.9762    0
## 3.Gradient.Boost 0.9525  0.9643 0.9673 0.9668  0.9703 0.9762    0
## 
## Kappa 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## 1.SVM.Radial     0.9602  0.9742 0.9791 0.9778  0.9825 0.9860    0
## 2.Random.Forest  0.9442  0.9549 0.9592 0.9589  0.9643 0.9699    0
## 3.Gradient.Boost 0.9399  0.9548 0.9586 0.9581  0.9624 0.9699    0

# Using Kappa as a more conservative measure of accuracy.
densityplot(resamps, metric='Kappa', 
            auto.key=list(space='right', 
                          trellis.par.set(superpose.line = list(lwd = 2.5, lty = 1))
                          )
            )


# Compare the lagged differences between the models.
mdiffs <- diff(resamps)
summary(mdiffs)
## 
## Call:
## summary.diff.resamples(object = mdiffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##                  1.SVM.Radial 2.Random.Forest 3.Gradient.Boost
## 1.SVM.Radial                  0.0149072       0.0155976       
## 2.Random.Forest  2.489e-14                    0.0006905       
## 3.Gradient.Boost < 2.2e-16    1                               
## 
## Kappa 
##                  1.SVM.Radial 2.Random.Forest 3.Gradient.Boost
## 1.SVM.Radial                  0.0188602       0.0197278       
## 2.Random.Forest  2.501e-14                    0.0008676       
## 3.Gradient.Boost < 2.2e-16    1

Fitting the Models

Now that we have the tuned parameters, we select and fit the final models. With the parameter grids fixed, fitting is fast compared to the tuning steps.

# Retrieve the training errors for the SVM model
err.svr <- tune.svr2$finalModel@error; err.svr
## [1] 0.003226902

# Retrieve the OOB error for the Random Forest model based on number of trees used
ntr     <- tune.rft2$finalModel$ntree; ntr
## [1] 500
err.rft <- tune.rft2$finalModel$err.rate[ntr,1]; err.rft
##       OOB 
## 0.0303159

# Identify the lowest error rate by number of trees
treerr.df <- as.data.frame(tune.rft2$finalModel$err.rate[,1])
treerr.df$trees <- seq_len(nrow(treerr.df))
colnames(treerr.df) <- c("err","trees")

# Reduce the number of trees for the Random Forest to the count with the lowest OOB error
ntr <- which.min(treerr.df$err); ntr
## [1] 374
err.rft <- tune.rft2$finalModel$err.rate[ntr, 1]; err.rft
##        OOB 
## 0.03014606
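
To visualize where the OOB error bottoms out, the plot method from the randomForest package (loaded earlier) draws the error trace directly from the fitted model; a minimal sketch:

# Plot the OOB and per-class error rates as a function of the number of trees grown
plot(tune.rft2$finalModel, main = "Random Forest Error by Number of Trees")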

Support Vector

# Calculate the final model with the tuning parameters from the above.
mySig = tune.svr2$bestTune$sigma
myCst = tune.svr2$bestTune$C
newGrid = expand.grid(sigma=c(mySig), C=c(myCst))

mySig; myCst
## [1] 0.03
## [1] 123

system.time(
    # Fit model to the training set.
    fit.svr <- train(classe ~ ., data = trn.trn, method = "svmRadial", 
                     tuneGrid = newGrid, 
                     preProcess = myPrep, trControl = myCtrl)
)
##    user  system elapsed 
## 451.660 758.820 181.388

Random Forest

# Calculate the final model with the tuning parameters from the above.
ntr
## [1] 374

newGrid = expand.grid(mtry=c(tune.rft2$bestTune$mtry))
system.time(
    fit.rft <- train(classe ~ ., data = trn.trn, method="rf", 
                     tuneGrid = newGrid, ntree = ntr,
                     preProcess = myPrep, trControl = myCtrl)
)
##    user  system elapsed 
## 917.180 130.964 132.719

Gradient Boost

# Calculate the final model with the tuning parameters from the above.
myTrees = tune.gbm2$bestTune$n.trees
myDepth = tune.gbm2$bestTune$interaction.depth
myShrnk = tune.gbm2$bestTune$shrinkage
myMNode = tune.gbm2$bestTune$n.minobsinnode

myTrees; myDepth; myShrnk; myMNode
## [1] 1449
## [1] 6
## [1] 0.08
## [1] 17

myGrd = expand.grid(n.trees = c(myTrees),
                    interaction.depth = c(myDepth),
                    shrinkage = c(myShrnk),
                    n.minobsinnode = c(myMNode))

system.time(
    fit.gbm <- train(classe ~ ., data = trn.trn, method="gbm", verbose=FALSE,
                     tuneGrid = myGrd, preProcess = myPrep, 
                     metric = myMetr, trControl = myCtrl)
)
##      user    system   elapsed 
## 13436.916    15.847  2039.345

Separate Model Analysis & Validation

During this step we run the tuned models against our validation set to assess performance. The resulting statistics indicate solid predictive performance and suggest that we have not overfit the models.

# Assess the separate model accuracy on the validation set.
yhat.svr <- predict(fit.svr, trn.vld)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
yhat.rft <- predict(fit.rft, trn.vld)
yhat.gbm <- predict(fit.gbm, trn.vld)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr

cm.svr <- confusionMatrix(yhat.svr, trn.vld$classe)
cm.rft <- confusionMatrix(yhat.rft, trn.vld$classe)
cm.gbm <- confusionMatrix(yhat.gbm, trn.vld$classe)

Support Vector

# Confusion Matrix
kable(cm.svr$table, caption = "Confusion Matrix:")
Confusion Matrix:
A B C D E
A 1110 7 2 4 0
B 6 750 3 0 3
C 0 2 673 16 3
D 0 0 6 622 0
E 0 0 0 1 715
# Class Summary
kable(t(cm.svr$byClass), caption = "Statistics by Class:")
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9946237 0.9881423 0.9839181 0.9673406 0.9916782
Specificity 0.9953687 0.9962073 0.9935165 0.9981707 0.9996877
Pos Pred Value 0.9884239 0.9842520 0.9697406 0.9904459 0.9986034
Neg Pred Value 0.9978571 0.9971528 0.9965934 0.9936267 0.9981291
Precision 0.9884239 0.9842520 0.9697406 0.9904459 0.9986034
Recall 0.9946237 0.9881423 0.9839181 0.9673406 0.9916782
F1 0.9915141 0.9861933 0.9767779 0.9787569 0.9951287
Prevalence 0.2844762 0.1934744 0.1743564 0.1639052 0.1837879
Detection Rate 0.2829467 0.1911802 0.1715524 0.1585521 0.1822585
Detection Prevalence 0.2862605 0.1942391 0.1769054 0.1600816 0.1825134
Balanced Accuracy 0.9949962 0.9921748 0.9887173 0.9827557 0.9956830

Random Forest

# Confusion Matrix
kable(cm.rft$table, caption = "Confusion Matrix:")
Confusion Matrix:
A B C D E
A 1103 17 3 4 1
B 3 731 6 0 6
C 7 11 667 30 6
D 2 0 7 608 4
E 1 0 1 1 704
# Class Summary
kable(t(cm.rft$byClass), caption = "Statistics by Class:")
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9883513 0.9631094 0.9751462 0.9455677 0.9764216
Specificity 0.9910937 0.9952592 0.9833282 0.9960366 0.9990631
Pos Pred Value 0.9778369 0.9798928 0.9251040 0.9790660 0.9957567
Neg Pred Value 0.9953488 0.9911867 0.9946908 0.9894004 0.9947139
Precision 0.9778369 0.9798928 0.9251040 0.9790660 0.9957567
Recall 0.9883513 0.9631094 0.9751462 0.9455677 0.9764216
F1 0.9830660 0.9714286 0.9494662 0.9620253 0.9859944
Prevalence 0.2844762 0.1934744 0.1743564 0.1639052 0.1837879
Detection Rate 0.2811624 0.1863370 0.1700229 0.1549834 0.1794545
Detection Prevalence 0.2875350 0.1901606 0.1837879 0.1582972 0.1802192
Balanced Accuracy 0.9897225 0.9791843 0.9792372 0.9708021 0.9877424

Gradient Boost

# Confusion Matrix
kable(cm.gbm$table, caption = "Confusion Matrix:")
Confusion Matrix:
A B C D E
A 1103 16 3 5 1
B 8 719 6 2 8
C 3 16 664 29 9
D 1 0 10 606 5
E 1 8 1 1 698
# Class Summary
kable(t(cm.gbm$byClass), caption = "Statistics by Class:")
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9883513 0.9472991 0.9707602 0.9424572 0.9680999
Specificity 0.9910937 0.9924147 0.9824020 0.9951220 0.9965646
Pos Pred Value 0.9778369 0.9676985 0.9209431 0.9742765 0.9844852
Neg Pred Value 0.9953488 0.9874214 0.9937539 0.9887913 0.9928438
Precision 0.9778369 0.9676985 0.9209431 0.9742765 0.9844852
Recall 0.9883513 0.9472991 0.9707602 0.9424572 0.9680999
F1 0.9830660 0.9573901 0.9451957 0.9581028 0.9762238
Prevalence 0.2844762 0.1934744 0.1743564 0.1639052 0.1837879
Detection Rate 0.2811624 0.1832781 0.1692582 0.1544736 0.1779251
Detection Prevalence 0.2875350 0.1893959 0.1837879 0.1585521 0.1807290
Balanced Accuracy 0.9897225 0.9698569 0.9765811 0.9687896 0.9823323

Ensemble Model Analysis & Comparison

Although the separate models perform well individually, we build an ensemble model to highlight the process and the potential gain in predictive performance. Different algorithms find some variables more predictive than others and rank them differently, which shows up in the variables of importance (see below). To get additional lift, it is important that the stacked models have differing variables of importance, because the ensemble model capitalizes on those differences to boost predictive performance.

After applying the separate models to our validation set, we use their predictions as “modelled predictors” to train the ensemble model. We then assess performance against our test set. This “stacked” model should have improved predictive performance over the separate models.

set.seed(1732)
# Dataframe with predictions of the separate models, combined with the classe variable (from the validation set)
# in preparation for training the ensemble model against the "validation" data.
pred.vld <- data.frame(yhat.svr, yhat.rft, yhat.gbm, classe=trn.vld$classe)

# Train the ensemble model on the "stacked" predictors
# Note: since it was the fastest of the trio, the SVM algorithm is used to build the ensemble
fit.stack <- train(classe ~ ., data=pred.vld, method="svmRadial",
                   metric = myMetr, trControl = myCtrl)

# Compare the ensemble model training performance to the separate models.
rbind(getTrainPerf(fit.stack),
      getTrainPerf(fit.svr),
      getTrainPerf(fit.rft),
      getTrainPerf(fit.gbm))
##   TrainAccuracy TrainKappa    method
## 1     0.9862357  0.9825861 svmRadial
## 2     0.9824536  0.9778050 svmRadial
## 3     0.9674126  0.9587721        rf
## 4     0.9650669  0.9558121       gbm

# To test the ensemble performance we create "stacked" predictors based
# on the test set for ensemble prediction assessment.
yhat.svr <- predict(fit.svr, trn.tst)
yhat.rft <- predict(fit.rft, trn.tst)
yhat.gbm <- predict(fit.gbm, trn.tst)

# Create "stacked" dataframe of modelled predicters and the classe variable (from the test set)
# in preparation for training the ensemble model against the "test" data.
pred.tst <- data.frame(yhat.svr, yhat.rft, yhat.gbm, classe=trn.tst$classe)

# Predict the model classe variable using the "stacked" predictors
yhat.stk <- predict(fit.stack, pred.tst)

# Compare the predicted class against the actual class variable.
cm.stk <- confusionMatrix(yhat.stk, trn.tst$classe)
str(cm.stk$byClass)
##  num [1:5, 1:11] 0.988 0.989 0.972 0.972 0.993 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:5] "Class: A" "Class: B" "Class: C" "Class: D" ...
##   ..$ : chr [1:11] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
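
For a single headline number, the overall test-set accuracy can be read straight off the caret confusion matrix object; a minimal sketch (the ensemble test accuracy cited in the conclusion below can be recovered this way):

# Overall ensemble accuracy on the held-out test set
cm.stk$overall["Accuracy"]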

Variables of Importance

Below are the top 10 GBM variables in comparison to the other models (100=most important; 0=least important). Notice how the variables of importance vary across models.

# Compare the variables of importance.
viv = cbind(varImp(tune.gbm2, scale = TRUE)$importance,
            varImp(tune.rft2, scale = TRUE)$importance,
            varImp(tune.svr2, scale = TRUE)$importance)
colnames(viv)[1] = "Gradient_Boost"
colnames(viv)[2] = "Random_Forest"
colnames(viv)[3:7] = c("SVM-A","SVM-B","SVM-C","SVM-D","SVM-E")

kable(head(viv[with(viv, order(-Gradient_Boost)), ], 10),
      digits = 2)
Gradient_Boost Random_Forest SVM-A SVM-B SVM-C SVM-D SVM-E
PC8 100.00 100.00 72.94 100.00 72.94 72.94 100.00
PC5 96.62 73.61 23.18 14.06 26.35 38.01 23.18
PC14 92.25 91.56 20.36 20.12 2.27 48.45 20.36
PC12 81.61 83.33 40.61 28.45 28.45 41.35 40.61
PC1 64.87 81.25 17.10 8.31 8.31 39.42 17.10
PC2 58.37 55.95 21.86 14.14 18.26 41.84 21.86
PC3 55.68 69.51 71.44 78.40 76.92 71.44 78.40
PC9 51.42 63.22 43.27 43.27 43.27 43.27 2.16
PC15 46.77 58.91 33.03 32.01 32.01 32.01 33.03
PC25 43.56 41.62 51.54 51.54 60.59 51.54 47.07
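
If a visual ranking is preferred, caret’s varImp objects also have a plot method; a minimal sketch for the GBM model:

# Dotplot of the ten most important principal components for the GBM model
plot(varImp(tune.gbm2, scale = TRUE), top = 10)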

Model Correlation

Similar to checking for multicollinearity among the predictors of the separate models, we need to verify that the modelled predictors are fairly uncorrelated. Ideally, their pairwise correlations should be below 0.75.

results <- resamples(list(
             SVM.Radial     = fit.svr,
             Random.Forest  = fit.rft,
             Gradient.Boost = fit.gbm
           ))

# Check that the modelled predictors are fairly uncorrelated (< 0.75)
kable(modelCor(results), caption = "Model Correlation Matrix:")
Model Correlation Matrix:
SVM.Radial Random.Forest Gradient.Boost
SVM.Radial 1.0000000 0.0139791 0.0701732
Random.Forest 0.0139791 1.0000000 0.2294552
Gradient.Boost 0.0701732 0.2294552 1.0000000

Density Plots

Although the Gradient Boost and Random Forest models do not perform as well as the Support Vector Machine model, the ensemble algorithm is able to learn from their variance and outperform all three separate models.

results <- resamples(list(
             '1.SVM.Radial'     = fit.svr,
             '2.Random.Forest'  = fit.rft,
             '3.Gradient.Boost' = fit.gbm,
             '4.Ensemble.Stack' = fit.stack
           ))

# Summarize the ensemble results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: 1.SVM.Radial, 2.Random.Forest, 3.Gradient.Boost, 4.Ensemble.Stack 
## Number of resamples: 80 
## 
## Accuracy 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## 1.SVM.Radial     0.9745  0.9796 0.9830 0.9825  0.9849 0.9890    0
## 2.Random.Forest  0.9550  0.9635 0.9677 0.9674  0.9711 0.9788    0
## 3.Gradient.Boost 0.9474  0.9609 0.9652 0.9651  0.9686 0.9771    0
## 4.Ensemble.Stack 0.9720  0.9822 0.9872 0.9862  0.9898 0.9974    0
## 
## Kappa 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## 1.SVM.Radial     0.9678  0.9742 0.9785 0.9778  0.9809 0.9860    0
## 2.Random.Forest  0.9431  0.9538 0.9591 0.9588  0.9635 0.9731    0
## 3.Gradient.Boost 0.9334  0.9506 0.9560 0.9558  0.9603 0.9710    0
## 4.Ensemble.Stack 0.9646  0.9775 0.9839 0.9826  0.9871 0.9968    0
lapply(fit.stack, summary)
## $method
##    Length     Class      Mode 
##         1 character character 
## 
## $modelInfo
##            Length Class      Mode     
## label      1      -none-     character
## library    1      -none-     character
## type       2      -none-     character
## parameters 3      data.frame list     
## grid       1      -none-     function 
## loop       0      -none-     NULL     
## fit        1      -none-     function 
## predict    1      -none-     function 
## prob       1      -none-     function 
## predictors 1      -none-     function 
## tags       4      -none-     character
## levels     1      -none-     function 
## sort       1      -none-     function 
## 
## $modelType
##    Length     Class      Mode 
##         1 character character 
## 
## $results
##      sigma               C             Accuracy          Kappa       
##  Min.   :0.03751   Min.   :0.2500   Min.   :0.9859   Min.   :0.9821  
##  1st Qu.:0.03751   1st Qu.:0.3750   1st Qu.:0.9859   1st Qu.:0.9822  
##  Median :0.03751   Median :0.5000   Median :0.9859   Median :0.9822  
##  Mean   :0.03751   Mean   :0.5833   Mean   :0.9860   Mean   :0.9823  
##  3rd Qu.:0.03751   3rd Qu.:0.7500   3rd Qu.:0.9861   3rd Qu.:0.9824  
##  Max.   :0.03751   Max.   :1.0000   Max.   :0.9862   Max.   :0.9826  
##    AccuracySD          KappaSD        
##  Min.   :0.005437   Min.   :0.006878  
##  1st Qu.:0.005449   1st Qu.:0.006895  
##  Median :0.005462   Median :0.006911  
##  Mean   :0.005461   Mean   :0.006909  
##  3rd Qu.:0.005473   3rd Qu.:0.006925  
##  Max.   :0.005483   Max.   :0.006938  
## 
## $pred
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $bestTune
##      sigma               C      
##  Min.   :0.03751   Min.   :0.5  
##  1st Qu.:0.03751   1st Qu.:0.5  
##  Median :0.03751   Median :0.5  
##  Mean   :0.03751   Mean   :0.5  
##  3rd Qu.:0.03751   3rd Qu.:0.5  
##  Max.   :0.03751   Max.   :0.5  
## 
## $call
## Length  Class   Mode 
##      6   call   call 
## 
## $dots
## Length  Class   Mode 
##      0   list   list 
## 
## $metric
##    Length     Class      Mode 
##         1 character character 
## 
## $control
##                   Length Class  Mode     
## method             1     -none- character
## number             1     -none- numeric  
## repeats            1     -none- numeric  
## search             1     -none- character
## p                  1     -none- numeric  
## initialWindow      0     -none- NULL     
## horizon            1     -none- numeric  
## fixedWindow        1     -none- logical  
## skip               1     -none- numeric  
## verboseIter        1     -none- logical  
## returnData         1     -none- logical  
## returnResamp       1     -none- character
## savePredictions    1     -none- character
## classProbs         1     -none- logical  
## summaryFunction    1     -none- function 
## selectionFunction  1     -none- character
## preProcOptions     6     -none- list     
## sampling           0     -none- NULL     
## index             80     -none- list     
## indexOut          80     -none- list     
## indexFinal         0     -none- NULL     
## timingSamps        1     -none- numeric  
## predictionBounds   2     -none- logical  
## seeds             81     -none- list     
## adaptive           4     -none- list     
## trim               1     -none- logical  
## allowParallel      1     -none- logical  
## 
## $finalModel
## Length  Class   Mode 
##      1   ksvm     S4 
## 
## $preProcess
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $trainingData
##  .outcome yhat.svr yhat.rft yhat.gbm
##  A:1116   A:1123   A:1128   A:1128  
##  B: 759   B: 762   B: 746   B: 743  
##  C: 684   C: 694   C: 721   C: 721  
##  D: 643   D: 628   D: 621   D: 622  
##  E: 721   E: 716   E: 707   E: 709  
## 
## $resample
##     Accuracy          Kappa          Resample        
##  Min.   :0.9720   Min.   :0.9646   Length:80         
##  1st Qu.:0.9822   1st Qu.:0.9775   Class :character  
##  Median :0.9872   Median :0.9839   Mode  :character  
##  Mean   :0.9862   Mean   :0.9826                     
##  3rd Qu.:0.9898   3rd Qu.:0.9871                     
##  Max.   :0.9974   Max.   :0.9968                     
## 
## $resampledCM
##      cell1           cell2            cell3             cell4       
##  Min.   :108.0   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:110.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :111.0   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :110.7   Mean   :0.6167   Mean   :0.05833   Mean   :0.1375  
##  3rd Qu.:112.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :112.0   Max.   :3.0000   Max.   :1.00000   Max.   :2.0000  
##      cell5             cell6            cell7           cell8       
##  Min.   :0.00000   Min.   :0.0000   Min.   :73.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:74.75   1st Qu.:0.0000  
##  Median :0.00000   Median :1.0000   Median :75.00   Median :0.0000  
##  Mean   :0.04583   Mean   :0.6833   Mean   :75.00   Mean   :0.2167  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:76.00   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :3.0000   Max.   :76.00   Max.   :2.0000  
##      cell9       cell10      cell11        cell12        cell13    
##  Min.   :0   Min.   :0   Min.   :0.0   Min.   :0.0   Min.   :65.0  
##  1st Qu.:0   1st Qu.:0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:66.0  
##  Median :0   Median :0   Median :0.0   Median :0.0   Median :67.0  
##  Mean   :0   Mean   :0   Mean   :0.2   Mean   :0.3   Mean   :67.3  
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0.0   3rd Qu.:1.0   3rd Qu.:68.0  
##  Max.   :0   Max.   :0   Max.   :1.0   Max.   :2.0   Max.   :69.0  
##      cell14        cell15      cell16           cell17      cell18     
##  Min.   :0.0   Min.   :0   Min.   :0.0000   Min.   :0   Min.   :0.000  
##  1st Qu.:0.0   1st Qu.:0   1st Qu.:0.0000   1st Qu.:0   1st Qu.:1.000  
##  Median :0.0   Median :0   Median :0.0000   Median :0   Median :2.000  
##  Mean   :0.6   Mean   :0   Mean   :0.3167   Mean   :0   Mean   :1.608  
##  3rd Qu.:1.0   3rd Qu.:0   3rd Qu.:1.0000   3rd Qu.:0   3rd Qu.:2.000  
##  Max.   :3.0   Max.   :0   Max.   :2.0000   Max.   :0   Max.   :5.000  
##      cell19          cell20        cell21      cell22        cell23   
##  Min.   :58.00   Min.   :0.0   Min.   :0   Min.   :0.0   Min.   :0.0  
##  1st Qu.:62.00   1st Qu.:0.0   1st Qu.:0   1st Qu.:0.0   1st Qu.:0.0  
##  Median :62.00   Median :0.0   Median :0   Median :0.0   Median :0.0  
##  Mean   :62.27   Mean   :0.1   Mean   :0   Mean   :0.3   Mean   :0.3  
##  3rd Qu.:63.00   3rd Qu.:0.0   3rd Qu.:0   3rd Qu.:1.0   3rd Qu.:1.0  
##  Max.   :64.00   Max.   :1.0   Max.   :0   Max.   :2.0   Max.   :2.0  
##      cell24      cell25         sigma               C         
##  Min.   :0   Min.   :69.0   Min.   :0.03751   Min.   :0.2500  
##  1st Qu.:0   1st Qu.:71.0   1st Qu.:0.03751   1st Qu.:0.2500  
##  Median :0   Median :72.0   Median :0.03751   Median :0.5000  
##  Mean   :0   Mean   :71.5   Mean   :0.03751   Mean   :0.5833  
##  3rd Qu.:0   3rd Qu.:72.0   3rd Qu.:0.03751   3rd Qu.:1.0000  
##  Max.   :0   Max.   :73.0   Max.   :0.03751   Max.   :1.0000  
##    Resample        
##  Length:240        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
## $perfNames
##    Length     Class      Mode 
##         2 character character 
## 
## $maximize
##    Mode    TRUE    NA's 
## logical       1       0 
## 
## $yLimits
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $times
##            Length Class     Mode   
## everything 5      proc_time numeric
## final      5      proc_time numeric
## prediction 3      -none-    logical
## 
## $levels
##    Length     Class      Mode 
##         5 character character 
## 
## $terms
##  Length  Class1  Class2    Mode 
##       3   terms formula    call 
## 
## $coefnames
##    Length     Class      Mode 
##        12 character character 
## 
## $contrasts
##          Length Class  Mode     
## yhat.svr 1      -none- character
## yhat.rft 1      -none- character
## yhat.gbm 1      -none- character
## 
## $xlevels
##          Length Class  Mode     
## yhat.svr 5      -none- character
## yhat.rft 5      -none- character
## yhat.gbm 5      -none- character

# Using Kappa as a more conservative measure of accuracy.
trellis.par.set(superpose.line = list(lwd = 2.5, lty = 1))
densityplot(results, metric = 'Kappa', auto.key = list(space = 'right'))

Box Plots

The box plots indicate that the ensemble model has reduced bias and increased variance; however, that variance trends toward higher accuracy. In other words, the ensemble model’s expected predictions are closer to the truth than those of any of the separate models.

# box and whisker plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)

#bwplot(results, metric="ROC")

Per-Class Statistics

# Confusion Matrix
kable(cm.stk$table, caption = "Confusion Matrix:")
Confusion Matrix:
A B C D E
A 1103 5 2 0 0
B 11 751 9 0 1
C 0 1 665 17 4
D 1 1 7 625 0
E 1 1 1 1 716
# Class Summary
kable(t(cm.stk$byClass), caption = "Statistics by Class:")
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9883513 0.9894598 0.9722222 0.9720062 0.9930652
Specificity 0.9975062 0.9933628 0.9932078 0.9972561 0.9987508
Pos Pred Value 0.9936937 0.9727979 0.9679767 0.9858044 0.9944444
Neg Pred Value 0.9953786 0.9974611 0.9941286 0.9945272 0.9984390
Precision 0.9936937 0.9727979 0.9679767 0.9858044 0.9944444
Recall 0.9883513 0.9894598 0.9722222 0.9720062 0.9930652
F1 0.9910153 0.9810581 0.9700948 0.9788567 0.9937543
Prevalence 0.2844762 0.1934744 0.1743564 0.1639052 0.1837879
Detection Rate 0.2811624 0.1914351 0.1695131 0.1593168 0.1825134
Detection Prevalence 0.2829467 0.1967882 0.1751211 0.1616110 0.1835330
Balanced Accuracy 0.9929287 0.9914113 0.9827150 0.9846312 0.9959080

Conclusion

Our statistics, including accuracy, precision and recall, indicate that the ensemble model is well generalized and not overfit to the training data. The ensemble model’s estimated training accuracy (98.62%) shows a 0.38% predictive improvement over the SVM model (98.25%).

Let’s see how the ensemble performs compared to the separate models against the test data (a sketch for reproducing these figures follows the list):

  • Ensemble: 98.39%
  • SVM-Radial: 98.34%
  • Random Forest: 97.43%
  • Gradient Boost: 97.12%
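
These figures can be reproduced from the test-set predictions generated earlier; a minimal sketch, assuming the yhat.* vectors and trn.tst from the ensemble analysis above:

# Overall test-set accuracy for each model, expressed as a percentage
sapply(list(Ensemble.Stack = yhat.stk,
            SVM.Radial     = yhat.svr,
            Random.Forest  = yhat.rft,
            Gradient.Boost = yhat.gbm),
       function(p) round(100 * mean(p == trn.tst$classe), 2))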

In addition to the improved estimated mean performance, the ensemble model has estimated per-class performance that is equal to or better than that of the separate models:

# Assess how the ensemble achieves higher accuracy
# Compare the per-class F1 score (the harmonic mean of precision and recall) across the models
pred.perf <- rbind(confusionMatrix(yhat.svr, trn.tst$classe)$byClass[, 7],
                   confusionMatrix(yhat.rft, trn.tst$classe)$byClass[, 7],
                   confusionMatrix(yhat.gbm, trn.tst$classe)$byClass[, 7],
                   confusionMatrix(yhat.stk, trn.tst$classe)$byClass[, 7])
row.names(pred.perf) <- c("SVM.Radial", "Random.Forest", "Gradient.Boost", "Ensemble.Stack")
# Class Summary
kable(pred.perf, caption = "Per-class F1 Score (Harmonic Mean of Precision and Recall):")
Per-class F1 Score (Harmonic Mean of Precision and Recall):
Class: A Class: B Class: C Class: D Class: E
SVM.Radial 0.9901257 0.9810581 0.9700948 0.9772549 0.9937543
Random.Forest 0.9839429 0.9648308 0.9561467 0.9693637 0.9909281
Gradient.Boost 0.9829443 0.9606815 0.9486623 0.9702194 0.9867411
Ensemble.Stack 0.9910153 0.9810581 0.9700948 0.9788567 0.9937543

Technical Notes: Receiver Operating Characteristic (ROC) curve analysis was not designed for multi-class problems; however, using the pROC library we calculate a mean area under the curve (AUC) of 99.37% on the test data, although a single multi-class ROC curve cannot be plotted. To learn more about ROC curve analysis, have a look at this BioMed Central (BMC) research paper: “Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77”.
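
As a rough illustration of how such a multi-class AUC can be computed, pROC’s multiclass.roc implements the Hand and Till (2001) mean AUC; a minimal sketch, assuming yhat.stk and trn.tst from above and treating the predicted class labels as numeric scores (a simplification):

library(pROC)
# Mean AUC over all pairwise class comparisons (Hand & Till method)
mroc <- multiclass.roc(as.numeric(trn.tst$classe), as.numeric(yhat.stk))
mroc$auc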



The code and content in this post are licensed by Roberto Rivera to the public
under a Creative Commons Attribution 4.0 License