People regularly quantify how much of a particular activity they do, but they rarely quantify how well that same activity is performed. More often than not, discerning the quality of a workout requires the specialized supervision of a personal trainer.
Have you ever imagined a scenario in which your training equipment plays the role of your personal trainer?
That is what this analysis is all about. We predict how well people exercise based on data produced by accelerometers attached to their belt, forearm, arm, and dumbbell.
The overall quality of the exercise is given by the “classe” variable in the training set. Classe ‘A’ indicates an exercise performed correctly (all kudos to you, athlete). The other classes indicate common exercise mistakes.
All credit for data collection and the original analysis goes to the Human Activity Recognition (HAR) laboratory, previously detailed in this paper. Credit for the educational notes goes to the Johns Hopkins School of Biostatistics.
For this analysis a number of libraries are required. If you intend to replicate this report make sure you use the same list of dependencies:
pacman::p_load(knitr, caret, e1071, mlbench, ggplot2, ISLR, Hmisc, gridExtra, RANN,
AppliedPredictiveModeling, corrplot, randomForest, gbm, splines, parallel,
plyr, dplyr, leaps, MASS, pROC, C50,
# parallel support
doMC)
The first step is downloading the data required for this exercise. The training and testing data sets are available at the url and urlTesting locations, respectively:
url <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
file <- 'pml-training.csv'
urlTesting <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
testingFile <- 'pml-testing.csv'
download.file(urlTesting, destfile=testingFile, method="curl")
download.file(url, destfile=file, method="curl")
From an initial visual inspection we identified a few issues we will have to address when reading the files. Missing values are sometimes represented by NA, sometimes by empty strings, and sometimes by literal strings such as “#DIV/0!”. The vector na.strings lists all values in the data set that should be treated as NA. The expression grepExpr selects only the sensor-related columns. The shrinkage factor artificially reduces the data frame to lower processing times during iterations of model development. In the final version, shrinkage should be set to 1.0.
na.strings <- c('','.','NA','#DIV/0!')
grepExpr <- paste(c('.*_dumbbell', '.*_arm', '.*_belt', '.*_forearm', 'classe'), collapse='|')
shrinkage <- 1.0
We only load columns from testing into testingRaw if all of these conditions are true: the column contains at least one value that is not NA (after applying na.strings), and the column name matches grepExpr:
testingRaw <- read.csv(testingFile, na.strings=na.strings)
testingRaw <- testingRaw[, colSums(is.na(testingRaw)) != nrow(testingRaw)] # at least one value != NA
testingRaw <- testingRaw[,grep(grepExpr, names(testingRaw))]
We only load a training column into trainingRaw if the column is also defined in the testing data set and loaded in testingRaw:
trainingRaw <- read.csv(file, na.strings=na.strings)
trainingRaw <- trainingRaw[,c(colnames(testingRaw), 'classe')]
if (shrinkage < 1.0) {
set.seed(123)
index <- createDataPartition(trainingRaw$classe, p=shrinkage, list=FALSE)
trainingRaw <- trainingRaw[index,]
}
The data is partitioned into a 75%/25% split between training and testing partitions, for training, cross-validation and in/out-of-sample error testing, using createDataPartition from caret:
set.seed(123)
inTraining <- createDataPartition(trainingRaw$classe, p=.75, list=FALSE)
training <- trainingRaw[inTraining,]
dim(training)
## [1] 14718 53
testing <- trainingRaw[-inTraining,]
dim(testing)
## [1] 4904 53
We replace NA values with the average of the K nearest neighbors of each missing value. The KNN imputation algorithm is applied only to the numerical columns of the training set. We use the caret function preProcess with method='knnImpute':
numerical <- training[, sapply(training, is.numeric)]
set.seed(123)
pp <- preProcess(numerical, method=c('knnImpute'))
trainingImputed <- predict(pp, numerical)
trainingImputed$classe <- training$classe
We opted not to run an explicit feature selection step; random forests perform feature selection implicitly, and the model is tuned through cross-validation.
We train our random forest model (method='rf') with 6-fold cross-validation (method='cv' in trainControl), using parallel processing whenever available:
registerDoMC(6) # parallel support
set.seed(123)
fitForward <- train(classe ~ ., data=trainingImputed, method='rf',
trControl=trainControl(method='cv',
number=6,
allowParallel=TRUE
)
)
Despite the absence of an explicit feature selection step, the random forest classification algorithm does keep a ranking of how much each feature contributes to the outcome on each class. We call this ranking the importance, retrieved through the method varImp in caret.
importance <- varImp(fitForward)
The cross correlation of the top 4 features is given in a feature plot.
transparentTheme(trans = .3)
featurePlot(x=trainingImputed[,rownames(head(importance$importance,4))],
y=trainingImputed$classe,
plot='pairs',
auto.key=list(columns=5)
)
We can visually detect a number of clusters for each of the classes A, B, C, D and E, as well as several clear linear relationships: a strong hint that this model should perform well.
Specifically, the top 15 features in this model…
head(importance$importance, 15)
## Overall
## roll_belt 100.000000
## pitch_belt 61.054130
## yaw_belt 79.056338
## total_accel_belt 23.793061
## gyros_belt_x 6.927417
## gyros_belt_y 7.151216
## gyros_belt_z 32.320001
## accel_belt_x 10.707923
## accel_belt_y 10.639753
## accel_belt_z 44.004848
## magnet_belt_x 24.084942
## magnet_belt_y 43.564740
## magnet_belt_z 43.070473
## roll_arm 33.679146
## pitch_arm 17.851406
…are shown below, stacked proportionally:
plot(importance, top=20)
We will measure and track in-sample and out-of-sample error by comparing confusion matrices against the training partition and against a testing partition centered and scaled around metrics of the training partition.
We use the definition that in-sample error is the difference between predictions and actuals for the data points used to build the model.
We should expect the in-sample error to be 0%, equivalent to 100% accuracy, which illustrates the bias of the in-sample error measurement:
isPred <- predict(fitForward, trainingImputed)
cm <- confusionMatrix(trainingImputed$classe, isPred)
cm$table
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
As expected, no misses. In-sample error matches the theoretical expectation of 0%.
For the out-of-sample error, since the KNN imputation scales and centers the training data set, we need to center the measurements of the testing set around the means of the training partition, and scale them to the standard deviations of the same training partition. This is performed by the function centerOnReference.
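The function centerOnReference is not shown in this report; a minimal sketch consistent with the description above (centering and scaling each numeric column of the new data using statistics from the training partition) might look like this. The function name and signature are assumptions, not caret API:

```r
# Hypothetical sketch of centerOnReference: standardize each numeric column
# of `data` using the mean and standard deviation of the corresponding
# column in the `reference` (training) partition.
centerOnReference <- function(data, reference) {
  shared <- intersect(names(data), names(reference))
  for (col in shared) {
    if (is.numeric(reference[[col]]) && is.numeric(data[[col]])) {
      m <- mean(reference[[col]], na.rm = TRUE)
      s <- sd(reference[[col]], na.rm = TRUE)
      data[[col]] <- (data[[col]] - m) / s
    }
  }
  data
}
```

An alternative would be to reuse the preProcess object fitted on the training set, since predict(pp, newdata) applies the stored training-set statistics to new data.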
numericalTesting <- testing[, sapply(testing, is.numeric)]
testingImputed <- centerOnReference(numericalTesting, training)
testingImputed$classe <- testing$classe
osPred <- predict(fitForward, testingImputed)
cm <- confusionMatrix(testing$classe, osPred)
cm$table
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 2 945 2 0 0
## C 0 8 846 1 0
## D 0 0 19 783 2
## E 0 0 0 1 900
For this model the accuracy is 99.27%, so the out-of-sample error is 0.73%, with a 95% confidence interval on accuracy of 98.99% to 99.49%.
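These figures can be read directly from the confusionMatrix object computed above; a small illustration, assuming cm is the out-of-sample confusion matrix:

```r
# caret::confusionMatrix stores overall statistics in the `overall` vector,
# including the accuracy and its 95% confidence interval
accuracy <- cm$overall[['Accuracy']]                   # ~0.9927
oosError <- 1 - accuracy                               # out-of-sample error estimate
ci <- cm$overall[c('AccuracyLower', 'AccuracyUpper')]  # 95% CI on accuracy
```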
The complete confusion matrix is given by:
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 2 945 2 0 0
## C 0 8 846 1 0
## D 0 0 19 783 2
## E 0 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9927
## 95% CI : (0.9899, 0.9949)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9907
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9906 0.9758 0.9975 0.9978
## Specificity 0.9997 0.9990 0.9978 0.9949 0.9998
## Pos Pred Value 0.9993 0.9958 0.9895 0.9739 0.9989
## Neg Pred Value 0.9994 0.9977 0.9948 0.9995 0.9995
## Prevalence 0.2847 0.1945 0.1768 0.1601 0.1839
## Detection Rate 0.2843 0.1927 0.1725 0.1597 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9991 0.9948 0.9868 0.9962 0.9988
In a simple and quick fit we were able to get very close to the baseline's weighted average accuracy of 99.4%. Despite the numerical proximity of the results, we can see the baseline lies at the upper boundary of this study's confidence interval:
## AccuracyUpper
## 99.49
We were limited in terms of computing resources and time (this analysis was performed beginning to end in about 3 hours). With more time we could try ensemble methods for classification, specifically AdaBoost, but that would be beyond the intent and time allocated for this exercise.
If you want a more elaborate analysis, you can either check the original paper or refer to a longer version of this study, where we list several techniques and options for further investigation of the same raw data.
Session Configuration Details (reproducibility of this analysis)
sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel splines stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] doMC_1.3.4 iterators_1.0.8
## [3] foreach_1.4.3 C50_0.1.0-24
## [5] pROC_1.8 MASS_7.3-45
## [7] leaps_2.9 dplyr_0.4.3
## [9] plyr_1.8.3 gbm_2.1.1
## [11] randomForest_4.6-12 corrplot_0.77
## [13] AppliedPredictiveModeling_1.1-6 RANN_2.5
## [15] gridExtra_2.2.1 Hmisc_3.17-4
## [17] Formula_1.2-1 survival_2.38-3
## [19] ISLR_1.0 mlbench_2.1-1
## [21] e1071_1.6-7 caret_6.0-68
## [23] ggplot2_2.1.0 lattice_0.20-33
## [25] knitr_1.12.3 pacman_0.4.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.4 class_7.3-14 assertthat_0.1
## [4] digest_0.6.9 R6_2.1.2 chron_2.3-47
## [7] acepack_1.3-3.3 MatrixModels_0.4-1 stats4_3.2.4
## [10] evaluate_0.9 minqa_1.2.4 data.table_1.9.6
## [13] SparseM_1.7 car_2.1-2 nloptr_1.0.4
## [16] partykit_1.0-5 rpart_4.1-10 Matrix_1.2-4
## [19] rmarkdown_0.9.6 lme4_1.1-12 stringr_1.0.0
## [22] foreign_0.8-66 munsell_0.4.3 mgcv_1.8-12
## [25] htmltools_0.3.5 nnet_7.3-12 CORElearn_1.47.1
## [28] codetools_0.2-14 grid_3.2.4 nlme_3.1-125
## [31] gtable_0.2.0 DBI_0.4 magrittr_1.5
## [34] formatR_1.3 scales_0.4.0 stringi_1.0-1
## [37] reshape2_1.4.1 latticeExtra_0.6-28 RColorBrewer_1.1-2
## [40] tools_3.2.4 pbkrtest_0.4-6 yaml_2.1.13
## [43] colorspace_1.2-6 cluster_2.0.3 quantreg_5.21