Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A : exactly according to the specification
Class B : throwing the elbows to the front
Class C : lifting the dumbbell only halfway
Class D : lowering the dumbbell only halfway
Class E : throwing the hips to the front
Predict the manner in which they did the exercise.
Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The outcome variable classe is categorical with five levels (A, B, C, D, E), so it is best predicted with a classification method. Since the Coursera course used random forests extensively, I will start with a random forest and compare it against an SVM using C-classification.
I will use k-fold cross-validation with k = 3. A for loop will create 3 distinct pairs of training and test sets, producing 3 accuracy values that are averaged to estimate overall accuracy.
Ideally the out-of-sample error rate would be 0% and the accuracy 100%. In practice, an accuracy anywhere between 70% and 99% (equivalently, an out-of-sample error or misclassification rate between 30% and 1%) is generally considered acceptable.
The out-of-sample error rate is simply: 1 - accuracy rate.
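For example (the 0.95 here is a made-up accuracy, purely to illustrate the arithmetic):
accuracy <- 0.95   # hypothetical accuracy of some fitted model
1 - accuracy       # out-of-sample (misclassification) error rate: 0.05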
Let's start by loading all the libraries.
library(caret)
library(readr)
library(mlbench)
library(e1071)
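If any of these packages are missing, they can be installed once beforehand:
install.packages(c("caret", "readr", "mlbench", "e1071"))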
Read the training and test data
allData <- read_csv("pml-training.csv", col_names = TRUE, na = c("NA", "#DIV/0!", ""))
allDataTest <- read_csv("pml-testing.csv", col_names = TRUE, na = c("NA", "#DIV/0!", ""))
Clean the data by keeping only the columns with no missing values.
allData <- allData[, colSums(is.na(allData)) == 0]
allDataTest <- allDataTest[, colSums(is.na(allDataTest)) == 0]
Remove the identifier and timestamp columns (the first seven), which have no relationship to the outcome variable.
allData <- allData[, -c(1:7)]
allDataTest <- allDataTest[, -c(1:7)]
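As an optional sanity check (my addition, assuming the usual column names of this dataset), we can confirm that both data frames now share the same predictors and differ only in their final column:
dim(allData)  # rows and columns remaining after cleaning
dim(allDataTest)
setdiff(names(allData), names(allDataTest))  # expected: "classe" (the training outcome)
setdiff(names(allDataTest), names(allData))  # expected: "problem_id" (the test row id)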
Show the structure of the classe variable.
class(allData$classe)
## [1] "character"
table(allData$classe)  # counts per classe level
##
## A B C D E
## 5580 3797 3422 3216 3607
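The five classes are reasonably balanced. To see the same distribution as proportions rather than counts:
round(prop.table(table(allData$classe)), 3)  # share of each classe level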
Now that we have established that the classe variable is categorical, we will use a random forest and an SVM to build models.
Before we begin modeling with the random forest, let's set up parallel processing to speed things up.
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Configure the trainControl object
fitControl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
Start cross-validation. Note the division of labor: the 3-fold CV inside trainControl is used by train() to tune the random forest within each training set, while the manually created folds below supply held-out test sets for estimating out-of-sample accuracy.
set.seed(333)
folds <- createFolds(allData$classe, k = 3)
Accu1_array <- array()  # accuracy value storage
Create 3 distinct train/test splits; in each iteration, fit the model, predict on the held-out fold, and record the accuracy.
for (i in 1:3) {
  data_train <- allData[-folds[[i]], ]
  data_test  <- allData[folds[[i]], ]
  fit_rf  <- train(factor(classe) ~ ., method = "rf", data = data_train, trControl = fitControl)
  pred_rf <- predict(fit_rf, data_test)
  # confusionMatrix() expects factors; classe was read in as character
  cm <- confusionMatrix(pred_rf, factor(data_test$classe))
  Accu1_array[i] <- as.numeric(cm$overall['Accuracy'])
}
Stop the parallel cluster and plot the predicted classe outcomes.
stopCluster(cluster)
registerDoSEQ()
qplot(pred_rf, factor(data_test$classe), xlab = "predicted", ylab = "actual", colour = data_test$classe, geom = c("boxplot", "jitter"))
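As an optional complement to the plot, a plain cross-tabulation of the last fold's predictions against the actual labels makes the off-diagonal misclassifications easy to count:
table(predicted = pred_rf, actual = data_test$classe)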
Get the mean of the 3 accuracies and show the out-of-sample error rate.
mean(Accu1_array)  # mean accuracy across the 3 random forest fits
## [1] 0.9923048
1- mean(Accu1_array)
## [1] 0.007695247
As you can see, the mean accuracy of the 3 random forest models is 99.2%, with a 0.77% out-of-sample error.
Now let's try the SVM model.
Start cross-validation.
set.seed(999)
folds <- createFolds(allData$classe, k = 3)
Accu2_array <- array()  # accuracy value storage
Create 3 distinct train/test splits; in each iteration, fit the model, predict on the held-out fold, and record the accuracy.
for (i in 1:3) {
  data_train <- allData[-folds[[i]], ]
  data_test  <- allData[folds[[i]], ]
  # e1071::svm() has no 'method' argument; the response must be a factor for C-classification
  fit_svm  <- svm(factor(classe) ~ ., data = data_train, type = "C-classification")
  pred_svm <- predict(fit_svm, data_test)
  cm <- confusionMatrix(pred_svm, factor(data_test$classe))
  Accu2_array[i] <- as.numeric(cm$overall['Accuracy'])
}
Plot the predicted SVM classe outcomes.
qplot(pred_svm, factor(data_test$classe), xlab = "predicted", ylab = "actual", colour = data_test$classe, geom = c("boxplot", "jitter"))
Get the mean of the 3 accuracies and show the out-of-sample error rate.
mean(Accu2_array)  # mean accuracy across the 3 SVM fits
## [1] 0.9337992
1- mean(Accu2_array)
## [1] 0.06620082
The mean accuracy of the 3 SVM models is 93.4%, with a 6.6% out-of-sample error.
Clearly, the random forest model, at 99% accuracy, is the more accurate of the two.
So let's apply this final random forest model to predict the 20 rows of allDataTest.
final_rf <- predict(fit_rf, allDataTest)
final_rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
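For the course quiz, each prediction had to be submitted individually; below is a minimal sketch of one way to write each prediction to its own text file (the problem_id_N.txt naming is an assumption on my part, not prescribed by the analysis above):
answers <- as.character(final_rf)
for (i in seq_along(answers)) {
  # one file per test case, e.g. problem_id_1.txt containing "B"
  write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}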
Credits: lgreski for the parallel processing approach: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md
DATASET: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.