Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A : exactly according to the specification
Class B : throwing the elbows to the front
Class C : lifting the dumbbell only halfway
Class D : lowering the dumbbell only halfway
Class E : throwing the hips to the front
Predict the manner in which they did the exercise.
Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The outcome variable classe is categorical with five levels (A, B, C, D, E), so it is best predicted with a classification method. Since the Coursera course used random forests extensively, I will start with a random forest and compare it against an SVM using C-classification.
I will use k-fold cross-validation with k = 3. A for loop will create 3 distinct pairs of training and test sets, producing 3 accuracy values that are averaged to estimate overall accuracy.
Ideally the out-of-sample error rate would be 0% and the accuracy 100%. In practice, an accuracy anywhere between 70% and 99% (equivalently, an out-of-sample error or misclassification rate between 30% and 1%) is generally considered acceptable.
The out-of-sample error rate is simply: 1 - accuracy rate.
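For example (the 0.95 here is a made-up accuracy, purely to illustrate the arithmetic):
accuracy <- 0.95   # hypothetical accuracy of some fitted model
1 - accuracy       # out-of-sample (misclassification) error rate: 0.05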
Let's start by loading all the libraries.
library(caret)
library(readr)
library(mlbench)
library(e1071)
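If any of these packages are missing, they can be installed once beforehand:
install.packages(c("caret", "readr", "mlbench", "e1071"))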
Read the training and test data
allData <- read_csv("pml-training.csv", col_names = TRUE, na = c("NA", "#DIV/0!", ""))
allDataTest <- read_csv("pml-testing.csv", col_names = TRUE, na = c("NA", "#DIV/0!", ""))
Clean the data by keeping only the columns with no missing values.
allData <- allData[, colSums(is.na(allData)) == 0]
allDataTest <- allDataTest[, colSums(is.na(allDataTest)) == 0]
Remove the identifier and timestamp columns (the first seven), which have no relationship to the outcome variable.
allData <- allData[, -c(1:7)]
allDataTest <- allDataTest[, -c(1:7)]
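As an optional sanity check (my addition, assuming the usual column names of this dataset), we can confirm that both data frames now share the same predictors and differ only in their final column:
dim(allData)  # rows and columns remaining after cleaning
dim(allDataTest)
setdiff(names(allData), names(allDataTest))  # expected: "classe" (the training outcome)
setdiff(names(allDataTest), names(allData))  # expected: "problem_id" (the test row id)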
Show the structure of the classe variable.
class(allData$classe)
## [1] "character"
table(allData$classe)  # counts per classe level
##
## A B C D E
## 5580 3797 3422 3216 3607
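The five classes are reasonably balanced. To see the same distribution as proportions rather than counts:
round(prop.table(table(allData$classe)), 3)  # share of each classe level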
Now that we have established that the classe variable is categorical, we will use a random forest and an SVM to build models.
Before we begin modeling with the random forest, let's set up parallel processing to speed things up.
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Configure the trainControl object
fitControl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
Start cross-validation. Note the division of labor: the 3-fold CV inside trainControl is used by train() to tune the random forest within each training set, while the manually created folds below supply held-out test sets for estimating out-of-sample accuracy.
set.seed(333)
folds <- createFolds(allData$classe, k = 3)
Accu1_array <- array()  # accuracy value storage
Create 3 distinct train/test splits; in each iteration, fit the model, predict on the held-out fold, and record the accuracy.
for (i in 1:3) {
  data_train <- allData[-folds[[i]], ]
  data_test  <- allData[folds[[i]], ]
  fit_rf  <- train(factor(classe) ~ ., method = "rf", data = data_train, trControl = fitControl)
  pred_rf <- predict(fit_rf, data_test)
  # confusionMatrix() expects factors; classe was read in as character
  cm <- confusionMatrix(pred_rf, factor(data_test$classe))
  Accu1_array[i] <- as.numeric(cm$overall['Accuracy'])
}
Stop the parallel cluster and plot the predicted classe outcomes.
stopCluster(cluster)
registerDoSEQ()
qplot(pred_rf, factor(data_test$classe), xlab = "predicted", ylab = "actual", colour = data_test$classe, geom = c("boxplot", "jitter"))
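As an optional complement to the plot, a plain cross-tabulation of the last fold's predictions against the actual labels makes the off-diagonal misclassifications easy to count:
table(predicted = pred_rf, actual = data_test$classe)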
Get the mean of the 3 accuracies and show the out-of-sample error rate.
mean(Accu1_array)  # mean accuracy across the 3 random forest fits
## [1] 0.9923048
1- mean(Accu1_array)
## [1] 0.007695247
As you can see, the mean accuracy of the 3 random forest models is 99.2%, with a 0.77% out-of-sample error.
Now let's try the SVM model.
Start cross-validation.
set.seed(999)
folds <- createFolds(allData$classe, k = 3)
Accu2_array <- array()  # accuracy value storage
Create 3 distinct train/test splits; in each iteration, fit the model, predict on the held-out fold, and record the accuracy.
for (i in 1:3) {
  data_train <- allData[-folds[[i]], ]
  data_test  <- allData[folds[[i]], ]
  # e1071::svm() has no 'method' argument; the response must be a factor for C-classification
  fit_svm  <- svm(factor(classe) ~ ., data = data_train, type = "C-classification")
  pred_svm <- predict(fit_svm, data_test)
  cm <- confusionMatrix(pred_svm, factor(data_test$classe))
  Accu2_array[i] <- as.numeric(cm$overall['Accuracy'])
}
Plot the predicted SVM classe outcomes.
qplot(pred_svm, factor(data_test$classe), xlab = "predicted", ylab = "actual", colour = data_test$classe, geom = c("boxplot", "jitter"))
Get the mean of the 3 accuracies and show the out-of-sample error rate.
mean(Accu2_array)  # mean accuracy across the 3 SVM fits
## [1] 0.9337992
1- mean(Accu2_array)
## [1] 0.06620082
The mean accuracy of the 3 SVM models is 93.4%, with a 6.6% out-of-sample error.
Clearly, the random forest model, at 99% accuracy, is the more accurate of the two.
So let's apply this final random forest model to predict the 20 rows of allDataTest.
final_rf <- predict(fit_rf, allDataTest)
final_rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
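For the course quiz, each prediction had to be submitted individually; below is a minimal sketch of one way to write each prediction to its own text file (the problem_id_N.txt naming is an assumption on my part, not prescribed by the analysis above):
answers <- as.character(final_rf)
for (i in seq_along(answers)) {
  # one file per test case, e.g. problem_id_1.txt containing "B"
  write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}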
Credits: lgreski for the parallel processing approach: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md
DATASET: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.