Background

This dataset comes from Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Prepare the data

# Download the course data once and cache it locally
if (!file.exists('training.csv')) download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "training.csv", mode = "wb")
if (!file.exists('testing.csv')) download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "testing.csv", mode = "wb")
library(readr)
training <- read_csv("training.csv")
testing <- read_csv("testing.csv")
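
A quick sanity check, not part of the original analysis: confirm the dimensions and note that the summary-statistic columns (kurtosis_*, skewness_*, max_*, ...) are mostly NA in this dataset.

dim(training)                          # expect 19622 rows
sum(colSums(is.na(training)) > 0)      # number of columns that contain NAs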

Exploratory data analysis

Take a look at the variables we have.

head(colnames(training),20)
##  [1] "X1"                   "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "kurtosis_roll_belt"  
## [13] "kurtosis_picth_belt"  "kurtosis_yaw_belt"    "skewness_roll_belt"  
## [16] "skewness_roll_belt.1" "skewness_yaw_belt"    "max_roll_belt"       
## [19] "max_picth_belt"       "max_yaw_belt"
tail(colnames(training),5)
## [1] "accel_forearm_z"  "magnet_forearm_x" "magnet_forearm_y"
## [4] "magnet_forearm_z" "classe"

The “classe” variable was coded as character, so I transformed it into a factor.

str(training$classe)
##  chr [1:19622] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" ...
training$classe <- factor(training$classe)

Take a look at how many activity types we have.

table(training$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

According to the paper, “participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes”.

Model building

Feature selection

I tried to replicate the model-building process of the original researchers, who “used the feature selection algorithm based on correlation proposed by Hall. The algorithm was configured to use a ‘Best First’ strategy based on backtracking.”

(However, I couldn’t replicate the result of this feature selection on either the modified dataset provided by this specialization or the original dataset.)

library(doParallel)                  # doParallel lets us use multiple cores, which speeds up training
cl <- makePSOCKcluster(7)            # spin up a cluster of 7 worker processes
registerDoParallel(cl)               # register the cluster as the parallel backend
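
The cluster size above is hard-coded to 7; a more portable sketch (my assumption, not part of the original) sizes the cluster from the machine instead:

library(parallel)
cl <- makePSOCKcluster(max(1, detectCores() - 1))  # leave one core free for the OS
registerDoParallel(cl)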

I used cfs() from {FSelector}, which I believe performs correlation-based feature selection. The selected features (as a model formula) are shown below.

library(FSelector)
subset <- cfs(classe ~ ., training[,-(1:7)])
(f <- as.simple.formula(subset, "classe"))
## classe ~ roll_belt + pitch_belt + yaw_belt + magnet_arm_x + gyros_dumbbell_y + 
##     magnet_dumbbell_y + pitch_forearm
## <environment: 0x000000001ee116f0>
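
As a rough cross-check (this uses caret’s findCorrelation filter, not the CFS algorithm itself, and the 0.9 cutoff is an arbitrary choice for illustration), one could also flag highly correlated predictors directly:

library(caret)
num <- training[, -(1:7)]                        # drop id/timestamp/window columns
num <- num[, colSums(is.na(num)) == 0]           # keep only complete columns
num <- num[, sapply(num, is.numeric)]            # keep only numeric predictors
highCor <- findCorrelation(cor(num), cutoff = 0.9)
colnames(num)[highCor]                           # candidates for removal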

Train the model with Random Forest

Again, I chose random forest because it is generally a good model for classification, and it was also used by the researchers: “Because of the characteristic noise in the sensor data, we used a Random Forest approach.”

I performed 10-fold cross-validation, which caret uses both to tune mtry and to estimate out-of-sample accuracy, and I set the number of trees to 50 to keep the training time manageable.

library(caret)
modFit1 <- train(f, data = training, 
                 method = "rf",
                 ntree = 50, 
                 trControl = trainControl(method = "cv")
                 )
modFit1
## Random Forest 
## 
## 19622 samples
##     7 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17659, 17660, 17659, 17662, 17660, 17660, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9860358  0.9823404
##   4     0.9853736  0.9815020
##   7     0.9780858  0.9722815
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
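
For a per-class view, caret can also average the resampled confusion matrices across the folds (output omitted here):

confusionMatrix(modFit1)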

Since the accuracy reported above comes from 10-fold cross-validation rather than from the training set itself, it is a reasonable estimate of out-of-sample performance; the estimated out-of-sample error rate (1-Accuracy) is very low (0.0139642).
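
The quoted figure can be read straight off the resampling results stored in the fitted model:

1 - max(modFit1$results$Accuracy)    # 1 - 0.9860358
## [1] 0.0139642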

Finally, I used this model to predict on the test set.

predict(modFit1, newdata = testing)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
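
Assuming the testing file includes the course’s problem_id column (an assumption about the file layout; substitute whatever identifier your copy uses), the predictions can be paired with it:

data.frame(problem_id = testing$problem_id,
           prediction = predict(modFit1, newdata = testing))
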
stopCluster(cl)