This dataset comes from Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
if (!file.exists('training.csv')) download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv","training.csv",mode="wb")
if (!file.exists('testing.csv')) download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv","testing.csv",mode="wb")
library(readr)
training <- read_csv("training.csv")
testing <- read_csv("testing.csv")
Take a look at the variables we have.
head(colnames(training),20)
## [1] "X1" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt" "skewness_roll_belt"
## [16] "skewness_roll_belt.1" "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
tail(colnames(training),5)
## [1] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [4] "magnet_forearm_z" "classe"
The “classe” variable was coded as character, so I transformed it into a factor.
str(training$classe)
## chr [1:19622] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" ...
training$classe <- factor(training$classe)
Take a look at how many activity types we have.
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
According to the paper, “participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes”.
I tried to replicate the model-building process of the original researchers, who “used the feature selection algorithm based on correlation proposed by Hall. The algorithm was configured to use a ‘Best First’ strategy based on backtracking.”
(However, I couldn’t replicate the result of this feature selection on either the modified dataset provided by this specialization or the original dataset.)
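As a rough cross-check (not Hall’s CFS algorithm itself), a simpler correlation-based filter can be run with caret’s findCorrelation(), which flags one variable from each highly correlated pair for removal. The 0.9 cutoff here is my own choice, not the paper’s.

```r
library(caret)

# Keep only numeric sensor columns (drop the id/timestamp columns)
# and restrict to columns with no missing values
predictors <- training[, -(1:7)]
predictors <- predictors[, sapply(predictors, is.numeric)]
predictors <- predictors[, colSums(is.na(predictors)) == 0]

# Flag one member of each pair with |correlation| > 0.9
high_cor <- findCorrelation(cor(predictors), cutoff = 0.9)
reduced  <- predictors[, -high_cor]
ncol(reduced)  # number of predictors kept after the filter
```

This is a pairwise filter rather than a subset search, so it will generally keep more (and different) features than cfs().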
library(doParallel) # doParallel lets us use multiple cores, which speeds things up
cl <- makePSOCKcluster(7)
registerDoParallel(cl) # this starts things off
I used cfs() from {FSelector}, which I believe performs correlation-based feature selection. The selected features (model formula) are shown below.
library(FSelector)
subset <- cfs(classe ~ ., training[,-(1:7)])
(f <- as.simple.formula(subset, "classe"))
## classe ~ roll_belt + pitch_belt + yaw_belt + magnet_arm_x + gyros_dumbbell_y +
## magnet_dumbbell_y + pitch_forearm
## <environment: 0x000000001ee116f0>
Again, I chose random forest because it is generally a good model for classification and it was also used by the researchers: “Because of the characteristic noise in the sensor data, we used a Random Forest approach.”
I also performed 10-fold cross-validation (the default number of folds for trainControl(method = "cv")) and set the number of trees to 50.
library(caret)
modFit1 <- train(f, data = training,
method = "rf",
ntree = 50,
trControl = trainControl(method = "cv")
)
modFit1
## Random Forest
##
## 19622 samples
## 7 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17659, 17660, 17659, 17662, 17660, 17660, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9860358 0.9823404
## 4 0.9853736 0.9815020
## 7 0.9780858 0.9722815
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Since I performed 10-fold cross-validation, the cross-validated accuracy is a good estimate of the out-of-sample accuracy, and the result shows that the estimated out-of-sample error rate (1 − Accuracy) is very low (0.0139642).
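That error estimate can also be pulled directly out of the train object rather than read off the printout; this is just a convenience sketch using the resampling results caret stores in modFit1.

```r
# Row of the results table for the mtry value caret selected
best <- modFit1$results[modFit1$results$mtry == modFit1$bestTune$mtry, ]

best$Accuracy      # cross-validated accuracy of the chosen model
1 - best$Accuracy  # estimated out-of-sample error rate
```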
Finally, I used this model to predict on the test set.
predict(modFit1, newdata = testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
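As a final sanity check (not an out-of-sample estimate, since the model has already seen these rows), the fitted model can be compared against the training labels with caret’s confusionMatrix():

```r
library(caret)

# In-sample confusion matrix; a random forest will typically
# show near-perfect agreement on its own training data
confusionMatrix(predict(modFit1, newdata = training), training$classe)
```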
stopCluster(cl)