if (!require("pacman")) install.packages("pacman")
pacman::p_load(caret, e1071, kernlab, ggplot2, ISLR, Hmisc, gridExtra, RANN, AppliedPredictiveModeling)
Spam example:
library(kernlab)
data(spam)
head(spam$your)
## [1] 0.96 1.59 0.51 0.31 0.31 0.00
Density distributions of the word 'your' in non-spam (blue) and spam (red) emails:
# blue: density for non-spam; red: density for spam
plot(density(spam$your[spam$type == 'nonspam']),
     col = 'blue', main = '', xlab = "frequency of word 'your'")
lines(density(spam$your[spam$type == 'spam']), col = 'red')
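To make the decision boundary visible, we can mark the 0.5 cutoff used by the classifier below and add a legend (a small addition, not part of the original plot):
abline(v = 0.5, lty = 2)   # cutoff applied in the next step
legend('topright', legend = c('nonspam', 'spam'),
       col = c('blue', 'red'), lty = 1)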
Algorithm: set a cutoff threshold just after the peak of the blue (non-spam) density. Any email whose frequency of 'your' exceeds the cutoff is classified as spam; otherwise it is classified as non-spam.
prediction <- ifelse(spam$your > 0.5, 'spam', 'nonspam')
table(prediction, spam$type)/length(spam$type)
##
## prediction nonspam spam
## nonspam 0.4590306 0.1017170
## spam 0.1469246 0.2923278
Meaning, with this simple algorithm you will be right about spam detection approximately 75% of the time:
0.4590306 + 0.2923278
## [1] 0.7513584
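Equivalently, the overall accuracy is the sum of the diagonal of the normalized confusion table (a one-liner, reusing the prediction vector defined above):
predTable <- table(prediction, spam$type) / length(spam$type)
sum(diag(predTable))  # proportion of correct predictions
## [1] 0.7513584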
Building a predictor in terms of average number of capital letters (spam$capitalAve):
library(kernlab)
data(spam)
set.seed(333)
# random sample of 10 emails
smallSpam <- spam[sample(dim(spam)[1], size = 10), ]
# color label: 2 (red) for spam, 1 (black) for non-spam
spamLabel <- (smallSpam$type == 'spam') * 1 + 1
plot(smallSpam$capitalAve, col = spamLabel)
We will build a predictor that separates the red dots (spam) from the black dots (non-spam) based on the average number of capital letters:
rule1 <- function(x) {
  # cutoffs chosen very tightly around the 10 sampled points
  prediction <- rep(NA, length(x))
  prediction[x > 2.70] <- 'spam'
  prediction[x < 2.40] <- 'nonspam'
  prediction[x >= 2.40 & x <= 2.45] <- 'spam'
  prediction[x > 2.45 & x <= 2.70] <- 'nonspam'
  return(prediction)
}
table(rule1(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 0
## spam 0 5
rule1 is 100% correct on this sample; the in-sample error is zero.
rule2 <- function(x) {
  # a single, simpler cutoff
  prediction <- rep(NA, length(x))
  prediction[x > 2.8] <- 'spam'
  prediction[x <= 2.8] <- 'nonspam'
  return(prediction)
}
table(rule2(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 1
## spam 0 4
rule2 is about 90% correct; the in-sample error is 10%.
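As a quick numeric check, the in-sample accuracy of both rules can be computed by comparing predictions against the true labels:
c(rule1 = mean(rule1(smallSpam$capitalAve) == smallSpam$type),
  rule2 = mean(rule2(smallSpam$capitalAve) == smallSpam$type))
## rule1 rule2 
##   1.0   0.9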
What happens if we apply the rules to the full dataset?
table(rule1(spam$capitalAve), spam$type)
##
## nonspam spam
## nonspam 2141 588
## spam 647 1225
Applied to the full dataset, rule1's error rate jumps from 0% in-sample to about 26.84% out-of-sample ((588 + 647)/4601). This is an example of overfitting: the rules were tuned too tightly to the small sample.
table(rule2(spam$capitalAve), spam$type)
##
## nonspam spam
## nonspam 2224 642
## spam 564 1171
How many times were we right for each rule (accuracy)?
c(sum(rule1(spam$capitalAve) == spam$type),
sum(rule2(spam$capitalAve) == spam$type))
## [1] 3366 3395
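Dividing by the total number of emails, nrow(spam) = 4601, gives the out-of-sample accuracy; the simpler rule2 actually beats the overfitted rule1 on the full dataset:
c(rule1 = mean(rule1(spam$capitalAve) == spam$type),
  rule2 = mean(rule2(spam$capitalAve) == spam$type))
##     rule1     rule2 
## 0.7315801 0.7378831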
Suppose we have created a machine learning algorithm that predicts whether a link will be clicked, with 99% sensitivity and 99% specificity. The link is clicked on 1 in 1000 visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?
sensitivity <- specificity <- 0.99
population <- 100000
rateClick <- 1/1000
\(sensitivity = \frac {TP}{(TP + FN)}\), and the number of actual clicks is \(population \times rateClick = 100\), so:
positives <- population * rateClick  # 100 actual clicks
TP <- sensitivity * positives        # 99 true positives
FN <- positives - TP                 # 1 false negative
\(population = TP + FN + FP + TN\), therefore
\(FP + TN = population - (FN + TP)\)
FPplusTN <- population - (FN + TP)
\(specificity = \frac {TN}{(TN + FP)}\)
TN <- specificity * FPplusTN       # 98901 true negatives
FP <- population - (FN + TP + TN)  # 999 false positives
c(TP, FP)
## [1] 99 999
c(FN, TN)
## [1] 1 98901
The positive predictive value (PPV) is the probability that the link will actually be clicked, given a positive prediction:
\(PPV = \frac {TP}{(TP + FP)}\)
TP/(TP+FP)
## [1] 0.09016393
i.e. only about 9%: despite 99% sensitivity and specificity, the very low click rate (prevalence) drives the positive predictive value down.
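The same result follows directly from Bayes' theorem, with the click rate playing the role of prevalence, so the full confusion matrix is not strictly needed:
prevalence <- rateClick
(sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
## [1] 0.09016393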