Challenge 2

Supervised machine learning

Use the same dataset from Challenge 1. In this exercise, we’re going to use supervised machine learning to determine the different words used by point Tweets versus polygon Tweet. Our goal is to better understand how Tweet content differs based on geo-location type.

Reload the dataset and functions.R. Change the geo.code column to a dummy variable (ifelse function). We will build a machine learning classifier to determine the words that most characterize each geo-location type.

tweets <- read.csv("../datasets/panthertweets.csv", stringsAsFactors = F)
source("../functions.R")

tweets$geo.dummy <- ifelse(tweets$geo.type=="Point", 0, 1)

Create a training and test set, with 80% and 20%, respectively.

set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]

Construct the DFM. You may want to experiment with different preprocessing techniques until you achieve better performance. Hint - use the trim to remove words that are used at least five times (minDoc = 5).

library(quanteda)
twcorpus <- corpus(tweets$body)
twdfm <- dfm(twcorpus, ignoredFeatures=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"), 
  stem = TRUE, removeTwitter = TRUE)

## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1,973 documents
##    ... indexing features: 4,831 feature types
##    ... removed 148 features, from 181 supplied (glob) feature types
##    ... stemming features (English), trimmed 452 feature variants
##    ... created a 1973 x 4232 sparse dfm
##    ... complete. 
## Elapsed time: 0.427 seconds.

twdfm <- trim(twdfm, minDoc = 10)

## Removing features occurring in fewer than 10 documents: 4026

Now run the classifier. Then, compute the accuracy.

library(glmnet)
require(doMC)
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$geo.dummy[training], 
    family="binomial", alpha=0, nfolds=5, parallel=TRUE,
    type.measure="deviance")
plot(ridge)

## function to compute accuracy
accuracy <- function(ypred, y){
    tab <- table(ypred, y)
    return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
    tab <- table(ypred, y)
    return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
    tab <- table(ypred, y)
    return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$geo.dummy[test])
# confusion matrix
table(preds, tweets$geo.type[test])

##        
## preds   Point Polygon
##   FALSE   168      22
##   TRUE     66     139

# performance metrics
accuracy(preds, tweets$geo.type[test])

## [1] 0.7772152

precision(preds, tweets$geo.type[test])

## [1] 0.6780488

recall(preds, tweets$geo.type[test])

## [1] 0.863354

Identify the features that better predict that tweets the geo-location. What do you learn?

# from the different values of lambda, let's pick the best one
best.lambda <- which(ridge$lambda==ridge$lambda.min)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)

##    panther  keeppound       bank    stadium    america   carolina 
## -0.1673046  0.0930900 -1.1642903 -0.6691043 -0.9194302 -0.7958470

## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)

##           coef          word
## 203 -1.9076825         honor
## 21  -1.7753592    fpyzyv7wxl
## 85  -1.6492308        repost
## 171 -1.6185362        ticket
## 185 -1.5094207         drink
## 195 -1.4988488        uptown
## 128 -1.4749844         photo
## 158 -1.4290188        rivera
## 152 -1.2700382        dabbin
## 138 -1.2669405          live
## 134 -1.2494330           fun
## 186 -1.2023340          beat
## 74  -1.1929466       dabonem
## 78  -1.1896128        nation
## 3   -1.1642903          bank
## 15  -1.0961067 carolinapanth
## 31  -1.0725792         readi
## 95  -1.0624633          blue
## 100 -1.0188258  panthersprid
## 205 -0.9977203       tailgat
## 194 -0.9619470         olsen
## 5   -0.9194302       america
## 172 -0.9169857           may
## 144 -0.9037153         night
## 9   -0.8817934      charlott
## 191 -0.8737298       victori
## 117 -0.8695501     camnewton
## 94  -0.8413751         camvp
## 87  -0.8209091          citi
## 148 -0.8145953           end

paste(df$word[1:30], collapse=", ")

## [1] "honor, fpyzyv7wxl, repost, ticket, drink, uptown, photo, rivera, dabbin, live, fun, beat, dabonem, nation, bank, carolinapanth, readi, blue, panthersprid, tailgat, olsen, america, may, night, charlott, victori, camnewton, camvp, citi, end"

df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)

##          coef            word
## 150 1.5359317            yeah
## 92  1.2184067             lol
## 136 1.1583116         carvsaz
## 166 1.1511382 mymaddenpredict
## 167 1.1511338     eamaddennfl
## 112 1.0569744           gonna
## 180 1.0164001             yes
## 88  0.9834244          sbvote
## 114 0.9656601           first
## 176 0.9566095           excit
## 177 0.9526053             man
## 98  0.9368038          bronco
## 47  0.9175863            play
## 189 0.8932162          offens
## 145 0.8407634  carpanthersnew
## 82  0.8236487      sportscent
## 127 0.8156674            need
## 129 0.7794168            babi
## 79  0.7548971             mvp
## 146 0.7289265            lost
## 190 0.6847048              ok
## 67  0.6814183            make
## 86  0.6754543       touchdown
## 61  0.6610255        tomorrow
## 123 0.6379759            call
## 181 0.6285612           giant
## 156 0.6230837            hell
## 89  0.6033112           start
## 182 0.5965314             ass
## 65  0.5615138             got

paste(df$word[1:30], collapse=", ")

## [1] "yeah, lol, carvsaz, mymaddenpredict, eamaddennfl, gonna, yes, sbvote, first, excit, man, bronco, play, offens, carpanthersnew, sportscent, need, babi, mvp, lost, ok, make, touchdown, tomorrow, call, giant, hell, start, ass, got"

BONUS. Use the function heatmapPlot to run a Google maps on the point Tweets.

Why are such a large concentration in uptown?
Compared to the roads, why do the points tend to occur on intersections?

heatmapPlot(tweets)

Challenge 2

Ryan Wesslen

July 27, 2016

Supervised machine learning