Use the same dataset from Challenge 1. In this exercise, we’re going to use supervised machine learning to determine the different words used by point Tweets versus polygon Tweet. Our goal is to better understand how Tweet content differs based on geo-location type.
functions.R
. Change the geo.code column to a dummy variable (ifelse
function). We will build a machine learning classifier to determine the words that most characterize each geo-location type.tweets <- read.csv("../datasets/panthertweets.csv", stringsAsFactors = F)
source("../functions.R")
tweets$geo.dummy <- ifelse(tweets$geo.type=="Point", 0, 1)
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
trim
to remove words that are used at least five times (minDoc = 5
).library(quanteda)
twcorpus <- corpus(tweets$body)
twdfm <- dfm(twcorpus, ignoredFeatures=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"),
stem = TRUE, removeTwitter = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1,973 documents
## ... indexing features: 4,831 feature types
## ... removed 148 features, from 181 supplied (glob) feature types
## ... stemming features (English), trimmed 452 feature variants
## ... created a 1973 x 4232 sparse dfm
## ... complete.
## Elapsed time: 0.427 seconds.
twdfm <- trim(twdfm, minDoc = 10)
## Removing features occurring in fewer than 10 documents: 4026
library(glmnet)
require(doMC)
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$geo.dummy[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE,
type.measure="deviance")
plot(ridge)
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$geo.dummy[test])
# confusion matrix
table(preds, tweets$geo.type[test])
##
## preds Point Polygon
## FALSE 168 22
## TRUE 66 139
# performance metrics
accuracy(preds, tweets$geo.type[test])
## [1] 0.7772152
precision(preds, tweets$geo.type[test])
## [1] 0.6780488
recall(preds, tweets$geo.type[test])
## [1] 0.863354
# from the different values of lambda, let's pick the best one
best.lambda <- which(ridge$lambda==ridge$lambda.min)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## panther keeppound bank stadium america carolina
## -0.1673046 0.0930900 -1.1642903 -0.6691043 -0.9194302 -0.7958470
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 203 -1.9076825 honor
## 21 -1.7753592 fpyzyv7wxl
## 85 -1.6492308 repost
## 171 -1.6185362 ticket
## 185 -1.5094207 drink
## 195 -1.4988488 uptown
## 128 -1.4749844 photo
## 158 -1.4290188 rivera
## 152 -1.2700382 dabbin
## 138 -1.2669405 live
## 134 -1.2494330 fun
## 186 -1.2023340 beat
## 74 -1.1929466 dabonem
## 78 -1.1896128 nation
## 3 -1.1642903 bank
## 15 -1.0961067 carolinapanth
## 31 -1.0725792 readi
## 95 -1.0624633 blue
## 100 -1.0188258 panthersprid
## 205 -0.9977203 tailgat
## 194 -0.9619470 olsen
## 5 -0.9194302 america
## 172 -0.9169857 may
## 144 -0.9037153 night
## 9 -0.8817934 charlott
## 191 -0.8737298 victori
## 117 -0.8695501 camnewton
## 94 -0.8413751 camvp
## 87 -0.8209091 citi
## 148 -0.8145953 end
paste(df$word[1:30], collapse=", ")
## [1] "honor, fpyzyv7wxl, repost, ticket, drink, uptown, photo, rivera, dabbin, live, fun, beat, dabonem, nation, bank, carolinapanth, readi, blue, panthersprid, tailgat, olsen, america, may, night, charlott, victori, camnewton, camvp, citi, end"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 150 1.5359317 yeah
## 92 1.2184067 lol
## 136 1.1583116 carvsaz
## 166 1.1511382 mymaddenpredict
## 167 1.1511338 eamaddennfl
## 112 1.0569744 gonna
## 180 1.0164001 yes
## 88 0.9834244 sbvote
## 114 0.9656601 first
## 176 0.9566095 excit
## 177 0.9526053 man
## 98 0.9368038 bronco
## 47 0.9175863 play
## 189 0.8932162 offens
## 145 0.8407634 carpanthersnew
## 82 0.8236487 sportscent
## 127 0.8156674 need
## 129 0.7794168 babi
## 79 0.7548971 mvp
## 146 0.7289265 lost
## 190 0.6847048 ok
## 67 0.6814183 make
## 86 0.6754543 touchdown
## 61 0.6610255 tomorrow
## 123 0.6379759 call
## 181 0.6285612 giant
## 156 0.6230837 hell
## 89 0.6033112 start
## 182 0.5965314 ass
## 65 0.5615138 got
paste(df$word[1:30], collapse=", ")
## [1] "yeah, lol, carvsaz, mymaddenpredict, eamaddennfl, gonna, yes, sbvote, first, excit, man, bronco, play, offens, carpanthersnew, sportscent, need, babi, mvp, lost, ok, make, touchdown, tomorrow, call, giant, hell, start, ass, got"
BONUS. Use the function heatmapPlot
to run a Google maps on the point Tweets.
Why are such a large concentration in uptown?
Compared to the roads, why do the points tend to occur on intersections?
heatmapPlot(tweets)