Our goal is to learn how to find and analyze specific Tweets: Carolina Panthers Tweets.
The dataset is a 20% sample of all Charlotte geo-located Tweets in a three-month period (Dec 1 2015 to Feb 29 2016).
Load the pre-created functions in functions.R. First, plot the dataset using the timePlot function. Add the parameter smooth = TRUE to include a smoothing line.
What is causing the spikes?
# remember to set your working directory setwd()
raw.tweets <- read.csv("../datasets/CharlotteTweets20Sample.csv", stringsAsFactors = F)
source("../functions.R")
timePlot(raw.tweets, smooth = TRUE)
Find the most common hashtags with the getCommonHashtags function; make sure to use the Tweet text (body) as the input. What are the top 10 hashtags? Repeat the same exercise with the handle function (getCommonHandles).
Modify your calls to show the top 25 handles and hashtags.
hashtags <- getCommonHashtags(raw.tweets$body)
head(hashtags, 25)
## hashtags
## #KeepPounding #Charlotte #NC #realestate #charlotte
## 487 458 443 429 256
## #keeppounding #traffic #trndnl #photo #listing
## 234 212 205 185 159
## #clt #Repost #realtor #CIAA #SB50
## 154 144 133 123 109
## #Panthers #CIAA2016 #Concord #CLT #panthers
## 95 89 88 86 86
handles <- getCommonHandles(raw.tweets$body)
head(handles, 25)
## handles
## @cltairport @Panthers @midnight @panthers
## 218 212 180 130
## @EthanDolan @ness @hornets @GraysonDolan
## 117 80 78 77
## @SportsCenter @F3theFort @realDonaldTrump @nodabrewing
## 69 64 64 63
## @CampersHaven @marcuslemonis @F3Isotope @marielawtf
## 58 52 48 43
## @oakley @LifeTimeFitness @ChickenNGreens @wcnc
## 39 38 36 36
Put the hashtags and handles you want to match into a character vector: names <- c("keyword1","keyword2"). Use paste(names, collapse = "|") to create a single string of the keywords (with | acting as an OR). Then use the grepl function on the lower-cased Tweet text to create a logical vector indicating which rows meet your criteria.
Use that vector as a row filter on the original dataset. Run your new Panthers dataset through timePlot.
Determine how many Panthers Tweets you can identify for this period. Are these all Panthers Tweets?
# keywords are lower case because we match against the lower-cased Tweet text
panthers <- c("panther","keeppounding","panthernation","cameronnewton","lukekuechly","cam newton ","thomas davis","greg olsen","kuechly","sb50","super bowl","sbvote","superbowl","keep pounding","camvp","josh norman")
# find only the Tweets that contain words in the keyword list
hit <- grepl(paste(panthers, collapse = "|"), tolower(raw.tweets$body))
# create a dataset with only the tweets that contain these words
panther.tweets <- raw.tweets[hit,]
nrow(panther.tweets)
## [1] 2048
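To judge whether these are all Panthers Tweets, one quick check (a sketch, not part of the original exercise) is to count how many Tweets each keyword matches on its own; broad terms such as "super bowl" may pull in non-Panthers Tweets.

```r
# count matches per keyword to spot overly broad terms;
# fixed = TRUE treats each keyword as a literal string, not a regex
keyword.hits <- sapply(panthers, function(k) {
  sum(grepl(k, tolower(raw.tweets$body), fixed = TRUE))
})
sort(keyword.hits, decreasing = TRUE)
```

Keywords with large counts but weak Panthers association are candidates for removal or refinement.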
Create a corpus from the Panthers Tweets using the quanteda package. Add in the covariate (hint: use the docvars function after you create your corpus). Run basic pre-processing, get the top 10 word counts (topfeatures function), and create a word cloud.
Rerun the word cloud with tf-idf weighting. What's the difference?
library(quanteda); library(RColorBrewer)
corpus <- corpus(panther.tweets$body)
# use panther.tweets (not raw.tweets) so the covariate length matches the corpus
docvars(corpus) <- data.frame(geo.type = panther.tweets$geo.type)
dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,048 documents
## ... indexing features: 20,517 feature types
## ... removed 8,803 features, from 182 supplied (glob) feature types
## ... created a 2048 x 11715 sparse dfm
## ... complete.
## Elapsed time: 0.702 seconds.
topfeatures(dfm)
## #keeppounding panthers @ @panthers bank
## 732 572 479 341 218
## @_bank stadium super america #panthers
## 209 194 192 191 184
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)
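The question above also asks for a tf-idf weighted word cloud, which the code does not show. A minimal sketch, assuming the same older quanteda API used throughout this document (newer versions replace tfidf() with dfm_tfidf() and plot() with textplot_wordcloud()):

```r
# weight the dfm by tf-idf, then redraw the word cloud; terms that appear
# in almost every Tweet (e.g. "panthers") are down-weighted, so rarer,
# more distinctive terms become more prominent
weighted.dfm <- tfidf(dfm)
plot(weighted.dfm, rot.per = 0, scale = c(3.5, .75),
     colors = brewer.pal(8, "Dark2"), max.words = 100)
```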
Create a comparison word cloud by geo.type: group the dfm with groups = "geo.type" and plot with comparison = TRUE. This may take a few minutes.
pnthrdfm <- dfm(corpus, groups = "geo.type", ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: geo.type
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 22,419 feature types
## ... removed 9,343 features, from 182 supplied (glob) feature types
## ... created a 2 x 13077 sparse dfm
## ... complete.
## Elapsed time: 9.638 seconds.
plot(pnthrdfm, comparison = TRUE, rot.per=0, scale=c(3.5, .75), max.words=100)
par(mfrow = c(2,1)) # create a 2 x 1 matrix for plots
timePlot(panther.tweets[panther.tweets$geo.type == "Point",], smooth = FALSE)
timePlot(panther.tweets[panther.tweets$geo.type == "Polygon",], smooth = FALSE)
par(mfrow = c(1,1)) # reset
BONUS. Additional pre-processing steps can sometimes improve results. Run additional pre-processing steps, including stemming, Twitter mode, and bigrams or trigrams. Use ?dfm to get a list of the parameters.
Rerun your word clouds. What difference did these pre-processing steps make?
Did you find new words to describe Carolina Panthers Tweets?
dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), stem = TRUE, twitter = TRUE, ngrams=c(1,3))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2,048 documents
## ... indexing features: 23,850 feature types
## ... removed 13,691 features, from 182 supplied (glob) feature types
## ... stemming features (English), trimmed 497 feature variants
## ... created a 2048 x 9663 sparse dfm
## ... complete.
## Elapsed time: 0.988 seconds.
topfeatures(dfm)
## #keeppound panther @ @panther bank stadium
## 732 676 479 341 218 194
## super america go bowl
## 192 191 187 186
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)