Descriptive analyses of text

Our goal is to learn how to find and analyze a specific set of Tweets: Tweets about the Carolina Panthers.

The dataset is a 20% sample of all geo-located Tweets from Charlotte over a three-month period (December 1, 2015 to February 29, 2016).

  1. Read in the data and source the functions.R file, which contains pre-created helper functions.
# remember to set your working directory with setwd()
raw.tweets <- read.csv("../datasets/CharlotteTweets20Sample.csv", stringsAsFactors = F)
source("../functions.R")
timePlot(raw.tweets, smooth = TRUE)
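
It is worth glancing at the data before moving on. A minimal sanity check, using only the body and geo.type columns that this exercise relies on later (other column names in your CSV may differ):

dim(raw.tweets)             # how many Tweets and variables were read in
head(raw.tweets$body, 3)    # the Tweet text used throughout the exercise
table(raw.tweets$geo.type)  # geolocation type ("Point" vs. "Polygon"), used for grouping later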

  2. Save the most common hashtags (getCommonHashtags) and the most common handles (getCommonHandles); make sure to use the Tweet text (body) as the input. What are the top 10 hashtags and handles? (A sketch of what these helper functions might look like follows the output below.)
hashtags <- getCommonHashtags(raw.tweets$body)

head(hashtags, 20)
## hashtags
## #KeepPounding    #Charlotte           #NC   #realestate    #charlotte 
##           487           458           443           429           256 
## #keeppounding      #traffic       #trndnl        #photo      #listing 
##           234           212           205           185           159 
##          #clt       #Repost      #realtor         #CIAA         #SB50 
##           154           144           133           123           109 
##     #Panthers     #CIAA2016      #Concord          #CLT     #panthers 
##            95            89            88            86            86
handles <- getCommonHandles(raw.tweets$body)

head(handles, 20)
## handles
##      @cltairport        @Panthers        @midnight        @panthers 
##              218              212              180              130 
##      @EthanDolan            @ness         @hornets    @GraysonDolan 
##              117               80               78               77 
##    @SportsCenter       @F3theFort @realDonaldTrump     @nodabrewing 
##               69               64               64               63 
##    @CampersHaven   @marcuslemonis       @F3Isotope      @marielawtf 
##               58               52               48               43 
##          @oakley @LifeTimeFitness  @ChickenNGreens            @wcnc 
##               39               38               36               36
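
The helper functions themselves live in functions.R and are not reproduced here. As a rough illustration of what a counter like getCommonHashtags might be doing, here is a minimal sketch (hypothetical; the real function may differ): extract every #tag with a regular expression and tabulate the results.

# Hypothetical re-implementation for illustration only; the real
# getCommonHashtags() is defined in functions.R and may behave differently.
getCommonHashtags.sketch <- function(text) {
  tags <- unlist(regmatches(text, gregexpr("#[[:alnum:]_]+", text)))
  sort(table(tags), decreasing = TRUE)
}
head(getCommonHashtags.sketch(raw.tweets$body), 10)

A handle counter such as getCommonHandles would follow the same pattern with an @-based regular expression.
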
  3. Choose hashtags and handles that you think will best identify Carolina Panther Tweets.

Determine how many Panther Tweets you can identify for this period. Are these all Panther Tweets? (A quick spot-check is sketched after the code below.)

# search terms, kept lowercase so they match the lower-cased Tweet text below
panthers <- c("panther","keeppounding","panthernation","cameronnewton","lukekuechly","cam newton","thomas davis","greg olsen","kuechly","sb50","super bowl","sbvote","superbowl","keep pounding","camvp","josh norman")

# flag the Tweets whose text contains any of the search terms
hit <- grepl(paste(panthers, collapse = "|"), tolower(raw.tweets$body))

# create a dataset with only the tweets that contain these words
panther.tweets <- raw.tweets[hit,]

nrow(panther.tweets)
## [1] 2048
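
To judge whether these 2,048 matches are really all about the Panthers, one option is simply to read a random handful of them; a quick spot-check (set.seed only makes the sample reproducible):

set.seed(1)
sample(panther.tweets$body, 5)   # read a few matched Tweets and judge for yourself

Broad terms such as "panther" or "super bowl" may also pick up Tweets unrelated to the team, so treat the matches as an over-inclusive first pass.
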
  4. Create a dfm from the body variable of the Panther Tweets (panther.tweets) using the quanteda package. Add in the geo.type covariate (hint: use the docvars function after you create your corpus).
library(quanteda); library(RColorBrewer)

corpus <- corpus(panther.tweets$body)
docvars(corpus) <- data.frame(geo.type = panther.tweets$geo.type)

dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2,048 documents
##    ... indexing features: 20,517 feature types
##    ... removed 8,803 features, from 182 supplied (glob) feature types
##    ... created a 2048 x 11715 sparse dfm
##    ... complete. 
## Elapsed time: 0.702 seconds.
topfeatures(dfm)
## #keeppounding      panthers             @     @panthers          bank 
##           732           572           479           341           218 
##        @_bank       stadium         super       america     #panthers 
##           209           194           192           191           184
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)
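
Note that the dfm() call above, with its ignoredFeatures and ngrams arguments and the plot() word cloud, uses an older quanteda interface. If you are on a recent quanteda release, a rough equivalent, assuming quanteda >= 2.0 with the quanteda.textplots package installed, is sketched below:

library(quanteda.textplots)
toks <- tokens(corpus, remove_punct = TRUE, remove_url = TRUE)  # the default tokenizer keeps #hashtags and @handles intact
toks <- tokens_remove(toks, c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"))
toks <- tokens_ngrams(toks, n = 1:2)   # unigrams and bigrams, as above
dfm.new <- dfm(toks)
topfeatures(dfm.new)
textplot_wordcloud(dfm.new, max_words = 100, color = brewer.pal(8, "Dark2"))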

  5. Rerun the dfm, but this time add geo.type as a group. Plot a comparison word cloud (comparison = TRUE). This may take a few minutes.
pnthrdfm <- dfm(corpus, groups = "geo.type", ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
##    ... grouping texts by variable: geo.type
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2 documents
##    ... indexing features: 22,419 feature types
##    ... removed 9,343 features, from 182 supplied (glob) feature types
##    ... created a 2 x 13077 sparse dfm
##    ... complete. 
## Elapsed time: 9.638 seconds.
plot(pnthrdfm, comparison = TRUE, rot.per=0, scale=c(3.5, .75), max.words=100)
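
On a newer quanteda release the grouping step also changes: grouping is applied to the dfm with dfm_group() rather than inside dfm(). A sketch, reusing the toks object from the earlier sketch:

pnthrdfm.new <- dfm_group(dfm(toks), groups = panther.tweets$geo.type)   # one row per geo.type value
textplot_wordcloud(pnthrdfm.new, comparison = TRUE, max_words = 100)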

par(mfrow = c(2,1))   # create a 2 x 1 grid for the plots
timePlot(panther.tweets[panther.tweets$geo.type == "Point",], smooth = FALSE)

timePlot(panther.tweets[panther.tweets$geo.type == "Polygon",], smooth = FALSE)

par(mfrow = c(1,1))   # reset the plotting grid
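
To help interpret the two time series, it is useful to know how many Panther Tweets fall into each geolocation type; a quick check:

table(panther.tweets$geo.type)   # number of Tweets with "Point" vs. "Polygon" geolocation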

BONUS. Additional pre-processing steps can sometimes improve results. Rerun the dfm with additional pre-processing: stemming, Twitter mode, and higher-order n-grams. See ?dfm for the full list of parameters.

dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), stem = TRUE, twitter = TRUE, ngrams=c(1,3))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2,048 documents
##    ... indexing features: 23,850 feature types
##    ... removed 13,691 features, from 182 supplied (glob) feature types
##    ... stemming features (English), trimmed 497 feature variants
##    ... created a 2048 x 9663 sparse dfm
##    ... complete. 
## Elapsed time: 0.988 seconds.
topfeatures(dfm)
## #keeppound    panther          @   @panther       bank    stadium 
##        732        676        479        341        218        194 
##      super    america         go       bowl 
##        192        191        187        186
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)
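
As with the earlier steps, the stem, twitter, and ngrams arguments shown above belong to the older dfm() interface. Roughly the same preprocessing in a recent quanteda release, again only as a sketch:

toks3 <- tokens(corpus, remove_punct = TRUE, remove_url = TRUE)  # the default tokenizer preserves #hashtags and @handles (the old "twitter" mode)
toks3 <- tokens_remove(toks3, c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"))
toks3 <- tokens_wordstem(toks3)              # stemming
toks3 <- tokens_ngrams(toks3, n = c(1, 3))   # unigrams and trigrams, matching ngrams=c(1,3) above
topfeatures(dfm(toks3))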