Descriptive analyses of text

Our goal is to learn how to find and analyze a specific set of Tweets: Tweets about the Carolina Panthers.

The dataset is a 20% sample of all geo-located Tweets from Charlotte over a three-month period (December 1, 2015 to February 29, 2016).

  1. Read in the data and source the functions.R file, which contains pre-created helper functions.
# remember to set your working directory with setwd()
raw.tweets <- read.csv("../datasets/CharlotteTweets20Sample.csv", stringsAsFactors = F)
source("../functions.R")
timePlot(raw.tweets, smooth = TRUE)
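
It is worth glancing at the data before moving on. A minimal sanity check, using only the body and geo.type columns that this exercise relies on later (other column names in your CSV may differ):

dim(raw.tweets)             # how many Tweets and variables were read in
head(raw.tweets$body, 3)    # the Tweet text used throughout the exercise
table(raw.tweets$geo.type)  # geolocation type ("Point" vs. "Polygon"), used for grouping later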

  2. Save the most common hashtags (getCommonHashtags) and the most common handles (getCommonHandles); make sure to use the Tweet text (body) as the input. What are the top 10 hashtags and handles? (A sketch of what these helper functions might look like follows the output below.)
hashtags <- getCommonHashtags(raw.tweets$body)

head(hashtags, 20)
## hashtags
## #KeepPounding    #Charlotte           #NC   #realestate    #charlotte 
##           487           458           443           429           256 
## #keeppounding      #traffic       #trndnl        #photo      #listing 
##           234           212           205           185           159 
##          #clt       #Repost      #realtor         #CIAA         #SB50 
##           154           144           133           123           109 
##     #Panthers     #CIAA2016      #Concord          #CLT     #panthers 
##            95            89            88            86            86
handles <- getCommonHandles(raw.tweets$body)

head(handles, 20)
## handles
##      @cltairport        @Panthers        @midnight        @panthers 
##              218              212              180              130 
##      @EthanDolan            @ness         @hornets    @GraysonDolan 
##              117               80               78               77 
##    @SportsCenter       @F3theFort @realDonaldTrump     @nodabrewing 
##               69               64               64               63 
##    @CampersHaven   @marcuslemonis       @F3Isotope      @marielawtf 
##               58               52               48               43 
##          @oakley @LifeTimeFitness  @ChickenNGreens            @wcnc 
##               39               38               36               36
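
The helper functions themselves live in functions.R and are not reproduced here. As a rough illustration of what a counter like getCommonHashtags might be doing, here is a minimal sketch (hypothetical; the real function may differ): extract every #tag with a regular expression and tabulate the results.

# Hypothetical re-implementation for illustration only; the real
# getCommonHashtags() is defined in functions.R and may behave differently.
getCommonHashtags.sketch <- function(text) {
  tags <- unlist(regmatches(text, gregexpr("#[[:alnum:]_]+", text)))
  sort(table(tags), decreasing = TRUE)
}
head(getCommonHashtags.sketch(raw.tweets$body), 10)

A handle counter such as getCommonHandles would follow the same pattern with an @-based regular expression.
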
  3. Choose hashtags and handles that you think will best identify Carolina Panther Tweets.

Determine how many Panther Tweets you can identify for this period. Are these all Panther Tweets? (A quick spot-check is sketched after the code below.)

# search terms, kept lowercase so they match the lower-cased Tweet text below
panthers <- c("panther","keeppounding","panthernation","cameronnewton","lukekuechly","cam newton","thomas davis","greg olsen","kuechly","sb50","super bowl","sbvote","superbowl","keep pounding","camvp","josh norman")

# flag the Tweets whose text contains any of the search terms
hit <- grepl(paste(panthers, collapse = "|"), tolower(raw.tweets$body))

# create a dataset with only the tweets that contain these words
panther.tweets <- raw.tweets[hit,]

nrow(panther.tweets)
## [1] 2048
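
To judge whether these 2,048 matches are really all about the Panthers, one option is simply to read a random handful of them; a quick spot-check (set.seed only makes the sample reproducible):

set.seed(1)
sample(panther.tweets$body, 5)   # read a few matched Tweets and judge for yourself

Broad terms such as "panther" or "super bowl" may also pick up Tweets unrelated to the team, so treat the matches as an over-inclusive first pass.
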
  4. Create a dfm from the body variable of the Panther Tweets (panther.tweets) using the quanteda package. Add in the geo.type covariate (hint: use the docvars function after you create your corpus).
library(quanteda); library(RColorBrewer)

corpus <- corpus(panther.tweets$body)
docvars(corpus) <- data.frame(geo.type = panther.tweets$geo.type)

dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2,048 documents
##    ... indexing features: 20,517 feature types
##    ... removed 8,803 features, from 182 supplied (glob) feature types
##    ... created a 2048 x 11715 sparse dfm
##    ... complete. 
## Elapsed time: 0.702 seconds.
topfeatures(dfm)
## #keeppounding      panthers             @     @panthers          bank 
##           732           572           479           341           218 
##        @_bank       stadium         super       america     #panthers 
##           209           194           192           191           184
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)
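
Note that the dfm() call above, with its ignoredFeatures and ngrams arguments and the plot() word cloud, uses an older quanteda interface. If you are on a recent quanteda release, a rough equivalent, assuming quanteda >= 2.0 with the quanteda.textplots package installed, is sketched below:

library(quanteda.textplots)
toks <- tokens(corpus, remove_punct = TRUE, remove_url = TRUE)  # the default tokenizer keeps #hashtags and @handles intact
toks <- tokens_remove(toks, c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"))
toks <- tokens_ngrams(toks, n = 1:2)   # unigrams and bigrams, as above
dfm.new <- dfm(toks)
topfeatures(dfm.new)
textplot_wordcloud(dfm.new, max_words = 100, color = brewer.pal(8, "Dark2"))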

  5. Rerun the dfm, but this time add geo.type as a group. Plot a comparison word cloud (comparison = TRUE). This may take a few minutes.
pnthrdfm <- dfm(corpus, groups = "geo.type", ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), ngrams=c(1,2))
## Creating a dfm from a corpus ...
##    ... grouping texts by variable: geo.type
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2 documents
##    ... indexing features: 22,419 feature types
##    ... removed 9,343 features, from 182 supplied (glob) feature types
##    ... created a 2 x 13077 sparse dfm
##    ... complete. 
## Elapsed time: 9.638 seconds.
plot(pnthrdfm, comparison = TRUE, rot.per=0, scale=c(3.5, .75), max.words=100)
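
On a newer quanteda release the grouping step also changes: grouping is applied to the dfm with dfm_group() rather than inside dfm(). A sketch, reusing the toks object from the earlier sketch:

pnthrdfm.new <- dfm_group(dfm(toks), groups = panther.tweets$geo.type)   # one row per geo.type value
textplot_wordcloud(pnthrdfm.new, comparison = TRUE, max_words = 100)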

par(mfrow = c(2,1))   # create a 2 x 1 grid for the plots
timePlot(panther.tweets[panther.tweets$geo.type == "Point",], smooth = FALSE)

timePlot(panther.tweets[panther.tweets$geo.type == "Polygon",], smooth = FALSE)

par(mfrow = c(1,1))   # reset the plotting grid
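
To help interpret the two time series, it is useful to know how many Panther Tweets fall into each geolocation type; a quick check:

table(panther.tweets$geo.type)   # number of Tweets with "Point" vs. "Polygon" geolocation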

BONUS. Additional pre-processing steps can sometimes improve results. Rerun the dfm with additional pre-processing: stemming, Twitter mode, and higher-order n-grams. See ?dfm for the full list of parameters.

dfm <- dfm(corpus, ignoredFeatures = c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), stem = TRUE, twitter = TRUE, ngrams=c(1,3))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 2,048 documents
##    ... indexing features: 23,850 feature types
##    ... removed 13,691 features, from 182 supplied (glob) feature types
##    ... stemming features (English), trimmed 497 feature variants
##    ... created a 2048 x 9663 sparse dfm
##    ... complete. 
## Elapsed time: 0.988 seconds.
topfeatures(dfm)
## #keeppound    panther          @   @panther       bank    stadium 
##        732        676        479        341        218        194 
##      super    america         go       bowl 
##        192        191        187        186
plot(dfm, rot.per=0, scale=c(3.5, .75), colors=brewer.pal(8, "Dark2"), max.words=100)
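
As with the earlier steps, the stem, twitter, and ngrams arguments shown above belong to the older dfm() interface. Roughly the same preprocessing in a recent quanteda release, again only as a sketch:

toks3 <- tokens(corpus, remove_punct = TRUE, remove_url = TRUE)  # the default tokenizer preserves #hashtags and @handles (the old "twitter" mode)
toks3 <- tokens_remove(toks3, c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"))
toks3 <- tokens_wordstem(toks3)              # stemming
toks3 <- tokens_ngrams(toks3, n = c(1, 3))   # unigrams and trigrams, matching ngrams=c(1,3) above
topfeatures(dfm(toks3))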