Anton Antonov
MathematicaVsR at GitHub
November, 2016
This R-Markdown notebook was made for the R part of the MathematicaVsR project “Text analysis of Trump tweets”.
The project is based on the blog post [1]; this R notebook uses the data from [1] and provides statistical extensions or alternatives. For conclusions drawn from those statistics see [1].
Here are the libraries used in this R notebook. In addition to those used in [1], the libraries “vcd” and “arules” are loaded.
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(arules)
library(vcd)
We are not going to repeat the Twitter message ingestion done in [1]; instead we use the ingested data frame provided by [1].
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
#load("./trump_tweets_df.rda")
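The code below refers to a data frame `tweets` with columns `text` and `source`. Following the cleaning steps in [1], it can be derived from the loaded `trump_tweets_df` along these lines (a sketch assuming [1]'s column names, in particular `statusSource` holding the HTML of the posting client):

```r
library(dplyr)
library(tidyr)

# Derive the `tweets` data frame used below, mirroring the cleaning in [1]:
# extract the client name from the HTML in statusSource and keep only
# tweets posted from an Android phone or an iPhone.
tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
```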
This section demonstrates a way to derive word–device associations that is an alternative to the approach in [1]. The association rule learning algorithm Apriori is used through the package “arules”.
First we split the tweet messages into bags of words (baskets).
sres <- strsplit( iconv(tweets$text),"\\s")
sres <- llply( sres, function(x) { x <- unique(x); x[nchar(x)>2] })
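The two lines above split each message on whitespace, deduplicate the tokens, and drop words of at most two characters. A toy illustration (not the project data) of that transformation:

```r
# Toy illustration: split messages into bags of unique words longer
# than two characters, mirroring the step above.
msgs <- c("Make America great again", "So sad, so sad")
bags <- lapply(strsplit(msgs, "\\s"),
               function(x) { x <- unique(x); x[nchar(x) > 2] })
bags[[2]]  # short tokens "So"/"so" are dropped; "sad," kept once
```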
The package “arules” does not work directly with lists of lists (in this case, a list of bags of words, i.e. baskets). We have to derive a binary incidence matrix from the bags of words.
Here we prepend each tweet's device tag to its bag of words and derive a long-form data frame of tweet-index and word pairs:
sresDF <-
  ldply( 1:length(sres), function(i) {
    data.frame( index = i, word = c( tweets$source[i], sres[[i]] ) )
  })
Next we find the contingency matrix for index vs. word:
wordsCT <- xtabs( ~ index + word, sresDF, sparse = TRUE)
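The `xtabs` call cross-tabulates the long-form pairs into a sparse tweet-by-word incidence matrix. A small self-contained example of the same step:

```r
# Toy illustration of long-form (index, word) pairs -> sparse
# incidence matrix, as done above for the tweets.
df <- data.frame(index = c(1, 1, 2), word = c("win", "big", "win"))
ct <- xtabs(~ index + word, df, sparse = TRUE)
dim(ct)  # 2 tweets x 2 distinct words
```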
At this point we can use the Apriori algorithm of the package:
rulesRes <- apriori( as.matrix(wordsCT), parameter = list(supp = 0.01, conf = 0.6, maxlen = 2, target = "rules"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.6 0.1 1 none FALSE TRUE 5 0.01 1 2 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 13
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[6572 item(s), 1390 transaction(s)] done [0.00s].
sorting and recoding items ... [184 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2
Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!
done [0.00s].
writing ... [171 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
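For reading the rule tables below, it helps to recall how the three metrics are defined. With 1390 transactions (tweets), the minimum support 0.01 translates into the "absolute minimum support count: 13" reported above. The following sketch hand-computes the metrics for a hypothetical rule {word} => {Android}; only `n` comes from the output above, the other counts are made-up illustrations (`n_a` is assumed from the reciprocal of the lift values shown for confidence-1.0 rules):

```r
# Hand-computing apriori's rule metrics for a hypothetical rule
# {word} => {Android}.
n    <- 1390  # tweets (transactions), from the apriori output
n_w  <- 20    # tweets containing the word (hypothetical)
n_a  <- 762   # tweets sent from Android (assumed, ~ n / 1.824)
n_wa <- 18    # tweets containing both (hypothetical)

support    <- n_wa / n               # P(word & Android)
confidence <- n_wa / n_w             # P(Android | word)
lift       <- confidence / (n_a / n) # confidence vs. the Android base rate
round(c(support = support, confidence = confidence, lift = lift), 4)
```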
Here are association rules for “Android” sorted by confidence in descending order:
inspect( subset( sort(rulesRes, by="confidence"), subset = rhs %in% "Android" & confidence > 0.78) )
lhs rhs support confidence lift
[1] {A.M.} => {Android} 0.01007194 1.0000000 1.824147
[2] {@megynkelly} => {Android} 0.01726619 1.0000000 1.824147
[3] {@realDonaldTrump} => {Android} 0.08057554 0.9911504 1.808004
[4] {Wow,} => {Android} 0.01510791 0.9545455 1.741231
[5] {time} => {Android} 0.01366906 0.9500000 1.732940
[6] {done} => {Android} 0.01223022 0.9444444 1.722805
[7] {over} => {Android} 0.01079137 0.9375000 1.710138
[8] {president} => {Android} 0.01007194 0.9333333 1.702537
[9] {because} => {Android} 0.01870504 0.9285714 1.693851
[10] {@CNN} => {Android} 0.01726619 0.9230769 1.683828
[11] {were} => {Android} 0.01510791 0.9130435 1.665526
[12] {beat} => {Android} 0.01366906 0.9047619 1.650419
[13] {U.S.} => {Android} 0.01294964 0.9000000 1.641732
[14] {win} => {Android} 0.01870504 0.8965517 1.635442
[15] {big} => {Android} 0.01798561 0.8928571 1.628703
[16] {against} => {Android} 0.01798561 0.8928571 1.628703
[17] {said} => {Android} 0.02230216 0.8857143 1.615673
[18] {made} => {Android} 0.01079137 0.8823529 1.609541
[19] {won} => {Android} 0.01007194 0.8750000 1.596129
[20] {being} => {Android} 0.01007194 0.8750000 1.596129
[21] {country} => {Android} 0.01510791 0.8750000 1.596129
[22] {had} => {Android} 0.01942446 0.8709677 1.588773
[23] {job} => {Android} 0.01438849 0.8695652 1.586215
[24] {Republican} => {Android} 0.02302158 0.8648649 1.577641
[25] {than} => {Android} 0.02230216 0.8611111 1.570793
[26] {@nytimes} => {Android} 0.01294964 0.8571429 1.563555
[27] {media} => {Android} 0.02158273 0.8571429 1.563555
[28] {vote} => {Android} 0.01654676 0.8518519 1.553903
[29] {You} => {Android} 0.01223022 0.8500000 1.550525
[30] {more} => {Android} 0.02446043 0.8500000 1.550525
[31] {jobs} => {Android} 0.01079137 0.8333333 1.520122
[32] {but} => {Android} 0.03165468 0.8301887 1.514386
[33] {would} => {Android} 0.02733813 0.8260870 1.506904
[34] {very} => {Android} 0.03381295 0.8245614 1.504121
[35] {America} => {Android} 0.01007194 0.8235294 1.502239
[36] {got} => {Android} 0.01007194 0.8235294 1.502239
[37] {ever} => {Android} 0.01294964 0.8181818 1.492484
[38] {total} => {Android} 0.01294964 0.8181818 1.492484
[39] {Sanders} => {Android} 0.01582734 0.8148148 1.486342
[40] {totally} => {Android} 0.01870504 0.8125000 1.482119
[41] {@FoxNews} => {Android} 0.01798561 0.8064516 1.471086
[42] {Bernie} => {Android} 0.02374101 0.8048780 1.468216
[43] {Trump} => {Android} 0.04388489 0.8026316 1.464118
[44] {are} => {Android} 0.06402878 0.8018018 1.462604
[45] {that} => {Android} 0.08561151 0.7986577 1.456869
[46] {Ted} => {Android} 0.02517986 0.7954545 1.451026
[47] {what} => {Android} 0.01654676 0.7931034 1.446737
[48] {wants} => {Android} 0.01079137 0.7894737 1.440116
[49] {just} => {Android} 0.03237410 0.7894737 1.440116
[50] {much} => {Android} 0.01582734 0.7857143 1.433258
And here are association rules for “iPhone” sorted by confidence in descending order:
iphRules <- inspect( subset( sort(rulesRes, by="confidence"), subset = rhs %in% "iPhone" & support > 0.01) )
lhs rhs support confidence lift
[1] {#TrumpPence16} => {iPhone} 0.01007194 1.0000000 2.213376
[2] {THANK} => {iPhone} 0.01223022 1.0000000 2.213376
[3] {#ImWithYou} => {iPhone} 0.01366906 1.0000000 2.213376
[4] {#VoteTrump} => {iPhone} 0.01582734 1.0000000 2.213376
[5] {#AmericaFirst} => {iPhone} 0.01942446 1.0000000 2.213376
[6] {Join} => {iPhone} 0.02733813 1.0000000 2.213376
[7] {#Trump2016} => {iPhone} 0.12302158 0.9500000 2.102707
[8] {#CrookedHillary} => {iPhone} 0.01151079 0.9411765 2.083177
[9] {soon!} => {iPhone} 0.01151079 0.9411765 2.083177
[10] {#MakeAmericaGreatAgain} => {iPhone} 0.06546763 0.9100000 2.014172
[11] {#MAGA} => {iPhone} 0.01151079 0.8888889 1.967445
[12] {Thank} => {iPhone} 0.12086331 0.7850467 1.737603
[13] {you} => {iPhone} 0.11151079 0.7142857 1.580983
[14] {tonight} => {iPhone} 0.01366906 0.6785714 1.501934
[15] {AGAIN!} => {iPhone} 0.01798561 0.6410256 1.418831
[16] {New} => {iPhone} 0.02086331 0.6304348 1.395389
[17] {you!} => {iPhone} 0.02446043 0.6296296 1.393607
[18] {&} => {iPhone} 0.03669065 0.6219512 1.376612
Generally speaking, the package “arules” is somewhat awkward to use. For example, extracting the words of the column “lhs” requires some wrangling:
ws <- as.character(iphRules$lhs)
gsub("\\{|\\}", "", ws)
[1] "#TrumpPence16" "THANK" "#ImWithYou" "#VoteTrump" "#AmericaFirst" "Join"
[7] "#Trump2016" "#CrookedHillary" "soon!" "#MakeAmericaGreatAgain" "#MAGA" "Thank"
[13] "you" "tonight" "AGAIN!" "New" "you!" "&"
[1] David Robinson, “Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half”, (2016), VarianceExplained.org.