Wouter van Atteveldt
Session 4: Corpus analysis: Comparing and clustering
Thursday: Introduction to R
Friday: Corpus Analysis & Topic Modeling
corpustools::corpora.compare
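library(corpustools)
library(dplyr)  # provides arrange(), used below to sort the comparison
library(knitr)  # provides kable(), used below to print the table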
data(sotu)
obama = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
# Document-term matrix for the Obama speeches, restricted to the selected POS tags
dtm_o = with(subset(sotu.tokens, aid %in% obama & pos1 %in% c("N", "A", "M")),
             dtm.create(aid, lemma))
dtm.wordcloud(dtm_o)
# Same selection for all remaining speeches, which serve as the reference corpus
dtm_b = with(subset(sotu.tokens, !(aid %in% obama) & pos1 %in% c("N", "A", "M")),
             dtm.create(aid, lemma))
cmp = corpora.compare(dtm_o, dtm_b)
cmp = arrange(cmp, -chi)
kable(head(cmp))
term | termfreq.x | termfreq.y | termfreq | relfreq.x | relfreq.y | over | chi |
---|---|---|---|---|---|---|---|
job | 200 | 56 | 256 | 0.0195351 | 0.0051090 | 3.3614321 | 92.34135 |
terrorist | 13 | 103 | 116 | 0.0012698 | 0.0093970 | 0.2183120 | 64.24944 |
freedom | 8 | 79 | 87 | 0.0007814 | 0.0072074 | 0.2170491 | 53.48220 |
Iraq | 15 | 94 | 109 | 0.0014651 | 0.0085759 | 0.2574317 | 52.32461 |
terror | 0 | 55 | 55 | 0.0000000 | 0.0050178 | 0.1661740 | 51.50577 |
business | 109 | 31 | 140 | 0.0106466 | 0.0028282 | 3.0423131 | 49.32315 |
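The `over` and `chi` columns can be reproduced by hand from the printed numbers, which helps when interpreting them. The sketch below is not the corpustools implementation. It assumes, based only on the table above, that `over` is a ratio of relative frequencies with 0.001 added to both sides, and that `chi` is the uncorrected Pearson chi-squared statistic for a 2x2 table of term versus corpus. The corpus sizes `n_x` and `n_y` are not printed anywhere; they are inferred here from `termfreq / relfreq`.

```r
# Rows copied from the comparison table above
cmp_top <- data.frame(
  term       = c("job", "terrorist", "freedom", "Iraq", "terror", "business"),
  termfreq.x = c(200, 13, 8, 15, 0, 109),
  termfreq.y = c(56, 103, 79, 94, 55, 31),
  relfreq.x  = c(0.0195351, 0.0012698, 0.0007814, 0.0014651, 0.0000000, 0.0106466),
  relfreq.y  = c(0.0051090, 0.0093970, 0.0072074, 0.0085759, 0.0050178, 0.0028282)
)

# Assumption: smoothing constant inferred from the table, not taken from the package
smooth <- 0.001
cmp_top$over <- (cmp_top$relfreq.x + smooth) / (cmp_top$relfreq.y + smooth)

# Assumption: corpus sizes inferred from termfreq / relfreq for the "job" row
n_x <- round(200 / 0.0195351)
n_y <- round(56 / 0.0051090)

# Pearson chi-squared for a 2x2 table (term vs. other terms, corpus x vs. corpus y),
# without continuity correction
chi2 <- function(freq_x, freq_y, n_x, n_y) {
  rest_x <- n_x - freq_x
  rest_y <- n_y - freq_y
  n <- n_x + n_y
  n * (freq_x * rest_y - freq_y * rest_x)^2 /
    ((freq_x + freq_y) * (rest_x + rest_y) * n_x * n_y)
}
cmp_top$chi <- chi2(cmp_top$termfreq.x, cmp_top$termfreq.y, n_x, n_y)

print(cmp_top[, c("term", "over", "chi")])
```

Values of `over` above 1 mark terms that are relatively more frequent in the first corpus (here Obama), values below 1 mark terms more frequent in the second. The smoothing term keeps the ratio finite for words such as "terror" that do not occur in one corpus at all.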
with(head(cmp, 100),
     plotWords(x = log(over), words = term, wordfreq = chi, random.y = TRUE))
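To see why `log(over)` goes on the horizontal axis, it can help to build a rough version of the plot in base R. The sketch below is only an approximation of the idea and makes no claim about how `plotWords` works internally; the scaling of text size (`cex`) and the use of `runif` for the vertical position are assumptions made for illustration. It uses the six rows from the table rather than the top 100 terms.

```r
# Rows copied from the comparison table above
cmp_top <- data.frame(
  term = c("job", "terrorist", "freedom", "Iraq", "terror", "business"),
  over = c(3.3614321, 0.2183120, 0.2170491, 0.2574317, 0.1661740, 3.0423131),
  chi  = c(92.34135, 64.24944, 53.48220, 52.32461, 51.50577, 49.32315)
)

# Taking the log centres the ratio on zero:
# positive = overrepresented in the first corpus, negative = underrepresented
cmp_top$x <- log(cmp_top$over)

# Random vertical position, only there to spread the words out
set.seed(1)
cmp_top$y <- runif(nrow(cmp_top))

# Hypothetical size scaling: larger chi-squared gives larger text
cmp_top$cex <- 1 + 2 * cmp_top$chi / max(cmp_top$chi)

plot(cmp_top$x, cmp_top$y, type = "n",
     xlab = "log(over)", ylab = "", yaxt = "n")
abline(v = 0, lty = 2)  # dividing line between the two corpora
text(cmp_top$x, cmp_top$y, labels = cmp_top$term, cex = cmp_top$cex)
```

On the log scale a term that is three times overrepresented and one that is three times underrepresented sit at the same distance from zero, so both corpora get equal room in the plot.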