Text Analysis in R

Wouter van Atteveldt
Session 4: Corpus analysis: Comparing and clustering

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

  • Querying Text with AmCAT & R
  • The Document-Term matrix
  • Comparing Corpora
  • Topic Modeling

Comparing Corpora

  • Contrasts are more informative than raw frequencies
  • Compare speakers, media, periods, …
  • corpustools::corpora.compare

Obama's speeches

library(corpustools)
data(sotu)
obama = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm_o = with(subset(sotu.tokens, aid %in% obama & pos1 %in% c("N", "A", "M")),
           dtm.create(aid, lemma))
dtm.wordcloud(dtm_o)

[Plot: word cloud of the terms in Obama's speeches]
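
To make the document-term matrix idea concrete, here is a minimal base-R sketch on a toy token list. The data frame, its values, and the names `tokens`, `selected`, `toy_dtm`, and `term_freq` are invented for illustration; this is not how `dtm.create` is implemented, only the same idea (rows are documents, columns are terms, cells are counts) on a small scale.

```r
# Toy token list: one row per token, mimicking the columns used from sotu.tokens.
# All ids and lemmas here are made up for illustration.
tokens <- data.frame(
  aid   = c(1, 1, 1, 2, 2, 3),
  lemma = c("job", "business", "job", "freedom", "job", "terror"),
  pos1  = c("N", "N", "N", "N", "N", "N"),
  stringsAsFactors = FALSE
)

# Keep only the selected parts of speech, as in the slide
selected <- subset(tokens, pos1 %in% c("N", "A", "M"))

# Cross-tabulate documents by lemmas: a dense document-term matrix
toy_dtm <- table(doc = selected$aid, term = selected$lemma)

# Corpus-wide term frequencies are the column sums
term_freq <- colSums(toy_dtm)
```

A word cloud is then a plot of `term_freq`, with more frequent terms drawn larger.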

Comparing Corpora

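library(dplyr)   # arrange(); plyr::arrange() behaves the same here
library(knitr)   # kable()

# All other speeches, restricted to the same parts of speech
dtm_b = with(subset(sotu.tokens, !(aid %in% obama) & pos1 %in% c("N", "A", "M")),
           dtm.create(aid, lemma))
cmp = corpora.compare(dtm_o, dtm_b)
cmp = arrange(cmp, -chi)   # most distinctive terms first
kable(head(cmp))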

term       termfreq.x  termfreq.y  termfreq  relfreq.x  relfreq.y       over       chi
job               200          56       256  0.0195351  0.0051090  3.3614321  92.34135
terrorist          13         103       116  0.0012698  0.0093970  0.2183120  64.24944
freedom             8          79        87  0.0007814  0.0072074  0.2170491  53.48220
Iraq               15          94       109  0.0014651  0.0085759  0.2574317  52.32461
terror              0          55        55  0.0000000  0.0050178  0.1661740  51.50577
business          109          31       140  0.0106466  0.0028282  3.0423131  49.32315
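
The `over` and `chi` columns can be reproduced by hand for a single term. The sketch below does this for "job". The corpus sizes and the smoothing constant are assumptions: they are inferred from the numbers in the table (termfreq divided by relfreq, and the `over` value for "terror"), not taken from the `corpora.compare` source, and all variable names are illustrative.

```r
# Term counts for "job", copied from the table
freq_x <- 200   # Obama's speeches
freq_y <- 56    # the other speeches

# ASSUMPTION: corpus sizes inferred from termfreq / relfreq, not stated in the slides
total_x <- 10238
total_y <- 10961

# ASSUMPTION: smoothing constant inferred from the "over" column
smooth <- 0.001

# Smoothed ratio of relative frequencies
relfreq_x <- freq_x / total_x
relfreq_y <- freq_y / total_y
over <- (relfreq_x + smooth) / (relfreq_y + smooth)

# 2x2 contingency table: this term vs. all other terms, by corpus
counts <- matrix(c(freq_x, total_x - freq_x,
                   freq_y, total_y - freq_y), nrow = 2)

# Pearson chi-squared without continuity correction
chi <- unname(chisq.test(counts, correct = FALSE)$statistic)

round(c(over = over, chi = chi), 2)
```

Under these assumptions the result is close to the first row of the table. The smoothing keeps `over` finite for terms such as "terror" that never occur in one of the corpora.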

Contrast plots

with(head(cmp, 100),
     plotWords(x = log(over), words = term, wordfreq = chi, random.y = TRUE))
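
The plot uses `log(over)` rather than `over` because the log makes the scale symmetric around zero: a term twice as frequent and a term half as frequent end up equally far from the centre. A small sketch with the `over` values from the comparison table (the names `log_over` and `side` are illustrative):

```r
# "over" values copied from the comparison table
over <- c(job = 3.3614321, terrorist = 0.2183120, freedom = 0.2170491,
          Iraq = 0.2574317, terror = 0.1661740, business = 3.0423131)

# Log ratio: 0 means equally frequent in both corpora
log_over <- log(over)

# Positive: overrepresented in Obama's speeches; negative: underrepresented
side <- ifelse(log_over > 0, "Obama", "others")

# Symmetry of the log scale: doubling and halving are mirror images
isTRUE(all.equal(log(2), -log(0.5)))
```

In the contrast plot, terms typical of Obama therefore appear on the right and terms typical of the other speeches on the left, with size reflecting `chi`.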