Text Analysis in R

Wouter van Atteveldt
Session 5: Sentiment Analysis and Machine Learning

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

Saturday:

Sentiment Analysis
Machine Learning

Sunday:

Basic visualization
Semantic Network Analysis
Graph Visualization

Sentiment Analysis

What is the tone of a text?
Techniques (e.g. Pang & Lee 2008, Liu 2012)
- Manual coding
- Dictionaries
- Machine Learning
- Crowdsourcing (Benoit et al. 2015)

Sentiment Analysis: problems

“The man who leaked cell-phone coverage of Saddam Hussein's execution was arrested”

Language is subjective, ambiguous, creative
What does positive/negative mean?
- e.g. Osgood et al. 1957: evaluation, potency, activity
Who is positive/negative about what?
- Sentiment Attribution

Sentiment Analysis resources

Lexicon (dictionary)
Annotated texts
Tools / models

Interlude: Downloading and Parsing files

Lexical Sentiment Analysis

Get list of positive / negative terms
Count occurrences in text
Summarize to sentiment score
Possible improvements
- Word-window approach (tomorrow)
- Deal with negation, intensification
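The three steps above can be sketched in base R as follows. The word lists and texts are toy data made up for illustration, and the score (positive minus negative hits, divided by all hits) is only one of several possible summaries.

```r
# Toy lexicon and texts (hypothetical, for illustration only)
pos_words = c("good", "great", "excellent")
neg_words = c("bad", "boring", "terrible")
texts = c(d1 = "A great film with an excellent cast",
          d2 = "Boring plot and terrible acting",
          d3 = "Good idea but bad execution")

# Split each text into lower-case word tokens
tokens = strsplit(tolower(texts), "[^a-z]+")

# Count lexicon hits per document
npos = sapply(tokens, function(x) sum(x %in% pos_words))
nneg = sapply(tokens, function(x) sum(x %in% neg_words))

# Summarize to a score between -1 and 1; documents without hits get 0
nhits = npos + nneg
sentiment = ifelse(nhits == 0, 0, (npos - nneg) / nhits)
print(data.frame(npos, nneg, sentiment))
```

Note that this sketch ignores negation and intensification, so "not good" still counts as one positive hit.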

Lexical Sentiment Analysis in R

Nothing new here!
Directly count words in DTM:

library(slam)
reviews$npos = row_sums(dtm[, colnames(dtm) %in% pos_words])

Count words in token list:

tokens$sent[tokens$lemma %in% pos_words] = 1
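A slightly fuller sketch of the token-list approach, assuming a token list with one row per token. The `doc_id` column, the negative word list, and the data are assumptions for illustration; `lemma` and `sent` follow the line above.

```r
# Toy token list: one row per token (hypothetical data; doc_id is an assumed column)
tokens = data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2),
  lemma  = c("great", "movie", "good", "bad", "plot", "good"),
  stringsAsFactors = FALSE)
pos_words = c("good", "great")
neg_words = c("bad", "awful")

# Mark each token as positive (1), negative (-1) or neutral (0)
tokens$sent = 0
tokens$sent[tokens$lemma %in% pos_words] = 1
tokens$sent[tokens$lemma %in% neg_words] = -1

# Sum the token values per document
scores = aggregate(sent ~ doc_id, data = tokens, FUN = sum)
print(scores)
```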

Interactive Session 5a

Lexical Sentiment Analysis

Machine Learning

Statistical Modeling
- Dependent Variable: sentiment, topic, frame
- Independent Variables: words
Focus: prediction rather than explanation
- Millions of correlated independent variables

Text Classification

Each text has a 'class'
Training documents to fit model
Test documents to gauge accuracy
- (or use cross-validation)
Choices:
- What features?
- Which model?
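Splitting coded documents into training and test sets, or into folds for cross-validation, could look like the sketch below. The number of documents, the 80/20 split, and the choice of five folds are arbitrary assumptions.

```r
set.seed(1)                      # make the random split reproducible
n = 100                          # hypothetical number of coded documents

# Hold-out split: 80% training, 20% test
shuffled = sample(n)
train = shuffled[1:80]
test  = shuffled[81:100]

# 5-fold cross-validation: assign each document to exactly one fold
folds = sample(rep(1:5, length.out = n))
for (k in 1:5) {
  test_k  = which(folds == k)    # documents held out in this round
  train_k = which(folds != k)    # documents used to fit the model
  # fit the model on train_k and evaluate it on test_k here
}
```

With cross-validation every document is used for testing exactly once, which helps when the amount of coded material is small.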

Text Classification: features

Features: independent variables
Basic approach: each word is a feature
Other options e.g.
- Collocations (n-grams)
- LDA Topics
- Feature selection
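As an illustration of n-gram features, the following base R sketch adds bigrams to the single-word features. The helper `ngrams` and the example texts are hypothetical; in practice a text analysis package would normally do this step.

```r
texts = c(d1 = "not good at all", d2 = "very good indeed")

# Return all unigrams and bigrams for one text (hypothetical helper)
ngrams = function(text) {
  words = strsplit(tolower(text), "[^a-z]+")[[1]]
  words = words[words != ""]
  if (length(words) < 2) return(words)
  c(words, paste(head(words, -1), tail(words, -1), sep = "_"))
}

features = lapply(texts, ngrams)
vocabulary = sort(unique(unlist(features)))

# Document-feature matrix: one row per text, one column per feature
dfm = t(sapply(features, function(f)
  tabulate(match(f, vocabulary), nbins = length(vocabulary))))
colnames(dfm) = vocabulary
print(dfm[, c("good", "not_good", "very_good")])
```

Both texts contain "good", but only the bigram columns separate "not good" from "very good".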

Text Classification: models

Naive Bayes
Maximum Entropy
Support Vector Machines
Neural Networks (deep learning)
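To give an idea of what such a model does, here is a minimal sketch of the first model in the list, a multinomial Naive Bayes classifier with add-one smoothing, written in base R. The functions `train_nb` and `predict_nb` and the toy data are made up for illustration and are not part of any package.

```r
# Toy document-term matrix (rows: documents, columns: word counts); hypothetical data
dtm = rbind(
  c(good = 2, great = 1, bad = 0, boring = 0),
  c(good = 1, great = 2, bad = 0, boring = 0),
  c(good = 0, great = 0, bad = 2, boring = 1),
  c(good = 0, great = 1, bad = 1, boring = 2))
classes = c("pos", "pos", "neg", "neg")

# Training: class priors and smoothed word probabilities per class
train_nb = function(dtm, classes) {
  labels = sort(unique(classes))
  prior = sapply(labels, function(l) mean(classes == l))
  # add-one (Laplace) smoothing avoids zero probabilities
  word_prob = t(sapply(labels, function(l) {
    counts = colSums(dtm[classes == l, , drop = FALSE]) + 1
    counts / sum(counts)
  }))
  list(labels = labels, log_prior = log(prior), log_word = log(word_prob))
}

# Prediction: pick the class with the highest log posterior
predict_nb = function(model, newdtm) {
  scores = newdtm %*% t(model$log_word)            # documents x classes
  scores = sweep(scores, 2, model$log_prior, "+")  # add the log prior per class
  model$labels[max.col(scores, ties.method = "first")]
}

model = train_nb(dtm, classes)
new_docs = rbind(c(good = 1, great = 1, bad = 0, boring = 0),
                 c(good = 0, great = 0, bad = 1, boring = 1))
print(predict_nb(model, new_docs))
```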

Combining models

Ensemble Learning
- Train multiple models
- Decide by vote
Active Learning
- Code limited amount of material
- Train+test model
- Code most difficult cases, repeat
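Both ideas can be illustrated with a small base R sketch. The predictions below are invented, and treating disagreement between models as a sign of a "difficult" case is one possible choice, not the only one.

```r
# Hypothetical predictions of three models for five documents
predictions = data.frame(
  svm    = c("pos", "neg", "pos", "neg", "pos"),
  maxent = c("pos", "neg", "neg", "neg", "pos"),
  nb     = c("pos", "pos", "neg", "neg", "neg"),
  stringsAsFactors = FALSE)

# Ensemble: take the most frequent label per document
# (with an even number of models, ties would go to the alphabetically first label)
majority = function(votes) names(which.max(table(votes)))
ensemble = apply(predictions, 1, majority)

# Agreement: share of models that voted for the winning label
agreement = apply(predictions, 1,
                  function(votes) max(table(votes)) / length(votes))

# Active learning: documents the models disagree on are candidates for manual coding
to_code = which(agreement < 1)
print(data.frame(ensemble, agreement))
print(to_code)
```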

Text Classification in R

Package RTextTools
- Jurka et al. 2013
Based on DTM plus coded classes
Does learning, evaluation, prediction

Text Classification in R

(1) Create 'container' from DTM + coded classes

library(RTextTools)
# train and test are vectors of row indices into dtm
c = create_container(dtm, classes,
  trainSize=train, testSize=test, virgin=FALSE)

(2) Train and test model

SVM <- train_model(c,"SVM")
SVM_CLASSIFY <- classify_model(c, SVM)

(3) Evaluate

analytics <- create_analytics(c, SVM_CLASSIFY)
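To make the evaluation step more concrete, the sketch below computes common evaluation measures (accuracy, precision, recall, F1) by hand in base R. The manual codes and predictions are invented, and this is an illustration of the measures rather than of how the package computes them.

```r
# Hypothetical manual codes and model predictions for ten test documents
labels    = c("neg", "pos")
actual    = c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg")
predicted = c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "pos")

# Confusion matrix: rows are predictions, columns are manual codes
# (fixing the factor levels keeps the table square even if a class is never predicted)
confusion = table(predicted = factor(predicted, levels = labels),
                  actual    = factor(actual, levels = labels))
print(confusion)

# Share of documents classified correctly
accuracy = sum(diag(confusion)) / sum(confusion)

# Per-class precision, recall and F1 score
precision = diag(confusion) / rowSums(confusion)
recall    = diag(confusion) / colSums(confusion)
f1        = 2 * precision * recall / (precision + recall)
print(round(cbind(precision, recall, f1), 2))
```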

Code new material

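is_coded = !is.na(classes)
# train on the coded rows, classify the uncoded ones (row indices, not logicals)
c = create_container(dtm, classes,
  trainSize=which(is_coded), testSize=which(!is_coded), virgin=TRUE)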
SVM <- train_model(c,"SVM")
SVM_CLASSIFY <- classify_model(c, SVM)
analytics <- create_analytics(c, SVM_CLASSIFY)
head(analytics@document_summary)

Some links:

Burscher et al. 2014: Framing with SVMs
Purpura & Wilkerson 2007: Active Learning for Agenda Coding
Some online resources:

Interactive Session 5b

Text Classification for Sentiment Analysis

Hands-on session 5

Break

Handouts:

Sentiment Analysis Resources
Lexical Sentiment Analysis
Machine Learning