Text Analysis in R

Wouter van Atteveldt
Session 5: Sentiment Analysis and Machine Learning

Course Overview

Thursday: Introduction to R

Friday: Corpus Analysis & Topic Modeling

Saturday:

  • Sentiment Analysis
  • Machine Learning

Sunday:

  • Basic visualization
  • Semantic Network Analysis
  • Graph Visualization

Sentiment Analysis

  • What is the tone of a text?
  • Techniques (e.g. Pang & Lee 2008; Liu 2012)
    • Manual coding
    • Dictionaries
    • Machine Learning
    • Crowdsourcing (Benoit et al. 2015)

Sentiment Analysis: problems

“The man who leaked cell-phone coverage of Saddam Hussein's execution was arrested”

  • Language is subjective, ambiguous, creative
  • What does positive/negative mean?
    • e.g. Osgood et al. 1957: evaluation, potency, activity
  • Who is positive/negative about what?
    • Sentiment Attribution

Sentiment Analysis resources

  • Lexicon (dictionary)
  • Annotated texts
  • Tools / models

Interlude: Downloading and Parsing files

Lexical Sentiment Analysis

  • Get list of positive / negative terms
  • Count occurrences in text
  • Summarize to sentiment score
  • Possible improvements
    • Word-window approach (tomorrow)
    • Deal with negation, intensification
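
The steps above can be sketched in a few lines of base R. This is a minimal illustration, not a validated method: the word lists, the example texts, the helper name score_text, and the rule "flip the polarity of a term that directly follows a negator" are all assumptions made for the example.

```r
# Hypothetical mini-lexicon and example texts, for illustration only
pos_words <- c("good", "great", "excellent")
neg_words <- c("bad", "boring", "terrible")
negators  <- c("not", "never", "no")

texts <- c("a great film with excellent acting",
           "not good and rather boring",
           "terrible plot but good music")

score_text <- function(text) {
  # Lowercase and split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # +1 for positive terms, -1 for negative terms, 0 otherwise
  polarity <- ifelse(tokens %in% pos_words, 1,
              ifelse(tokens %in% neg_words, -1, 0))
  # Simple negation rule (an assumption): flip a term that directly follows a negator
  negated <- c(FALSE, head(tokens, -1) %in% negators)
  polarity[negated] <- -polarity[negated]
  npos <- sum(polarity > 0)
  nneg <- sum(polarity < 0)
  # Summarize to a score in [-1, 1]; 0 if the text has no sentiment terms
  if (npos + nneg == 0) 0 else (npos - nneg) / (npos + nneg)
}

scores <- sapply(texts, score_text, USE.NAMES = FALSE)
```

Dividing by the number of sentiment terms rather than by text length is one of several possible normalizations; which one is appropriate depends on the research question.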

Lexical Sentiment Analysis in R

  • Nothing new here!
  • Directly count words in DTM:
library(slam)
reviews$npos = row_sums(dtm[, colnames(dtm) %in% pos_words])
  • Count words in token list:
tokens$sent[tokens$lemma %in% pos_words] = 1
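
A self-contained sketch of both approaches on toy data. The matrix, the word lists, and the column names nneg and sent are made up for the example; a dense base R matrix with rowSums() stands in for a sparse DTM with slam::row_sums(), so the code runs without extra packages.

```r
# Toy document-term matrix: rows are documents, columns are term counts
dtm <- matrix(c(2, 0, 1, 0,
                0, 1, 0, 3,
                1, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("good", "bad", "great", "awful")))
pos_words <- c("good", "great")   # hypothetical lexicon
neg_words <- c("bad", "awful")

# (a) Count lexicon words per document directly in the DTM
reviews <- data.frame(id = rownames(dtm))
reviews$npos <- rowSums(dtm[, colnames(dtm) %in% pos_words, drop = FALSE])
reviews$nneg <- rowSums(dtm[, colnames(dtm) %in% neg_words, drop = FALSE])
reviews$sent <- (reviews$npos - reviews$nneg) / (reviews$npos + reviews$nneg)

# (b) Mark lexicon words in a token list, then sum per document
tokens <- data.frame(doc   = c("d1", "d1", "d2"),
                     lemma = c("good", "film", "awful"))
tokens$sent <- 0
tokens$sent[tokens$lemma %in% pos_words] <- 1
tokens$sent[tokens$lemma %in% neg_words] <- -1
doc_sent <- tapply(tokens$sent, tokens$doc, sum)
```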

Interactive Session 5a

Lexical Sentiment Analysis

Machine Learning

  • Statistical Modeling
    • Dependent Variable: sentiment, topic, frame
    • Independent Variables: words
  • Focus: prediction rather than explanation
    • Millions of correlated independent variables

Text Classification

  • Each text has a 'class'
  • Training documents to fit model
  • Test documents to gauge accuracy
    • (or use cross-validation)
  • Choices:
    • What features?
    • Which model?

Text Classification: features

  • Features: independent variables
  • Basic approach: each word is a feature
  • Other options e.g.
    • Collocations (n-grams)
    • LDA Topics
    • Feature selection
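
As an illustration of word and n-gram features, the sketch below builds a small document-feature matrix with unigrams and bigrams in base R and then applies a simple frequency-based feature selection. The texts, the helper name ngrams, and the threshold (keep features occurring in at least two documents) are assumptions made for the example.

```r
texts <- c("not good at all", "very good film", "not a good film")

# Return the unigrams of a text plus its n-grams (joined with "_")
ngrams <- function(text, n = 2) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(tokens) < n) return(tokens)
  starts <- seq_len(length(tokens) - n + 1)
  grams <- sapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = "_"))
  c(tokens, grams)
}

features <- lapply(texts, ngrams)
vocab <- sort(unique(unlist(features)))

# Document-feature matrix: one row per text, one column per unigram or bigram
dfm <- matrix(0L, nrow = length(texts), ncol = length(vocab),
              dimnames = list(NULL, vocab))
for (i in seq_along(features)) {
  counts <- table(features[[i]])
  dfm[i, names(counts)] <- as.integer(counts)
}

# Simple feature selection: keep features that occur in at least 2 documents
keep <- colSums(dfm > 0) >= 2
dfm_selected <- dfm[, keep, drop = FALSE]
```

Note that the bigram not_good separates the first text from the other two, which the unigram good alone cannot do.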

Text Classification: models

  • Naive Bayes
  • Maximum Entropy
  • Support Vector Machines
  • Neural Networks
  • (deep learning)
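
To make the first model in the list concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in base R on a toy DTM. The function names train_nb and predict_nb and all data are invented for this illustration; in practice one would use an existing package implementation.

```r
# Toy training DTM: rows are documents, columns are word counts
train_dtm <- matrix(c(2, 1, 0, 0,
                      1, 2, 0, 1,
                      0, 0, 2, 1,
                      0, 1, 1, 2),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(NULL, c("good", "great", "bad", "awful")))
train_y <- c("pos", "pos", "neg", "neg")

train_nb <- function(dtm, y, alpha = 1) {
  classes <- sort(unique(y))
  # Log prior: share of training documents per class
  prior <- log(as.numeric(table(factor(y, levels = classes))) / length(y))
  # Word counts per class plus smoothing constant alpha (classes x words)
  counts <- t(sapply(classes, function(k) colSums(dtm[y == k, , drop = FALSE]) + alpha))
  # Log probability of each word given the class
  loglik <- log(counts / rowSums(counts))
  list(classes = classes, prior = prior, loglik = loglik)
}

predict_nb <- function(model, dtm) {
  # Log posterior up to a constant: documents x classes
  scores <- dtm %*% t(model$loglik)
  scores <- sweep(scores, 2, model$prior, "+")
  model$classes[max.col(scores, ties.method = "first")]
}

model <- train_nb(train_dtm, train_y)
new_dtm <- matrix(c(1, 1, 0, 0,
                    0, 0, 2, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, colnames(train_dtm)))
pred <- predict_nb(model, new_dtm)
```

The new DTM must have the same columns in the same order as the training DTM, since the prediction is a plain matrix product.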

Combining models

  • Ensemble Learning
    • Train multiple models
    • Decide by vote
  • Active Learning
    • Code limited amount of material
    • Train+test model
    • Code most difficult cases, repeat
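
Both ideas can be sketched on a table of predictions. The predictions below are made up, and using disagreement between models as a proxy for "most difficult cases" is an assumption of this example (one of several possible selection criteria for active learning).

```r
# Hypothetical predictions from three models for five documents
preds <- data.frame(
  svm    = c("pos", "neg", "pos", "neg", "pos"),
  maxent = c("pos", "neg", "neg", "neg", "pos"),
  nb     = c("pos", "pos", "neg", "neg", "neg"),
  stringsAsFactors = FALSE
)

# Ensemble learning: decide by majority vote per document
vote <- apply(preds, 1, function(p) names(which.max(table(p))))

# Agreement: share of models that support the winning label
agreement <- apply(preds, 1, function(p) max(table(p)) / length(p))

# Active learning: queue the documents with the lowest agreement for manual coding
to_code <- order(agreement)[1:2]
```

After coding the queued documents by hand, they are added to the training set and the models are trained and tested again.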

Text Classification in R

  • Package RTextTools
    • Jurka et al., 2013
  • Based on DTM plus coded classes
  • Does learning, evaluation, prediction

Text Classification in R

(1) Create 'container' from DTM + coded classes

library(RTextTools)
# virgin=FALSE: the test documents have known classes, so accuracy can be computed
c = create_container(dtm, classes, 
  trainSize=train, testSize=test, virgin=FALSE)

(2) Train and test model

SVM <- train_model(c,"SVM")
SVM_CLASSIFY <- classify_model(c, SVM)

(3) Evaluate

analytics <- create_analytics(c, SVM_CLASSIFY)
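
To show what an evaluation step computes, the sketch below derives accuracy, precision, recall, and F1 from a confusion matrix in base R. The actual and predicted labels are made up for the example; this is a hand-rolled illustration and not the output format of create_analytics.

```r
lv <- c("neg", "pos")
# Hypothetical gold-standard codes and model predictions for 8 test documents
actual    <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"), levels = lv)
predicted <- factor(c("pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(actual = actual, predicted = predicted)

accuracy  <- sum(diag(cm)) / sum(cm)     # share of correct predictions
precision <- diag(cm) / colSums(cm)      # per class: correct / all predicted as class
recall    <- diag(cm) / rowSums(cm)      # per class: correct / all actually in class
f1        <- 2 * precision * recall / (precision + recall)
```

Accuracy alone can be misleading when one class dominates, which is why precision and recall are usually reported per class.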

Code new material

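# Documents with a non-missing class have been coded manually
is_coded = !is.na(classes)
# virgin=TRUE: the material to classify has no known classes
c = create_container(dtm, classes, 
  trainSize=is_coded, virgin=TRUE)
SVM <- train_model(c,"SVM")
SVM_CLASSIFY <- classify_model(c, SVM)
analytics <- create_analytics(c, SVM_CLASSIFY)
# Predicted class per document
head(analytics@document_summary)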

Interactive Session 5b

Text Classification for Sentiment Analysis

Hands-on session 5

Break

Handouts:

  • Sentiment Analysis Resources
  • Lexical Sentiment Analysis
  • Machine Learning