9 May 2018

Structure of the workshop

  1. Brief introduction (10 min)
  2. How to perform basic operations using Quanteda (50 min)
  3. How to use machine learning models (30-45 min)
    • Overview of supervised and unsupervised models
    • Experiment with different training sets
  4. Questions and answers (15-30 min)

Introduction

What is Quanteda?

quanteda is an R package for quantitative text analysis developed by a team based at the LSE.

  • After 5 years of development, version 1.0 was released at the London R meeting in January
  • Developed for high consistency (Ken) and performance (Kohei)
  • Used by leading political scientists in North America, Europe and Asia
  • It is a stand-alone tool, but can also be used to develop other packages (e.g. politeness, preText, phrasemachine, tidytext, stm)
  • Quanteda Initiative CIC was founded to support the text analysis community

Materials to learn how to use Quanteda

Machine learning using Quanteda

  • Quanteda has original models for political scientists
    • textmodel_wordscore() for supervised document scaling
    • textmodel_wordfish() for unsupervised document scaling
    • textmodel_affinity() for supervised document scaling
  • We also have functions optimized for large textual data
    • textmodel_nb() for naive Bayes classification
    • textmodel_ca() for correspondence analysis
    • textmodel_lsa() for latent semantic analysis
  • Other packages work well with Quanteda
    • topicmodels or LDA for topic classification
    • LSS for semi-supervised document scaling

Overview

Preparation

We use the movie review corpus (n = 2000) to understand how to use machine learning models. We create a document-feature matrix from sentences, because only LSS requires sentence-level documents, and then regroup the sentences back into documents.

# devtools::install_github("quanteda/quanteda.corpora")
require(quanteda.corpora)
require(quanteda)

corp <- data_corpus_movies
docvars(corp, "manual") <- factor(docvars(corp, "Sentiment"), c("neg", "pos"))
corp_sent <- corpus_reshape(corp, to = "sentences") # LSS requires sentence-level documents
mt_sent <- dfm(corp_sent, remove_punct = TRUE) %>% 
           dfm_trim(min_termfreq = 5) %>% 
           dfm_remove(stopwords("en"), min_nchar = 2)
mt <- dfm_group(mt_sent, "id2") # regroup sentences back into documents

If we are not using LSS, the code will be shorter:

mt <- dfm(corp, remove_punct = TRUE) %>% 
          dfm_trim(min_termfreq = 5) %>% 
          dfm_remove(stopwords("en"), min_nchar = 2)

We will save the manual labels and the predictions from all the models in data (nb = Naive Bayes, ws = Wordscores, rf = Random Forest, wf = Wordfish, lss = LSS).

data <- data.frame(manual = docvars(mt, "manual"),
                   nb = NA, ws = NA, rf = NA, wf = NA, lss = NA)

Feature selection

You can choose features to be included in the models manually or automatically. The simplest way is to choose the most frequent ones after removing function words (stop words).

feat <- names(topfeatures(mt, 1000)) # the 1,000 most frequent features
mt <- dfm_select(mt, feat)

Separate training and test sets

Machine learning models have to be trained and tested on different datasets to measure their performance. This is called an "out-of-sample" or "holdout" test.

i <- seq(ndoc(mt)) # document indices
l <- i %in% sample(i, 1500) # TRUE for 1,500 randomly selected training documents
head(l)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
mt_train <- mt[l,] 
mt_test <- mt[!l,]

Measure of accuracy

  • Precision and recall are the standard measures of accuracy in classification
    • precision measures the percentage of items predicted as a class that truly belong to it
      • Checks if only the relevant items are retrieved
    • recall measures the percentage of items truly in a class that are predicted as it
      • Checks if all the relevant items are retrieved

accuracy <- function (x) {
    # x is a contingency table with manual labels in rows and predictions in columns
    c(neg_recall =  x[1,1] / sum(x[1,]),
      neg_precision = x[1,1] / sum(x[,1]),
      pos_recall = x[2,2] / sum(x[2,]),
      pos_precision = x[2,2] / sum(x[,2])
    )
}

You can also use caret::confusionMatrix() for more accuracy measures.
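
For example, confusionMatrix() accepts a contingency table of predicted and reference labels (a minimal sketch; pred and truth are hypothetical factors with identical levels):

# install.packages("caret")
require(caret)
# reports accuracy, Cohen's kappa, sensitivity, specificity and more
confusionMatrix(table(pred, truth))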

Naive Bayes

Naive Bayes is a supervised model for document classification (two or more classes). The model is "naive" because features (words) are assumed to be independent of each other.

nb <- textmodel_nb(mt_train, docvars(mt_train, "manual"))
data$nb[!l] <- predict(nb, newdata = mt_test) # since v1.2.2
#data$nb[!l] <- predict(nb, newdata = mt_test)$nb.predicted # until v1.2.0
head(data$nb, 20)
##  [1] NA NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  1

tb_nb <- table(data$manual, data$nb, dnn = c("manual", "nb"))
tb_nb
##       nb
## manual   1   2
##    neg 191  55
##    pos  52 202
tb_nb / rowSums(tb_nb)
##       nb
## manual         1         2
##    neg 0.7764228 0.2235772
##    pos 0.2047244 0.7952756
accuracy(tb_nb)
##    neg_recall neg_precision    pos_recall pos_precision 
##     0.7764228     0.7860082     0.7952756     0.7859922

Wordscores

Wordscores is a supervised model for document scaling (a continuous dimension), but we dichotomize the predicted scores (at 1.5, the midpoint of the training scores 1 and 2) to compare it with the classification models.

ws <- textmodel_wordscores(mt_train, as.numeric(docvars(mt_train, "manual")))
data$ws[!l] <- predict(ws, newdata = mt_test)
head(data$ws, 20)
##  [1]       NA       NA 1.480645       NA       NA       NA       NA
##  [8]       NA       NA       NA       NA       NA       NA       NA
## [15]       NA       NA       NA       NA       NA 1.481321

plot(data$manual, data$ws)

tb_ws <- table(data$manual, data$ws > 1.5)
tb_ws
##      
##       FALSE TRUE
##   neg   211   35
##   pos    70  184
tb_ws / rowSums(tb_ws)
##      
##           FALSE      TRUE
##   neg 0.8577236 0.1422764
##   pos 0.2755906 0.7244094
accuracy(tb_ws)
##    neg_recall neg_precision    pos_recall pos_precision 
##     0.8577236     0.7508897     0.7244094     0.8401826

Random Forest

Random forest is a rule-based supervised model that can be used for both scaling and classification. It is "random" because it aggregates many decision trees grown on random subsets of the data and features to improve prediction accuracy.

# install.packages("randomForest")
require(randomForest)
rf <- randomForest(as.matrix(mt_train), docvars(mt_train, "manual"))
data$rf[!l] <- predict(rf, as.matrix(mt_test)) # predict.randomForest expects a dense matrix
head(data$rf, 20)
##  [1] NA NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  1

tb_rf <- table(data$manual, data$rf, dnn = c("manual", "rf"))
tb_rf
##       rf
## manual   1   2
##    neg 196  50
##    pos  48 206
tb_rf / rowSums(tb_rf)
##       rf
## manual         1         2
##    neg 0.7967480 0.2032520
##    pos 0.1889764 0.8110236
accuracy(tb_rf)
##    neg_recall neg_precision    pos_recall pos_precision 
##     0.7967480     0.8032787     0.8110236     0.8046875

Wordfish

Wordfish is an unsupervised document scaling model that computes parameters for both documents (theta) and features (beta).

wf <- textmodel_wordfish(mt)
data$wf <- wf$theta
head(data$wf, 20)
##  [1]  0.1564936  1.4026989  0.6085823  0.1760803  0.3939215  3.2890843
##  [7] -0.3647082 -0.4741440 -0.3850306  0.2007902 -0.4272846 -1.0839554
## [13] -0.3803669  1.8204034 -0.1884703  3.3901761  0.3184877  2.4216386
## [19] -0.3049630 -1.4918669

Wordfish parameters are normalized, but their direction is arbitrary (the sign can flip between runs).
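
If you know which documents should sit at opposite ends of the scale, the direction can be anchored with the dir argument (a sketch; the document indices 1 and 2 here are arbitrary choices):

# dir = c(1, 2) constrains theta of the first document to be lower than
# that of the second, fixing the sign of the scale
wf_fixed <- textmodel_wordfish(mt, dir = c(1, 2))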

tb_wf <- table(data$manual, data$wf < 0)
tb_wf
##      
##       FALSE TRUE
##   neg   482  518
##   pos   338  662
tb_wf / rowSums(tb_wf)
##      
##       FALSE  TRUE
##   neg 0.482 0.518
##   pos 0.338 0.662
accuracy(tb_wf)
##    neg_recall neg_precision    pos_recall pos_precision 
##     0.4820000     0.5878049     0.6620000     0.5610169

head(coef(wf, "features")[,"beta"], 20)
##         plot          two         teen           go        party 
##  0.099871996 -0.121263663 -0.826791790 -0.050350878 -1.068192312 
##          get          one         guys   girlfriend          see 
## -0.103645211 -0.009295752 -0.009905168 -0.523901075 -0.032039613 
##         life         deal        watch        movie         find 
## -0.551826958 -0.083093991  0.001089785  0.066300885 -0.028463000 
##         cool         idea          bad        makes       review 
##  0.119198118  0.043299950  0.075673640 -0.062960855  0.115963997

tail(coef(wf, "features")[,"beta"], 20)
##        eddie      million          key       willis       proves 
## -0.705131717  0.087806469  0.034362080 -0.005488957 -0.294641658 
##       leaves    meanwhile      tension      details       longer 
## -0.446585818 -0.117838856  0.085427682 -0.032559659 -0.047399892 
##       murphy      leaving         wars         park        frank 
##  0.173231997 -0.193479944  0.402656964  0.096875440 -0.548319151 
##       subtle    chemistry      cameron     approach       truman 
## -0.224190841 -0.202066640  0.233137412  0.053951384 -2.033043129

Latent Semantic Scaling

LSS is a semi-supervised document scaling model that combines Latent Semantic Analysis and Wordscores.

# devtools::install_github("koheiw/LSS")
require(LSS)
lss <- textmodel_lss(mt_sent, seedwords("pos-neg"), feat, cache = TRUE)
data$lss <- predict(lss, newdata = mt)
head(data$lss, 20)
##  [1] -0.523860370  0.035211933  1.128545427  0.002955313  0.176996060
##  [6]  0.048574852 -0.273897146 -1.253643286 -0.049215736 -0.758963054
## [11] -1.054276159  0.237964039 -1.483888215 -0.198716358 -0.260711747
## [16] -1.121059220 -0.528289888  0.452106480 -0.988729663 -1.381473878

seedwords("pos-neg")
##        good        nice   excellent    positive   fortunate     correct 
##           1           1           1           1           1           1 
##    superior         bad       nasty        poor    negative unfortunate 
##           1          -1          -1          -1          -1          -1 
##       wrong    inferior 
##          -1          -1

LSS parameters are normalized, and their direction is determined by the seed words.

plot(data$manual, data$lss)

tb_lss <- table(data$manual, data$lss > 0)
tb_lss
##      
##       FALSE TRUE
##   neg   615  385
##   pos   390  610
tb_lss / rowSums(tb_lss)
##      
##       FALSE  TRUE
##   neg 0.615 0.385
##   pos 0.390 0.610
accuracy(tb_lss)
##    neg_recall neg_precision    pos_recall pos_precision 
##     0.6150000     0.6119403     0.6100000     0.6130653

head(coef(lss), 20)
##   excellent        nice        good       space      forced        onto 
##  0.08933867  0.07307900  0.06855113  0.05388397  0.04603286  0.04450034 
##     instead        sees    delivers      others      living     talking 
##  0.04194617  0.03885763  0.03676355  0.03550467  0.03370298  0.03355209 
##        high       gives        gave intelligent      highly      simple 
##  0.03348565  0.03327898  0.03315461  0.03313089  0.03249108  0.03231112 
##      school       stuff 
##  0.03224447  0.03192881

tail(coef(lss), 20)
##   emotional       brief    annoying      appear       every     parents 
## -0.03380183 -0.03491747 -0.03533307 -0.03676034 -0.03740869 -0.03767511 
##    romantic         top    expected       worse        near    supposed 
## -0.03882562 -0.03898860 -0.04016299 -0.04129789 -0.04164269 -0.04247639 
##     release        york       worst       peter        poor     problem 
## -0.04499426 -0.04571499 -0.04887476 -0.05690101 -0.06046258 -0.06751607 
##       wrong         bad 
## -0.06777416 -0.07800422

Comparison

The supervised models (Naive Bayes, Random Forest, Wordscores) performed well, but Wordfish did not. LSS falls somewhere in between.
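
The accuracy vectors computed above can be stacked for a side-by-side comparison:

# combine the accuracy measures of all five models into one table
round(rbind(nb = accuracy(tb_nb), ws = accuracy(tb_ws), rf = accuracy(tb_rf),
            wf = accuracy(tb_wf), lss = accuracy(tb_lss)), 2)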

Experiment

How big should the training set be?

Train Naive Bayes on training sets of different sizes (100 to 1000 documents) to see how classification accuracy changes.

l2 <- i %in% sample(i, 1000) # pool of 1,000 documents for training
mt_test2 <- mt[!l2,] # the remaining 1,000 documents for testing

data2 <- data.frame()
for (n in seq(100, 1000, by = 100)) {
    for (m in 1:20) { # 20 random training sets for each size
        mt_train2 <- mt[sample(i[l2], n),]
        nb <- textmodel_nb(mt_train2, docvars(mt_train2, "manual"))
        docvars(mt_test2, "nb") <- predict(nb, newdata = mt_test2) # since v1.2.2
        #docvars(mt_test2, "nb") <- predict(nb, newdata = mt_test2)$nb.predicted # until v1.2.0
        tb_temp <- table(docvars(mt_test2, "manual"), docvars(mt_test2, "nb"))
        temp <- as.data.frame(rbind(accuracy(tb_temp)))
        temp$size <- n
        data2 <- rbind(data2, temp)
    }
}
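
To see how accuracy changes with the size of the training set, you can plot the mean of each measure by size (a sketch that aggregates the simulation results in data2):

# average the 20 runs for each training-set size and plot all four measures
acc <- aggregate(cbind(neg_recall, neg_precision, pos_recall, pos_precision)
                 ~ size, data = data2, FUN = mean)
matplot(acc$size, acc[,-1], type = "l", lty = 1, col = 1:4,
        xlab = "Size of training set", ylab = "Accuracy")
legend("bottomright", colnames(acc)[-1], lty = 1, col = 1:4)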

The training set should have 500 or more documents to reach high performance.

This suggests that even the least frequent features have to appear in at least 10 to 30 training documents.

feat_rare <- tail(feat, 10) # the 10 least frequent of the selected features
feat_rare
##  [1] "songs"   "race"    "band"    "stands"  "writers" "hardly"  "chan"   
##  [8] "details" "wasted"  "apart"
# expected document frequency in a training set of 500 documents
docfreq(mt_train2)[feat_rare] * (500 /  ndoc(mt_train2))
##   songs    race    band  stands writers  hardly    chan details  wasted 
##    16.5    22.0    22.0    24.5    29.5    29.0     8.5    27.5    29.0 
##   apart 
##    23.5

Distribution of features

Word frequency follows a long-tail distribution (Zipf's law), so low-ranked words are very rare.
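
You can check this with a rank-frequency plot on log-log axes (a sketch using the full document-feature matrix):

# a roughly straight line on log-log axes indicates a Zipf-like distribution
mt_all <- dfm(corp, remove_punct = TRUE)
freq <- topfeatures(mt_all, nfeat(mt_all))
plot(seq_along(freq), freq, log = "xy", xlab = "Rank", ylab = "Frequency")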

The movie review corpus is actually not very sparse; the sparsity of other types of corpora can be even higher.

sparsity(dfm(corp, remove_punct = TRUE))
## [1] 0.9929643

Conclusions

Machine learning models are easy to use for text analysis with quanteda, but you have to be aware of the costs.

  • Supervised models require a large training set, especially when
    • corpus is sparse (e.g. news articles)
    • category/dimension is specific (e.g. social scientific concepts)
    • model is complex (e.g. neural network)
  • Wordfish often produces random results, especially when
    • corpus is sparse
    • documents mix different topics
  • LSS is free of these problems, but
    • individual predictions are not very accurate
    • it requires valid seed words and a very large corpus (> 5000 documents)