This document shows how to build a word-prediction application using n-gram models. The application behaves like the predictive-text (smart keyboard) feature on smartphones.
This document provides details on:
The data used in this project can be downloaded from this link.
The data consists of three files:
Blogs
Tweets
News
Summary information for the three files:
Machine specs used to perform these tasks:
Intel Core i5-4300U CPU @ 1.90 GHz (2.49 GHz)
Memory: 8 GB
The different pipelines I tried to clean and tokenize the data:
Read data -> clean data -> 2-5 tokenize (ngram package): the 2-, 3-, and 4-grams were quick, but the machine froze on the 5-gram
Read data -> 2-5 tokenize and clean (quanteda): worked until the 5-gram, then ran out of memory
Read data -> 2-5 tokenize and clean (quanteda) -> dfm -> dfm_trim with minimum frequency of 4: more than 35 minutes on the 5-gram
Read data -> clean data -> 2-5 tokenize (quanteda) -> dfm -> dfm_trim with minimum frequency of 4: between 30 and 35 minutes on the 5-gram
Read data -> clean data -> 2-5 tokenize (quanteda) -> dfm (with tolower = FALSE) -> dfm_trim with minimum frequency of 4: the fastest, less than 30 minutes on the 5-gram
The clean-data function I used performs the following steps (a sketch follows the list):
Concatenate the text (ngram)
Preprocess to lower case and remove numbers (ngram)
Remove curse words (tm)
gsub() to remove punctuation, non-alphabet characters, foreign characters, and orphaned characters (base R)
Remove extra whitespace (tm)
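Put together, the cleaning step looks roughly like this; the specific functions (concatenate() and preprocess() from ngram, removeWords() and stripWhitespace() from tm), the regular expressions, and the curse_words profanity list are illustrative of the steps above rather than the exact code:

```r
library(ngram)  # concatenate(), preprocess()
library(tm)     # removeWords(), stripWhitespace()

# Illustrative cleaning function; `curse_words` is a profanity word list.
clean_text <- function(lines, curse_words) {
  txt <- concatenate(lines)                        # ngram: collapse lines into one string
  txt <- preprocess(txt, case = "lower",
                    remove.numbers = TRUE)         # ngram: lower case, strip numbers
  txt <- removeWords(txt, curse_words)             # tm: drop profanity
  txt <- gsub("[^a-z' ]", " ", txt)                # base R: drop punctuation, non-alphabet and foreign characters
  txt <- gsub("\\b[b-hj-z]\\b", " ", txt)          # base R: drop orphaned single letters (keeps "a" and "i")
  txt <- stripWhitespace(txt)                      # tm: collapse repeated whitespace
  txt
}
```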
TIP 1: when cleaning, do not use piping from dplyr; memory will not be used efficiently.
Instead, I assigned each step's result to a new variable, removed the old variable with rm(), and reclaimed memory with gc().
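For example (object names are hypothetical):

```r
cleaned <- removeWords(concatenated, curse_words)  # each step's result goes into a new object
rm(concatenated)                                   # drop the previous step's object
gc()                                               # ask R to reclaim the freed memory
```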
TIP 2: since the input has already been cleaned and converted to lower case, there is no need to do it again when running the dfm() function. Using the same sample token file with an object size of 99.3 MB, here is what you gain:
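A quick way to measure the gain on your own data is to time both calls on the same tokens object (a sketch; `toks` is assumed to be built from already lower-cased text):

```r
system.time(dfm_slow <- dfm(toks))                   # default: re-lowercases every token
system.time(dfm_fast <- dfm(toks, tolower = FALSE))  # skips the redundant conversion
```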
Once the n-grams were processed, I converted the result to a data frame using tidy() and saved it to a file.
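That conversion looks roughly like this, assuming the tidy() method for dfm objects from the tidytext package and the trimmed dfm from the earlier sketch:

```r
library(tidytext)   # provides a tidy() method for quanteda dfm objects

ngram_df <- tidy(dfm5)                                           # columns: document, term, count
ngram_df <- aggregate(count ~ term, data = ngram_df, FUN = sum)  # total frequency per n-gram
names(ngram_df)[2] <- "freq"
saveRDS(ngram_df, "blogs_5gram.rds")                             # illustrative file name
```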
At this point I had three versions (blog, news, and tweets) of each 2- to 4-gram file. For each n-gram order, I loaded the three source files and merged them, identified the word combinations common to the three versions, and summed their word frequencies for accuracy.
I then converted each file from a data frame to a data.table and saved the files accordingly. At this point every word combination in each n-gram file is unique, and the merged files become the final n-gram model.
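A sketch of the merge for one n-gram order, assuming each source has been loaded as a table with term and freq columns (object and column names are illustrative):

```r
library(data.table)

# Stack the three per-source 3-gram tables and sum frequencies so each
# n-gram appears exactly once in the final model.
merged      <- rbindlist(list(blogs_3gram, news_3gram, tweets_3gram))
final_3gram <- merged[, .(freq = sum(freq)), by = term]
setorder(final_3gram, -freq)                  # most frequent combinations first
saveRDS(final_3gram, "final_3gram.rds")
```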
### The most efficient way to search and filter through the n-gram model
Since the n-gram model files are megabytes in size with millions of rows, it is important to use a search method that is efficient and fast. I found that sqldf is the fastest way to search through millions of rows. Below are the times it took to search the rows using different methods.
Using the same sample file with 1,416,902 observations:

| Search method          | Time (sec)  |
|------------------------|-------------|
| Data frame with dplyr  | 3.55 / 3.37 |
| data.table with dplyr  | 3.41 / 3.41 |
| data.table with sqldf  | 1.43 / 1.25 |
| Data frame with sqldf  | 1.30 / 1.25 |
sqldf is the fastest way to search through a large dataset.
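For example, a prefix lookup against the 3-gram model with sqldf looks roughly like this (table and column names follow the earlier sketches and are illustrative):

```r
library(sqldf)

# Most frequent 3-grams that start with the last two words typed.
prefix <- "at the "   # note the trailing space
hits <- sqldf(sprintf(
  "SELECT term, freq FROM final_3gram WHERE term LIKE '%s%%' ORDER BY freq DESC LIMIT 5",
  prefix))
```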
1) Straight word input to (n+1)-gram model search
If 1 word is entered, search the 2-gram model file
If 2 words are entered, search the 3-gram model file
If 3 words are entered, search the 4-gram model file
If more than 3 words are entered, use the backoff algorithm
2) Backoff Algorithm
Count the words entered
Process the words entered and determine the last word, the last 2 words, and the last 3 words
Using the last 3 words, search the 4-ngram model file
Using the last 2 words, search the 3-ngram model file
Using the last word, search the 2-ngram model file
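Putting both rules together, the prediction step can be sketched as below; search_ngram(prefix, n) is a hypothetical helper that stands for the sqldf lookup against the n-gram model shown earlier:

```r
# Pick the n-gram model one order above the number of words typed,
# and back off to lower orders when more than three words are entered.
predict_next <- function(input) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  n <- length(words)
  if (n == 0) return(data.frame())                    # nothing typed yet
  if (n <= 3) {
    return(search_ngram(paste(words, collapse = " "), n + 1))
  }
  for (k in 3:1) {                                    # last 3, then 2, then 1 word
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- search_ngram(prefix, k + 1)
    if (nrow(hits) > 0) return(hits)
  }
  hits                                                # empty result if nothing matched
}
```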
Link to Word Predicting Presentation
Link to Shiny Word Predicting App