Word Predicting App Documentation


Synopsis

This document shows how to build a word predicting application using ngram models. The application behaves like the predictive-text (smart keyboard) feature on smartphones.

This document provides details on:

  1. How to efficiently build (and clean) an ngram model
  2. The most efficient query method to search through the ngram model files
  3. Two algorithms for word prediction

Preparing the Data

Data used in this project can be downloaded from this link.

The data set consists of 3 files:

  1. Blogs

  2. Tweets

  3. News


Data information from the 3 files:

[Plot: sample summary of the 3 data files]


Cleaning Data

Machine specs used to perform these tasks:

CPU: Intel Core i5-4300U @ 1.90 GHz (2.49 GHz)

Memory: 8 GB


All the different approaches I tried to clean the data:

  1. Read data -> clean data -> tokenize 2- to 5-grams (ngram package): the 2-, 3-, and 4-grams were all quick, but the machine froze on the 5-grams

  2. Read data -> tokenize and clean 2- to 5-grams (quanteda): worked until the 5-grams, then ran out of memory

  3. Read data -> tokenize and clean 2- to 5-grams (quanteda) -> dfm -> dfm_trim with min frequency of 4: more than 35 minutes on the 5-grams

  4. Read data -> clean data -> tokenize 2- to 5-grams (quanteda) -> dfm -> dfm_trim with min frequency of 4: between 30 and 35 minutes on the 5-grams

  5. Read data -> clean data -> tokenize 2- to 5-grams (quanteda) -> dfm (with tolower = FALSE) -> dfm_trim with min frequency of 4: the fastest, less than 30 minutes (see the sketch after this list)
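
A minimal sketch of approach 5 with quanteda, assuming the text has already been cleaned and lower-cased in the previous step; build_ngram_dfm and its arguments are illustrative names, and the calls use the current quanteda API (tokens, tokens_ngrams, dfm, dfm_trim):

```r
library(quanteda)

# Sketch of approach 5: the text is assumed to be pre-cleaned and lower-cased,
# so dfm() is told not to lower-case again (see TIP 2 below).
build_ngram_dfm <- function(clean_text, n = 5, min_freq = 4) {
  toks  <- tokens(clean_text, what = "word")
  grams <- tokens_ngrams(toks, n = n, concatenator = " ")
  d     <- dfm(grams, tolower = FALSE)       # skip re-lower-casing
  dfm_trim(d, min_termfreq = min_freq)       # drop ngrams seen fewer than min_freq times
}
```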


The clean data function I used (sketched after this list):

  1. Concatenate (ngram)

  2. Preprocess to lower case and remove numbers (ngram)

  3. Remove curse words (tm)

  4. gsub to remove punctuation, non-alphabet characters, foreign characters, and orphaned characters (base R)

  5. Remove whitespace (tm)
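
A minimal sketch of this cleaning function, assuming ngram::concatenate / ngram::preprocess and tm::removeWords / tm::stripWhitespace for the library steps; clean_data, curse_words, and the regular expressions are illustrative assumptions:

```r
library(ngram)  # concatenate(), preprocess()
library(tm)     # removeWords(), stripWhitespace()

# Sketch of the cleaning steps above. 'clean_data' and 'curse_words'
# are illustrative names; 'curse_words' is a character vector of profanity.
clean_data <- function(lines, curse_words) {
  txt <- concatenate(lines)                                       # 1. ngram
  txt <- preprocess(txt, case = "lower", remove.numbers = TRUE)   # 2. ngram
  txt <- removeWords(txt, curse_words)                            # 3. tm
  txt <- gsub("[^a-z' ]", " ", txt)         # 4. base R: punctuation / non-alphabet
  txt <- gsub("\\b[b-hj-z]\\b", " ", txt)   # 4. base R: orphaned single letters
  stripWhitespace(txt)                      # 5. tm
}
```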


TIP 1: when cleaning, do not use piping from dplyr; memory won't be used efficiently.

I assigned each task's result to a new variable, then removed the old variable with rm() and reclaimed the memory with gc(), as in the snippet below.
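
For example, using the hypothetical names from the sketch above:

```r
# Assign each step's result to a new object, then free the previous one.
clean_text <- clean_data(raw_text, curse_words)
rm(raw_text)   # remove the old object
gc()           # ask R to return the freed memory
```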


TIP 2: since the input file has already been cleaned and converted to lower case, there is no need to do it again when running the dfm function. Using the same sample token file with an object size of 99.3 MB, here is what you will gain:

Once the ngrams are processed, I converted the result to a data frame using the tidy() function and saved it to a file.
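
A minimal sketch, assuming the tidy() method for quanteda dfm objects from the tidytext package and illustrative object and file names:

```r
library(tidytext)   # provides tidy() for quanteda dfm objects

# Convert the trimmed dfm to a long data frame (document, term, count)
# and save it; 'ngram_dfm' and the file name are illustrative.
ngram_df <- tidy(ngram_dfm)
saveRDS(ngram_df, "blogs_5gram.rds")
```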


Building Ngram Model

At this point I have 3 versions (blog, news, and tweet) of each 2- to 4-gram file. I loaded the matching ngram files and merged them, identified all the word combinations common to the 3 versions, and summed their frequencies for accuracy.

I then converted the files from data frames to data tables and saved them accordingly. At this point, every word combination in each ngram file is unique. The merged files become the final ngram model.
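
A minimal sketch of this merge with data.table, assuming each ngram table has columns ngram and freq (illustrative column names):

```r
library(data.table)

# Merge the blog, news, and tweet versions of one ngram table:
# stack them, then sum the frequencies of identical word combinations
# so that every ngram appears exactly once.
merge_ngrams <- function(blog_dt, news_dt, tweet_dt) {
  combined <- rbindlist(list(blog_dt, news_dt, tweet_dt))
  combined[, .(freq = sum(freq)), by = ngram]
}
```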


The Most Efficient Way to Search and Filter Through the Ngram Model

Since the ngram model files are megabytes in size, with millions of data rows, it is important to use a search method that is efficient and fast. I found that sqldf is the fastest way to search through millions of rows. Below are the times it took to search through the rows using different search methods.

Using the same sample file with 1,416,902 observations:

| Structure  | Search method | Time (run 1) | Time (run 2) |
|------------|---------------|--------------|--------------|
| Data frame | dplyr         | 3.55 sec     | 3.37 sec     |
| Data table | dplyr         | 3.41 sec     | 3.41 sec     |
| Data table | sqldf         | 1.43 sec     | 1.25 sec     |
| Data frame | sqldf         | 1.30 sec     | 1.25 sec     |

sqldf was the fastest way to search through a large data set.
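
A minimal sketch of such a lookup with sqldf, assuming a 4-gram table named ngram4 with columns ngram and freq (illustrative names):

```r
library(sqldf)

# Find the most frequent 4-grams starting with a given 3-word prefix.
# 'ngram4' is an assumed data frame with columns 'ngram' and 'freq'.
prefix <- "at the end"
sqldf(sprintf(
  "SELECT ngram, freq FROM ngram4 WHERE ngram LIKE '%s %%' ORDER BY freq DESC LIMIT 5",
  prefix
))
```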


2 Kinds of Word Prediction Algorithms

1) Straight search: take the last n words of the input and look them up directly in the (n+1)-gram model

2) Backoff algorithm: if the (n+1)-gram model has no match, back off to the next lower-order ngram model and repeat until a match is found (a sketch appears below)
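
A minimal sketch of the backoff algorithm, assuming the ngram models are held in a named list of data frames (models[["2"]] through models[["5"]]) with columns ngram and freq; it uses simple base-R prefix filtering here, whereas the app itself uses sqldf for the lookups:

```r
# Backoff prediction over ngram tables of decreasing order.
# 'models' is an assumed named list: models[["5"]] .. models[["2"]],
# each a data frame with columns 'ngram' (space-separated words) and 'freq'.
predict_word <- function(input, models, max_n = 5) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  for (n in max_n:2) {
    if (length(words) < n - 1) next                     # not enough context for this order
    prefix <- paste(tail(words, n - 1), collapse = " ")
    m      <- models[[as.character(n)]]
    hits   <- m[startsWith(m$ngram, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(tail(unlist(strsplit(best, " ")), 1))      # last word is the prediction
    }
  }
  NA_character_                                         # no match in any model
}
```

For example, predict_word("at the end of", models) would return the most frequent word following that phrase in the highest-order model that contains it.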



Link to Word Predicting Presentation

Link to Shiny Word Predicting App