This document shows how to build a word-prediction application using n-gram models. The application behaves like the predictive-text (smart keyboard) feature on smartphones.
This document provides details on:
The data used in this project can be downloaded from this link.
The data consists of three files:
Blogs
Tweets
News
Summary information for the three files:
Machine specs used to perform these tasks:
Intel Core i5-4300U CPU @ 1.90 GHz (2.49 GHz)
Memory: 8 GB
The different pipelines I tried to clean and tokenize the data:
Read data -> clean data -> 2-5 tokenize (ngram package): the 2-, 3-, and 4-grams were quick, but the machine froze on the 5-gram
Read data -> 2-5 tokenize and clean (quanteda): worked until the 5-gram, then ran out of memory
Read data -> 2-5 tokenize and clean (quanteda) -> dfm -> dfm_trim with minimum frequency of 4: more than 35 minutes on the 5-gram
Read data -> clean data -> 2-5 tokenize (quanteda) -> dfm -> dfm_trim with minimum frequency of 4: between 30 and 35 minutes on the 5-gram
Read data -> clean data -> 2-5 tokenize (quanteda) -> dfm (with tolower = FALSE) -> dfm_trim with minimum frequency of 4: the fastest, less than 30 minutes on the 5-gram
The clean-data function I used performs the following steps (a sketch follows the list):
Concatenate the text (ngram)
Preprocess to lower case and remove numbers (ngram)
Remove curse words (tm)
gsub() to remove punctuation, non-alphabet characters, foreign characters, and orphaned characters (base R)
Remove extra whitespace (tm)
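Put together, the cleaning step looks roughly like this; the specific functions (concatenate() and preprocess() from ngram, removeWords() and stripWhitespace() from tm), the regular expressions, and the curse_words profanity list are illustrative of the steps above rather than the exact code:

```r
library(ngram)  # concatenate(), preprocess()
library(tm)     # removeWords(), stripWhitespace()

# Illustrative cleaning function; `curse_words` is a profanity word list.
clean_text <- function(lines, curse_words) {
  txt <- concatenate(lines)                        # ngram: collapse lines into one string
  txt <- preprocess(txt, case = "lower",
                    remove.numbers = TRUE)         # ngram: lower case, strip numbers
  txt <- removeWords(txt, curse_words)             # tm: drop profanity
  txt <- gsub("[^a-z' ]", " ", txt)                # base R: drop punctuation, non-alphabet and foreign characters
  txt <- gsub("\\b[b-hj-z]\\b", " ", txt)          # base R: drop orphaned single letters (keeps "a" and "i")
  txt <- stripWhitespace(txt)                      # tm: collapse repeated whitespace
  txt
}
```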
TIP 1: when cleaning, do not use piping from dplyr; memory will not be used efficiently.
Instead, I assigned each step's result to a new variable, removed the old variable with rm(), and reclaimed memory with gc().
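For example (object names are hypothetical):

```r
cleaned <- removeWords(concatenated, curse_words)  # each step's result goes into a new object
rm(concatenated)                                   # drop the previous step's object
gc()                                               # ask R to reclaim the freed memory
```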
TIP 2: since the input has already been cleaned and converted to lower case, there is no need to do it again when running the dfm() function. Using the same sample token file with an object size of 99.3 MB, here is what you gain:
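A quick way to measure the gain on your own data is to time both calls on the same tokens object (a sketch; `toks` is assumed to be built from already lower-cased text):

```r
system.time(dfm_slow <- dfm(toks))                   # default: re-lowercases every token
system.time(dfm_fast <- dfm(toks, tolower = FALSE))  # skips the redundant conversion
```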
Once the n-grams were processed, I converted the result to a data frame using tidy() and saved it to a file.
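That conversion looks roughly like this, assuming the tidy() method for dfm objects from the tidytext package and the trimmed dfm from the earlier sketch:

```r
library(tidytext)   # provides a tidy() method for quanteda dfm objects

ngram_df <- tidy(dfm5)                                           # columns: document, term, count
ngram_df <- aggregate(count ~ term, data = ngram_df, FUN = sum)  # total frequency per n-gram
names(ngram_df)[2] <- "freq"
saveRDS(ngram_df, "blogs_5gram.rds")                             # illustrative file name
```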
At this point I had three versions (blog, news, and tweets) of each 2- to 4-gram file. For each n-gram order, I loaded the three source files and merged them, identified the word combinations common to the three versions, and summed their word frequencies for accuracy.
I then converted each file from a data frame to a data.table and saved the files accordingly. At this point every word combination in each n-gram file is unique, and the merged files become the final n-gram model.
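A sketch of the merge for one n-gram order, assuming each source has been loaded as a table with term and freq columns (object and column names are illustrative):

```r
library(data.table)

# Stack the three per-source 3-gram tables and sum frequencies so each
# n-gram appears exactly once in the final model.
merged      <- rbindlist(list(blogs_3gram, news_3gram, tweets_3gram))
final_3gram <- merged[, .(freq = sum(freq)), by = term]
setorder(final_3gram, -freq)                  # most frequent combinations first
saveRDS(final_3gram, "final_3gram.rds")
```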
### The most efficient way to search and filter through the n-gram model
Since the n-gram model files are megabytes in size with millions of rows, it is important to use a search method that is efficient and fast. I found that sqldf is the fastest way to search through millions of rows. Below are the times it took to search the rows using different methods.
Using the same sample file with 1,416,902 observations:

| Search method          | Time (sec)  |
|------------------------|-------------|
| Data frame with dplyr  | 3.55 / 3.37 |
| data.table with dplyr  | 3.41 / 3.41 |
| data.table with sqldf  | 1.43 / 1.25 |
| Data frame with sqldf  | 1.30 / 1.25 |
sqldf is the fastest way to search through a large dataset.
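For example, a prefix lookup against the 3-gram model with sqldf looks roughly like this (table and column names follow the earlier sketches and are illustrative):

```r
library(sqldf)

# Most frequent 3-grams that start with the last two words typed.
prefix <- "at the "   # note the trailing space
hits <- sqldf(sprintf(
  "SELECT term, freq FROM final_3gram WHERE term LIKE '%s%%' ORDER BY freq DESC LIMIT 5",
  prefix))
```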
1) Straight word input to (n+1)-gram model search
If 1 word is entered, search the 2-gram model file
If 2 words are entered, search the 3-gram model file
If 3 words are entered, search the 4-gram model file
If more than 3 words are entered, use the backoff algorithm
2) Backoff Algorithm
Count the words entered
Process the words entered and determine the last word, the last 2 words, and the last 3 words
Using the last 3 words, search the 4-ngram model file
Using the last 2 words, search the 3-ngram model file
Using the last word, search the 2-ngram model file
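Putting both rules together, the prediction step can be sketched as below; search_ngram(prefix, n) is a hypothetical helper that stands for the sqldf lookup against the n-gram model shown earlier:

```r
# Pick the n-gram model one order above the number of words typed,
# and back off to lower orders when more than three words are entered.
predict_next <- function(input) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  n <- length(words)
  if (n == 0) return(data.frame())                    # nothing typed yet
  if (n <= 3) {
    return(search_ngram(paste(words, collapse = " "), n + 1))
  }
  for (k in 3:1) {                                    # last 3, then 2, then 1 word
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- search_ngram(prefix, k + 1)
    if (nrow(hits) > 0) return(hits)
  }
  hits                                                # empty result if nothing matched
}
```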
Link to Word Predicting Presentation
Link to Shiny Word Predicting App