Data Science Capstone – Milestone Report


Executive Summary

This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This Milestone Report summarizes my exploratory data analysis and describes the major features of the data. To get started, I downloaded the Coursera SwiftKey dataset. I have also defined my plans for creating the predictive model(s) and a Shiny app as the data product.

All the code is attached in the Appendix.

Files used:

File details and stats

Let’s have a look at the files. I determined the number of lines, characters, and words for each of the 3 datasets (Blogs, News and Twitter). I also calculated some basic statistics on the number of words per line (WPL).

File     Lines    LinesNEmpty  Chars      CharsNWhite  TotalWords  WPL_Min  WPL_Mean  WPL_Max
blogs    899288   899288       206824382  170389539    37570839    0        41.75107  6726
news     1010242  1010242      203223154  169860866    34494539    1        34.40997  1796
twitter  2360148  2360148      162096241  134082806    30451170    1        12.75065  47
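
The report's own code is in R and is not reproduced here. As a rough illustration of how such per-file statistics could be computed, here is a minimal Python sketch. The function name `file_stats` and the exact column definitions (for example, counting characters without the trailing newline) are assumptions, so the numbers it produces need not match the table above exactly.

```python
def file_stats(lines):
    """Return basic size statistics for an iterable of text lines.

    Column meanings are assumed from the table headers:
    LinesNEmpty = non-empty lines, CharsNWhite = non-whitespace characters,
    WPL = words per line (whitespace-separated tokens).
    """
    n_lines = n_nonempty = n_chars = n_nonwhite = total_words = 0
    words_per_line = []
    for raw in lines:
        line = raw.rstrip("\n")          # do not count the line terminator
        n_lines += 1
        if line.strip():
            n_nonempty += 1
        n_chars += len(line)
        n_nonwhite += sum(1 for ch in line if not ch.isspace())
        words = len(line.split())
        total_words += words
        words_per_line.append(words)
    return {
        "Lines": n_lines,
        "LinesNEmpty": n_nonempty,
        "Chars": n_chars,
        "CharsNWhite": n_nonwhite,
        "TotalWords": total_words,
        "WPL_Min": min(words_per_line) if words_per_line else 0,
        "WPL_Mean": total_words / n_lines if n_lines else 0.0,
        "WPL_Max": max(words_per_line) if words_per_line else 0,
    }


# Tiny in-memory example; a real run would iterate over an open file instead.
demo = ["the quick brown fox", "", "hello world"]
stats = file_stats(demo)
```

Because the function only needs an iterable of lines, it can be fed an open file handle directly, which avoids loading a whole dataset into memory.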

Sample the data

The data files are very large, so I take a 1% sample of each file and save it to the RDS file sample.rds to save space. The sample can then be loaded to start the analysis.
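
The sampling step can be sketched as follows. This is a Python illustration, not the R code from the appendix; the function name `sample_lines`, the seed value, and the choice of sampling without replacement are all assumptions.

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Draw a reproducible random sample of `fraction` of the lines.

    Sampling is without replacement; the seed is an arbitrary assumption
    that only serves to make the sample repeatable.
    """
    lines = list(lines)
    if not lines:
        return []
    k = max(1, round(len(lines) * fraction))
    return random.Random(seed).sample(lines, k)


# Synthetic stand-in for one of the text files.
corpus = [f"line {i}" for i in range(1000)]
sample = sample_lines(corpus)
```

Fixing the seed matters here: the later frequency tables depend on which lines end up in the sample, so an unseeded run would give slightly different counts each time.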

Preprocessing the data

After loading the sample RDS file, I created a corpus and started to analyse the data with the tm library.

The data contains a lot of information that I do not need and that is not useful. To clean it up, I removed all numbers, converted the text to lowercase, and removed punctuation and stopwords (English, in this case). After that, I performed stemming. A stem is a form to which affixes can be attached; for example, wait, waits, waited and waiting all share the stem wait. Removing all these characters left a lot of extra whitespace, which I removed as well.
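
The cleaning steps described above can be sketched in a few lines. This is a Python illustration only, not the tm pipeline used for the report. The stopword list is a tiny placeholder and `toy_stem` is a deliberately naive suffix stripper, both assumptions made for the example; a real stemmer handles far more cases.

```python
import re
import string

# Tiny placeholder stopword list (assumption); real English lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "was", "to", "of", "in", "for", "i"}

# Translation table that deletes ASCII punctuation only.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def toy_stem(word):
    """Naive suffix stripper for illustration; this is not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_line(text):
    """Lowercase, drop numbers, punctuation and stopwords, then stem each token."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.translate(_PUNCT_TABLE)     # remove ASCII punctuation
    tokens = text.split()                   # split() also collapses extra whitespace
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


tokens = clean_line("I waited 3 hours, waiting for the waits!")
```

Note that removing punctuation before stopwords turns a word like "can't" into "cant", which a stopword list containing only "can't" would no longer match. That ordering effect is visible in the n-gram tables further down.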

N-gram Tokenization

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

An n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” and size 3 is a “trigram”.

The RWeka package was used to build the n-gram tokenizers that create the unigrams, bigrams and trigrams.
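
To make the definition concrete, here is a minimal Python sketch of word-level n-gram extraction. It is not the RWeka tokenizer used for the report; the function name `ngrams` and the choice to join words with a single space are assumptions for the example.

```python
def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["cant", "wait", "see", "you"]
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
```

A line with fewer than n tokens simply yields no n-grams, so short lines (common in the Twitter data) contribute to the unigram counts but not necessarily to the trigram counts.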

Exploratory Analysis

Now I’m ready to perform exploratory analysis on the data. It is helpful to find the most frequently occurring terms among the unigrams, bigrams and trigrams.


term  freq
will  3124
said  3048
just  3019
one   2974
like  2953
get   2949
time  2598
can   2465
day   2277
year  2127

[Figure: plot of the most frequent unigrams]


term          freq
right now     270
last year     220
look like     217
cant wait     193
new york      186
last night    167
year ago      162
look forward  154
feel like     150
high school   150

[Figure: plot of the most frequent bigrams]


term              freq
happi mother day  46
cant wait see     43
new york citi     30
happi new year    28
let us know       21
look forward see  20
cinco de mayo     17
two year ago      17
new york time     16
im pretti sure    14

[Figure: plot of the most frequent trigrams]
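
Frequency tables like the ones above come down to counting n-grams and keeping the top entries. The following Python sketch shows the idea on a toy input; it is not the R code from the appendix, and the name `top_ngrams` and the sample token lists are made up for illustration.

```python
from collections import Counter


def top_ngrams(token_lines, n, k=10):
    """Count word n-grams over pre-cleaned token lists and return the k most frequent."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Toy input: each inner list stands for one cleaned, stemmed line.
toy_lines = [
    ["new", "york", "citi"],
    ["new", "york", "time"],
    ["happi", "new", "year"],
]
top_unigram = top_ngrams(toy_lines, 1, k=1)
top_bigram = top_ngrams(toy_lines, 2, k=1)
```

Counting per line, rather than over one concatenated token stream, keeps n-grams from spanning the boundary between two unrelated lines.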

Development Plan

The next steps of this capstone project are to create predictive model(s) based on the n-gram tokenization and to deploy them as a data product. Here are my steps:

  • Establish the predictive model(s) by using N-gram Tokenizations.
  • Optimize the code for faster processing.
  • Develop the data product, a Shiny app, that predicts the next word based on user input.
  • Create a Slide Deck for pitching my algorithm and Shiny App.
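
The model itself is still to be built, so the following is only one possible shape for it, written as a Python sketch rather than in R. It uses a plain longest-match fallback: try the two preceding words first, then fall back to the last word alone. This is not a smoothed backoff model, and the names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(token_lines, max_n=3):
    """Map each context (tuple of 1 to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next(model, words, max_n=3):
    """Return the most frequent next word, trying the longest context first."""
    for size in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in model:               # membership test does not add keys
            return model[context].most_common(1)[0][0]
    return None                            # no known context at all


# Toy training data: each inner list stands for one cleaned, stemmed line.
training = [
    ["new", "york", "citi"],
    ["new", "york", "time"],
    ["new", "york", "citi"],
    ["happi", "new", "year"],
]
model = build_model(training)
```

With this toy data, `predict_next(model, ["new", "york"])` uses the two-word context, while an unseen pair such as `["old", "new"]` falls back to the single word "new". User input would need the same cleaning and stemming as the training data before lookup.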


Appendix – Load libraries, doParallel and files

Appendix A – File details and stats

Appendix B – Sample the data

Appendix C – Preprocessing the data

Appendix D – N-gram Tokenization

Appendix E – Exploratory Analysis
