Data Science Capstone – Milestone Report
Executive Summary
This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone is to create a predictive text model, using a large corpus of text documents as training data and natural language processing techniques to perform the analysis and build the model. This report describes the major features of the data through exploratory data analysis and summarizes the findings. To get started, I downloaded the Coursera SwiftKey dataset. I also outline my plans for creating the predictive model(s) and a Shiny app as the final data product.
All the code is attached in the Appendix.
Files used:
```
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
```
File details and stats
Let’s have a look at the files. I determined the number of lines, characters, and words for each of the three datasets (blogs, news, and Twitter). I also calculated some basic statistics on the number of words per line (WPL).
File | Lines | LinesNEmpty | Chars | CharsNWhite | TotalWords | WPL_Min | WPL_Mean | WPL_Max
---|---|---|---|---|---|---|---|---
blogs | 899288 | 899288 | 206824382 | 170389539 | 37570839 | 0 | 41.75107 | 6726
news | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 | 1 | 34.40997 | 1796
twitter | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 | 1 | 12.75065 | 47
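For reference, the Lines, LinesNEmpty, Chars and CharsNWhite columns come directly from stringi’s stri_stats_general(), as used in Appendix A. A minimal sketch on a toy vector:

```r
library(stringi)

# stri_stats_general() counts lines, non-empty lines, characters,
# and characters excluding whitespace
stri_stats_general(c("Hello world", "", "Second line"))
# returns a named vector: Lines = 3, LinesNEmpty = 2, Chars = 22, CharsNWhite = 20
```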
Sample the data
The data files are very large, so I take a 1% sample of each file and save it to the RDS file sample.rds
to save space. This sample can then be loaded to start the analysis.
Preprocessing the data
After loading the sample RDS file, I created a corpus and started to analyse the data with the tm
library.
There is a lot of information in the data that is not needed or useful. To clean it up, I removed all numbers, converted the text to lowercase, and removed punctuation and stopwords (in this case English). After that, I performed stemming: a stem is the base form to which affixes can be attached. For example, wait, waits, waited, and waiting all reduce to the common stem wait. Because the cleaning steps removed many characters and left extra whitespace, I stripped that whitespace as well.
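As a quick illustration of the stemming step, here is a minimal sketch using wordStem() from the SnowballC package (the same Porter stemmer that tm’s stemDocument relies on):

```r
library(SnowballC)

# Porter stemming reduces inflected forms to a common stem
wordStem(c("wait", "waits", "waited", "waiting"))
# [1] "wait" "wait" "wait" "wait"
```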
N-gram Tokenization
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” and size 3 is a “trigram”.
The RWeka package has been used to develop the N-gram tokenizers in order to create the unigrams, bigrams and trigrams.
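To make the idea concrete, here is a minimal sketch of what the RWeka tokenizer returns for a toy sentence, mirroring the tokenizer functions defined in Appendix D:

```r
library(RWeka)

# Bigrams of a toy sentence: every contiguous pair of words becomes one token
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# [1] "thanks for" "for the"    "the follow"
```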
Exploratory Analysis
Now I’m ready to perform exploratory analysis on the data. It is helpful to find the most frequently occurring terms based on the unigrams, bigrams and trigrams.
Unigrams
term | freq
---|---
will | 3124
said | 3048
just | 3019
one | 2974
like | 2953
get | 2949
time | 2598
can | 2465
day | 2277
year | 2127
Bigrams
term | freq
---|---
right now | 270
last year | 220
look like | 217
cant wait | 193
new york | 186
last night | 167
year ago | 162
look forward | 154
feel like | 150
high school | 150
Trigrams
term | freq
---|---
happi mother day | 46
cant wait see | 43
new york citi | 30
happi new year | 28
let us know | 21
look forward see | 20
cinco de mayo | 17
two year ago | 17
new york time | 16
im pretti sure | 14
Development Plan
The next steps of this capstone project are to create predictive model(s) based on the N-gram tokenization and deploy them as a data product. Here are my steps (a minimal sketch of the prediction idea follows the list):
- Establish the predictive model(s) using the N-gram tokenizations.
- Optimize the code for faster processing.
- Develop the data product, a Shiny app, that predicts the next word based on user input.
- Create a Slide Deck for pitching my algorithm and Shiny App.
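As a rough illustration of the intended approach, here is a minimal sketch of a frequency-based backoff lookup over the tri_df and bi_df data frames built in the exploratory analysis (both are sorted by decreasing frequency). The function name predict_next_word is hypothetical, the input is assumed to be cleaned and stemmed the same way as the corpus, and the final model will need proper smoothing and backoff weighting:

```r
# Minimal sketch (not the final model): look up the last one or two words of
# the input in the trigram/bigram frequency tables, backing off from trigrams
# to bigrams. predict_next_word is an illustrative, hypothetical function.
predict_next_word <- function(input, tri_df, bi_df) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)

  # Try the trigram table: match the last two words, return the third
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- tri_df[startsWith(as.character(tri_df$term), paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(substring(as.character(hits$term[1]), nchar(prefix) + 2))
  }

  # Back off to the bigram table: match the last word, return the second
  prefix <- words[n]
  hits <- bi_df[startsWith(as.character(bi_df$term), paste0(prefix, " ")), ]
  if (nrow(hits) > 0)
    return(substring(as.character(hits$term[1]), nchar(prefix) + 2))

  NA_character_  # no match found in either table
}

# Example call (the result depends on the sampled corpus):
# predict_next_word("happi new", tri_df, bi_df)
```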
Appendix
Appendix – Load libraries, doParallel and files
```r
# Loading libraries
library(doParallel)
library(tm)
library(stringi)
library(RWeka)
library(dplyr)
library(kableExtra)
library(SnowballC)
library(ggplot2)

# Setting up doParallel
set.seed(613)
n_cores <- detectCores() - 2
registerDoParallel(n_cores, cores = n_cores)

# Show the files used
directory_us <- file.path(".", "data", "final", "en_US/")
dir(directory_us)
```
Appendix A – File details and stats
```r
# Loading files and showing summaries
blogs_con <- file(paste0(directory_us, "/en_US.blogs.txt"), "r")
blogs <- readLines(blogs_con, encoding = "UTF-8", skipNul = TRUE)
close(blogs_con)

news_con <- file(paste0(directory_us, "/en_US.news.txt"), "r")
news <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)

twitter_con <- file(paste0(directory_us, "/en_US.twitter.txt"), "r")
twitter <- readLines(twitter_con, encoding = "UTF-8", skipNul = TRUE)
close(twitter_con)

# Create stats of the files
WPL <- sapply(list(blogs, news, twitter),
              function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WPL) <- c('WPL_Min', 'WPL_Mean', 'WPL_Max')
rawstats <- data.frame(
  File = c("blogs", "news", "twitter"),
  t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
          TotalWords = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ],
          WPL))
)

# Show stats in a table
kable(rawstats) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
Appendix B – Sample the data
```r
# Sample 1% of each dataset
set.seed(613)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
saveRDS(data.sample, 'sample.rds')

# Cleaning up the objects we do not use anymore
rm(blogs, blogs_con, data.sample, directory_us, news, news_con,
   rawstats, twitter, twitter_con, WPL)
```
Appendix C – Preprocessing the data
```r
# Load the RDS file
data <- readRDS("sample.rds")

# Create a corpus
docs <- VCorpus(VectorSource(data))

# Remove data we do not need
# (content_transformer keeps the corpus structure intact when applying
#  a base function such as tolower)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))

# Do stemming
docs <- tm_map(docs, stemDocument)

# Strip whitespace
docs <- tm_map(docs, stripWhitespace)
```
Appendix D – N-gram Tokenization
```r
# Create tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Convert the corpus back to plain text documents
docs <- tm_map(docs, PlainTextDocument)
```
Appendix E – Exploratory Analysis
```r
# Create TermDocumentMatrix with the tokenizers and remove sparse terms
tdm_freq1 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = unigram)), 0.9999)
tdm_freq2 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = bigram)), 0.9999)
tdm_freq3 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = trigram)), 0.9999)

# Create frequencies
uni_freq <- sort(rowSums(as.matrix(tdm_freq1)), decreasing = TRUE)
bi_freq  <- sort(rowSums(as.matrix(tdm_freq2)), decreasing = TRUE)
tri_freq <- sort(rowSums(as.matrix(tdm_freq3)), decreasing = TRUE)

# Create data frames
uni_df <- data.frame(term = names(uni_freq), freq = uni_freq)
bi_df  <- data.frame(term = names(bi_freq),  freq = bi_freq)
tri_df <- data.frame(term = names(tri_freq), freq = tri_freq)

# Show the top 10 unigrams
kable(head(uni_df, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# Plot the top 20 unigrams
head(uni_df, 20) %>%
  ggplot(aes(reorder(term, -freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Unigrams") + xlab("Unigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# Show the top 10 bigrams
kable(head(bi_df, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# Plot the top 20 bigrams
head(bi_df, 20) %>%
  ggplot(aes(reorder(term, -freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Bigrams") + xlab("Bigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# Show the top 10 trigrams
kable(head(tri_df, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# Plot the top 20 trigrams
head(tri_df, 20) %>%
  ggplot(aes(reorder(term, -freq), freq)) +
  geom_bar(stat = "identity") +
  ggtitle("20 Most Frequent Trigrams") + xlab("Trigrams") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))
```