Text Mining with R on Vikings episode scripts



I'm a huge fan of the TV show Vikings. I thought it would be cool to mine the show's scripts to figure out which terms are used most often and how the most frequent terms and the episodes correlate.

For those who do not know the series, here is some information about Vikings.

Getting the data

Before we can start we need to get the data. I have found a website with a lot of TV and movie scripts. The scripts are embedded in HTML, so we must extract them. R has a package for this, rvest, and we will use it to get our data.

rvest is a library that makes it easy to harvest (scrape) web pages.

The script is written so that we can harvest a different TV show just by changing one setting.

Setting our variables, such as which TV show we want, the download directory, the base URLs, etc.

There are 4 seasons of Vikings with 10 episodes each. I do not want to scrape them piece by piece, so we first collect all episode URLs before downloading the scripts. Before that, we need to explore the website's source to find out which nodes to select.

Go to the URL you want to scrape, in my case this one. In Chrome, right-click the first episode of season 1 and select Inspect. You will see the following:

[Screenshot: Chrome Inspect]

As you can see, right after the <h3> tag comes the first link (href) for s01e01, with class="season-episode-title". This class is what we need to select as our node.

Show the structure of all_url_seasons.

As we can see in the structure output, we have 40 episode URLs. Now that we have all the variables and episode URLs, we can harvest the scripts and save them to separate text files for our text mining.
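A minimal sketch of this selection step. The HTML below is an inline stand-in for the real page, and the paths and the example.com base URL are hypothetical; only the class name season-episode-title comes from the inspection above.

```r
library(rvest)

# Inline stand-in for the season overview page (hypothetical markup and paths).
page <- read_html(paste0(
  "<html><body>",
  "<h3>Season 1</h3>",
  "<a class=\"season-episode-title\" href=\"/scripts/s01e01\">Episode 1</a>",
  "<a class=\"season-episode-title\" href=\"/scripts/s01e02\">Episode 2</a>",
  "<a class=\"other-link\" href=\"/about\">About</a>",
  "</body></html>"
))

# Select only the links that carry the episode class and read their href.
# (Newer rvest versions also offer html_elements() for the same job.)
episode_links <- html_nodes(page, ".season-episode-title")
episode_paths <- html_attr(episode_links, "href")

# Turn the relative paths into full URLs (base URL is an assumption).
base_url <- "https://example.com"
all_url_seasons <- paste0(base_url, episode_paths)
print(all_url_seasons)
```

The "About" link is skipped because it does not have the episode class, which is why selecting on the class is more robust than grabbing every link on the page.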
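Saving each script to its own text file could look roughly like this. The script texts are made up and the directory name is hypothetical; in the real run the text would come from the scraped pages.

```r
# Hypothetical stand-in for the scraped data: episode ids mapped to script text.
scripts <- c(
  s01e01 = "Ragnar looks out over the sea.",
  s01e02 = "The ships sail west."
)

# Create a download directory (a temp dir here, so the sketch leaves no mess).
download_dir <- file.path(tempdir(), "vikings_scripts")
dir.create(download_dir, showWarnings = FALSE, recursive = TRUE)

# Write one text file per episode, named after the episode id.
for (episode in names(scripts)) {
  out_file <- file.path(download_dir, paste0(episode, ".txt"))
  writeLines(scripts[[episode]], out_file)
}

saved_files <- list.files(download_dir, pattern = "\\.txt$")
print(saved_files)
```

Having one file per episode makes it easy to load the whole directory as a corpus later on.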

Starting with Text Mining

Now that we have all Vikings episode scripts, we can do some text mining with the tm library.

There is a lot of information in the scripts that we do not need and that is not useful for text mining, so we need to clean it up. We remove all numbers, convert the text to lowercase, and remove punctuation and stopwords (English ones in this case).

Now we perform stemming. A stem is the form of a word to which affixes can be attached. For example, wait, waits, waited and waiting all share the stem wait.

Removing all those characters left a lot of extra whitespace, so we strip that as well.

Let's have a look at our first document.

I have commented it out because WordPress has problems with editing the post.

We are done preprocessing the data, so we turn the documents back into plain text documents.

Create a Term Document Matrix (tdm) of our documents, which reflects the number of times each term in the corpus is found in each of the documents, and add some readable column names.

Do the same for a Document Term Matrix (this is the transpose of a tdm).

Now that we have done that, we can ask questions such as: what are the most frequent terms in the scripts, and what are the associations between terms?
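A small sketch of these cleaning steps with tm, run on a tiny made-up corpus instead of the real scripts.

```r
library(tm)

# Tiny in-memory corpus standing in for the episode scripts (made-up text).
docs <- VCorpus(VectorSource(c(
  "Ragnar sailed WEST in 793, and the Earl was not pleased!",
  "We will raid again: 3 ships, 100 men."
)))

docs <- tm_map(docs, removeNumbers)                       # drop digits
docs <- tm_map(docs, content_transformer(tolower))        # lowercase everything
docs <- tm_map(docs, removePunctuation)                   # drop punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop English stopwords

first_doc <- content(docs[[1]])
print(first_doc)
```

tolower is wrapped in content_transformer() because it is a plain base R function; the wrapper makes sure the result is still a proper tm document.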
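The wait example can be checked directly with the SnowballC package, which tm's stemDocument builds on. This is only a sketch of the stemming idea on single words, not the full corpus step.

```r
library(SnowballC)

# The four forms from the example above.
words <- c("wait", "waits", "waited", "waiting")

# Reduce each word to its stem with the English Snowball stemmer.
stems <- wordStem(words, language = "english")
print(stems)
```

All four forms collapse to the same stem, so they will be counted as one term later on.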
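As a quick illustration, here is stripWhitespace applied to a made-up document that has the kind of gaps left behind by the earlier removals.

```r
library(tm)

# Made-up text with runs of spaces where words and numbers were removed.
docs <- VCorpus(VectorSource("ragnar sailed west    earl   pleased"))

# Collapse every run of whitespace into a single space.
docs <- tm_map(docs, stripWhitespace)

stripped <- content(docs[[1]])
print(stripped)
```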
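A sketch of building the matrix and naming its columns, using three made-up one-line documents. The episode labels are only illustrative.

```r
library(tm)

# Three tiny made-up documents standing in for episode scripts.
docs <- VCorpus(VectorSource(c(
  "ragnar will sail west",
  "ragnar will fight",
  "floki builds ships"
)))

# Rows are terms, columns are documents, cells are counts.
tdm <- TermDocumentMatrix(docs)

# Replace the default document ids with readable episode labels.
colnames(tdm) <- c("s01e01", "s01e02", "s01e03")
inspect(tdm)

# Dense copy, only to look up individual cells in this tiny example.
tdm_matrix <- as.matrix(tdm)
```

Note that by default tm only keeps terms of at least three characters, so very short words never show up as rows.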
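The transpose relationship can be verified on a tiny made-up corpus: building the dtm directly and transposing the tdm should give the same counts.

```r
library(tm)

docs <- VCorpus(VectorSource(c(
  "ragnar will sail west",
  "floki builds ships"
)))

tdm <- TermDocumentMatrix(docs)   # terms x documents
dtm <- DocumentTermMatrix(docs)   # documents x terms

# t() on a tdm yields a dtm with the same counts.
tdm_transposed <- t(tdm)
same_counts <- all(as.matrix(tdm_transposed) == as.matrix(dtm))

print(dim(tdm))
print(dim(dtm))
print(same_counts)
```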

Term frequency

Let's look at the most frequent terms first and show the top 20.

Plotting the term frequencies

Add it to a data frame so we can plot it, and show the top 20.
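One way to get the most frequent terms is to sum each term's row in the matrix and sort the totals. The sketch below does that on a made-up corpus; the variable names are my own.

```r
library(tm)

docs <- VCorpus(VectorSource(c(
  "ragnar will sail west",
  "ragnar will fight",
  "we will win"
)))
tdm <- TermDocumentMatrix(docs)

# Total count per term across all documents, highest first.
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# head() simply returns everything when there are fewer than 20 terms.
print(head(term_freq, 20))
```

as.matrix() is fine for a corpus of this size; for a very large one a sparse-aware row sum (for example from the slam package) would use less memory.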

Let's plot it.
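A sketch of such a plot with ggplot2. The counts are made up, and the bar chart layout is an assumption about how the original figure looked.

```r
library(ggplot2)

# Made-up counts standing in for the real term frequencies.
term_freq <- c(will = 30, ragnar = 25, king = 18, know = 12, come = 9)

freq_df <- data.frame(
  term = names(term_freq),
  freq = as.numeric(term_freq),
  stringsAsFactors = FALSE
)

# Keep the 20 most frequent terms (all five here).
top_terms <- head(freq_df[order(-freq_df$freq), ], 20)

# Horizontal bars, ordered by frequency so the top term is at the top.
p <- ggplot(top_terms, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency")
print(p)
```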

[Plot: top 20 most frequent terms]

The most frequent term is will, followed by ragnar, the main character.

Further Analysis

As we saw in our first look at the tdm, we have a lot of sparse terms in our documents (90%). That is a lot, so let's remove these.

That is 87% less sparsity. Let's see how many terms we had and how many we have now.

Hmm, from 5608 terms down to only 36. We inspect the first 10 terms of the first 6 documents.

Let's visualize these most common terms in a heatmap with ggplot. ggplot cannot work with the sparse matrix that backs the tdm, so we first convert tdm.common to a normal (dense) matrix in order to produce the visualisation.
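A sketch of removing sparse terms with tm's removeSparseTerms on a made-up corpus. The threshold of 0.1 is an assumption for illustration, not necessarily the value used for the real scripts.

```r
library(tm)

docs <- VCorpus(VectorSource(c(
  "ragnar will sail west",
  "ragnar will fight",
  "ragnar will rule",
  "floki will build ships"
)))
tdm <- TermDocumentMatrix(docs)

# Keep only terms that are missing from at most 10% of the documents.
# With four documents, that means the term must occur in all of them.
tdm.common <- removeSparseTerms(tdm, 0.1)

print(nTerms(tdm))         # number of terms before
print(Terms(tdm.common))   # terms that survive
```

Here ragnar appears in three of the four documents and is still dropped, which shows how strict a low threshold is.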

Make the heatmap visualization.
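A sketch of the heatmap step, starting from a made-up dense matrix. ggplot2 wants a data frame with one row per cell, so the matrix is reshaped to long form first; the base R reshaping used here is my own choice, not necessarily what the original code did.

```r
library(ggplot2)

# Made-up dense term-by-episode counts standing in for the converted tdm.
tdm.dense <- matrix(
  c(12, 9, 0,
     7, 5, 1,
     3, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("will", "ragnar", "king"),
                  Docs  = c("s01e01", "s01e02", "s03e90"))
)

# Reshape to long form: one row per (term, episode) pair.
tdm.long <- as.data.frame(as.table(tdm.dense), responseName = "count")

# One tile per pair, shaded by how often the term occurs in that episode.
heat <- ggplot(tdm.long, aes(x = Docs, y = Terms, fill = count)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(x = "Episode", y = "Term", fill = "Count")
print(heat)
```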

[Plot: heatmap of the most common terms per episode]

Okay, what do we have here? We can now check which common terms are used in which season and episode of Vikings. The term will is commonly used in most episodes, except in s03e90. That's a strange episode, isn't it? (I looked it up: it is not really an episode, it's a documentary about Vikings 🙂 )

Now we plot a correlogram of the episodes.

Note: A correlogram is a graph of a correlation matrix. It is very useful for highlighting the most correlated variables in a data table. In this plot, the correlation coefficients are colored according to their value. The correlation matrix can also be reordered according to the degree of association between variables.

[Plot: correlogram of the episodes]

Transpose tdm.dense so we can plot a correlogram of the terms.

[Plot: correlogram of the terms]

Well, we could do a lot more analysis, mining and visualisation, but this is it for now.
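A sketch of the idea on a made-up matrix. The correlation matrix comes from base R's cor(); drawing it with the corrplot package is an assumption on my part, so that call is only run when the package is installed.

```r
# Made-up dense term-by-episode counts (rows = terms, columns = episodes).
tdm.dense <- matrix(
  c(12, 9, 0,
     7, 5, 1,
     3, 0, 4,
     6, 4, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("will", "ragnar", "king", "know"),
                  Docs  = c("s01e01", "s01e02", "s03e90"))
)

# cor() correlates columns, so this gives episode-by-episode correlations.
episode_cor <- cor(tdm.dense)
print(round(episode_cor, 2))

# Draw the correlogram if corrplot is available (assumed plotting package).
# order = "hclust" reorders episodes so that similar ones sit together.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(episode_cor, method = "circle", order = "hclust")
}
```

In this toy data the two regular episodes correlate strongly with each other and negatively with s03e90, which is the kind of outlier a correlogram makes easy to spot.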
