Titanic – Machine Learning from Disaster (Part 1)


In the Kaggle challenge Titanic – Machine Learning from Disaster, you need to predict which kinds of people were likely to survive the disaster. In particular, Kaggle asks you to apply the tools of machine learning to predict which passengers survived the tragedy.

I’ve split this up into two separate parts.

Part 1 – Data Exploration and basic Model Building
Part 2 – Creating own variables

Data Exploration

I’ve downloaded the train and test data from Kaggle. On that page you can also find the variable descriptions.

Import the training and testing set into R.
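A minimal sketch of the import, assuming the Kaggle files are saved as train.csv and test.csv in the working directory (the file names and paths are my assumption):

```r
# Read the Kaggle data; stringsAsFactors = TRUE keeps Sex, Embarked, etc. as factors
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv",  stringsAsFactors = TRUE)
```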

Let’s have a look at the data.
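One way to inspect both sets, using the train/test names assumed above:

```r
str(train)   # 891 obs. of 12 variables
str(test)    # 418 obs. of 11 variables
```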

The training set has 891 observations and 12 variables, and the testing set has 418 observations and 11 variables. The training set has 1 extra variable. Let’s check which one the test set is missing. I know we could just see it in a very small dataset like this, but with larger datasets we want to compare them programmatically.
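One possible way to compare the two sets:

```r
# Which column is in train but not in test?
setdiff(names(train), names(test))
# [1] "Survived"
```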

As we can see, we are missing the Survived column in the test set. That is correct, because that’s our challenge: we must predict it by building a model.

Let’s look deeper into the training set and check how many passengers survived versus how many did not make it.
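A one-liner sketch of the counts:

```r
# Counts of died (0) vs survived (1)
table(train$Survived)
```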

Hmm okay, of the 891 passengers only 342 survived. Let’s also check the proportions.
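The same table as proportions:

```r
# Proportions instead of raw counts
prop.table(table(train$Survived))
```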

A little more than one-third of the passengers survived the disaster. Now let’s see whether there is a difference in survival between males and females.
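One way to sketch this is a two-way table of Sex against Survived, with row-wise proportions:

```r
table(train$Sex, train$Survived)
# Row proportions: survival rate within each sex
prop.table(table(train$Sex, train$Survived), margin = 1)
```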

As we can see, most of the females survived and most of the males did not make it.

Model Building

After doing some exploratory analysis of the data, let’s do some first prediction before getting deeper into the data.

First prediction – All Female Survived

Create a copy of test called test_female, initialize a Survived column to 0, and set Survived to 1 if Sex equals “female”.
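A sketch of those three steps:

```r
# Copy the test set and initialise everyone as not survived
test_female <- test
test_female$Survived <- 0
# Predict that all female passengers survive
test_female$Survived[test_female$Sex == "female"] <- 1
```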

Create a data frame with two columns, PassengerId and Survived, and write the solution to a csv file.
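A possible sketch; the output file name all_female.csv is my assumption:

```r
my_solution <- data.frame(PassengerId = test_female$PassengerId,
                          Survived    = test_female$Survived)
# Kaggle expects exactly these two columns, without row names
write.csv(my_solution, file = "all_female.csv", row.names = FALSE)
```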

That’s our first submission to Kaggle and it’s good for a score of 0.76555. That’s not so bad, but we want more!! 🙂

Clean up the dataset

Now we need to clean the dataset before creating our models. Note that it is important to explore the data so that we understand which elements need to be cleaned.
For example, we have noticed that there are missing values in the data set, especially in the Age column of the training set. Let’s show which columns have missing values in the training and test set.
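One quick way to sketch this check:

```r
# Number of NAs per column in each set
colSums(is.na(train))
colSums(is.na(test))
```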

As we can see, we have missing values in Age in the training set, and in Age and Fare in the test set.

To tackle the missing values, I’m going to predict them using the full data set. First we need to combine the test and training sets.
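A sketch of the combination; test first needs a Survived column so the column names match:

```r
# test has no Survived column, so add it as NA before stacking
test$Survived <- NA
full <- rbind(train, test)   # 891 + 418 = 1309 rows
```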

First we tackle the missing Fare, because it is only one value. Let’s see in which row it’s missing.
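One way to find the row:

```r
# Row index of the missing Fare in the combined set
which(is.na(full$Fare))
# [1] 1044
```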

As we can see, the passenger in row 1044 has an NA Fare value. Let’s replace it with the median fare value.
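A sketch of the replacement:

```r
# Impute the single missing Fare with the median of all known fares
full$Fare[1044] <- median(full$Fare, na.rm = TRUE)
```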

How do we fill in the missing Age values? We predict a passenger’s Age from the other variables using a decision tree model.
This time we use method = “anova”, since we are predicting a continuous variable.
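A sketch of the Age imputation; my choice of predictors is an assumption (the text only says “the other variables”):

```r
library(rpart)

# Fit an anova tree on the rows where Age is known
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
                       data = full[!is.na(full$Age), ], method = "anova")
# Fill the missing Ages with the tree's predictions
full$Age[is.na(full$Age)] <- predict(predicted_age, full[is.na(full$Age), ])
```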

Since we know that the training set has 891 observations and the test set 418, we can split the data back into a train set and a test set.
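The split is then a simple row slice; the names train2/test2 are my assumption:

```r
train2 <- full[1:891, ]      # original training rows
test2  <- full[892:1309, ]   # original test rows
```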

Build a Decision Tree with rpart

Build the decision tree with rpart to predict Survived with the variables Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.
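A sketch of the model, using the train2 set assumed above:

```r
# Classification tree on the seven predictors named in the text
my_dt1 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                data = train2, method = "class")
```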

Load the packages needed to create a fancified visualization of your tree.

Visualize the decision tree using fancyRpartPlot from the rattle package.
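A sketch of both steps together:

```r
# fancyRpartPlot lives in rattle; the other two supply the plotting machinery
library(rattle)
library(rpart.plot)
library(RColorBrewer)

fancyRpartPlot(my_dt1)
```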

[Plot: decision tree my_dt1]

From the top node we can see that it votes 0, so at this level everyone would die. Below that we see that 62% of passengers die while 38% survive (the majority dies, which is why the node votes that everyone dies). Going down to the male/female split, 81% of males and 26% of females die, while 19% of males and 74% of females survive; the proportions exactly match those we found earlier. Let’s look at those proportions again, rounded to two decimals.
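The earlier proportion table, rounded, can be sketched as:

```r
# Row-wise survival proportions per sex, rounded to two decimals
round(prop.table(table(train$Sex, train$Survived), margin = 1), 2)
```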

Those are the same numbers 🙂

Make the prediction using the test2 set.

Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions.

Check that your data frame has 418 entries.

Write your solution to a csv file with the name my_dt1.csv.
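The four steps above (predict, build the data frame, check the row count, write the file) can be sketched as:

```r
# Predict classes on the test set and build the submission file
my_prediction <- predict(my_dt1, newdata = test2, type = "class")
my_solution <- data.frame(PassengerId = test2$PassengerId,
                          Survived    = my_prediction)
nrow(my_solution)   # should be 418
write.csv(my_solution, file = "my_dt1.csv", row.names = FALSE)
```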

This gives us a score of 0.77512, a little better than our first submission.

Create a new decision tree my_dt2 with some control parameters: cp, the complexity threshold at which splitting of the decision tree stops, and minsplit, the minimum number of observations a node must contain before a split is attempted.
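A sketch of such a tree; the specific values minsplit = 50 and cp = 0 are my assumption, since the original values are not shown:

```r
# Same formula as before, but with explicit growth controls
my_dt2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                data = train2, method = "class",
                control = rpart.control(minsplit = 50, cp = 0))
```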

Visualize your new decision tree.
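As before, with fancyRpartPlot from rattle:

```r
fancyRpartPlot(my_dt2)
```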

[Plot: decision tree my_dt2]

Make the prediction using the test2, create the two column dataset, check the amount of rows and save it to my_dt2.csv.
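A sketch of the same pipeline for the second model:

```r
my_prediction2 <- predict(my_dt2, newdata = test2, type = "class")
my_solution2 <- data.frame(PassengerId = test2$PassengerId,
                           Survived    = my_prediction2)
nrow(my_solution2)   # should be 418
write.csv(my_solution2, file = "my_dt2.csv", row.names = FALSE)
```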

This gives us a score of 0.74163. Okay, that is not an improvement.

In part two I will create my own variables to build a better model.