# Titanic – Machine Learning from Disaster (Part 1)

## Synopsis

In the Kaggle challenge Titanic – Machine Learning from Disaster, you analyze what kinds of people were likely to survive the disaster. In particular, you are asked to apply the tools of machine learning to predict which passengers survived the tragedy.

I’ve split this up into two separate parts.

- Part 1 – Data Exploration and basic Model Building
- Part 2 – Creating own variables

## Data Exploration

Import the training and testing sets into R.

Let’s have a look at the data.
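A minimal sketch of the import step. The helper name `load_titanic` and the file names `train.csv` and `test.csv` are my assumptions; adjust them to wherever you saved the Kaggle files.

```r
# Hypothetical helper: reads the two Kaggle files from a directory.
# Assumes they are named train.csv and test.csv.
load_titanic <- function(dir = ".") {
  list(
    train = read.csv(file.path(dir, "train.csv"), stringsAsFactors = FALSE),
    test  = read.csv(file.path(dir, "test.csv"), stringsAsFactors = FALSE)
  )
}

# Usage, once the files are in the working directory:
# titanic <- load_titanic()
# str(titanic$train)   # structure of the training set
# dim(titanic$test)    # rows and columns of the test set
```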

The training set has 891 observations and 12 variables, and the testing set has 418 observations and 11 variables. The training set has one extra variable, so let’s check which one the test set is missing. We could spot that by eye in a dataset this small, but with a larger one we would want to compare the column names programmatically.

As we can see, the test set is missing `Survived`. That is correct, because predicting it is our challenge: we must build a model that does so.

Let’s look deeper into the training set and check how many passengers survived versus how many did not.
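One way to compare the column names is `setdiff()`. The two data frames here are tiny made-up stand-ins for the real sets, just so the snippet runs on its own.

```r
# Toy stand-ins for the Kaggle data (made-up rows, not the real files).
train <- data.frame(PassengerId = 1:3,
                    Survived = c(0, 1, 1),
                    Sex = c("male", "female", "female"))
test <- data.frame(PassengerId = 4:5,
                   Sex = c("female", "male"))

# Columns that exist in train but not in test.
missing_in_test <- setdiff(names(train), names(test))
print(missing_in_test)
```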

Of the 891 passengers, only 342 survived. Let’s also check this as proportions.

A little more than one-third of the passengers survived the disaster. Now let’s see whether survival differs between males and females.

As we can see, most of the females survived and most of the males did not.
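The counts and proportions above can be produced with `table()` and `prop.table()`. This is a sketch on a small made-up data frame, so the numbers it prints are not the real Titanic figures.

```r
# Toy training data (made-up values, only the two columns we need).
train <- data.frame(
  Survived = c(0, 0, 1, 1, 0, 1, 0, 0),
  Sex = c("male", "male", "female", "female", "male", "female", "female", "male")
)

# Absolute counts of died (0) vs survived (1).
survived_counts <- table(train$Survived)

# The same as proportions of all passengers.
survived_props <- prop.table(survived_counts)

# Row-wise proportions: within each sex, who died and who survived.
by_sex <- prop.table(table(train$Sex, train$Survived), margin = 1)

print(survived_counts)
print(round(survived_props, 2))
print(round(by_sex, 2))
```

With `margin = 1` each row sums to 1, so the table reads as "of all females, this share survived".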

## Model Building

After some exploratory analysis, let’s make a first prediction before digging deeper into the data.

### First prediction – All Female Survived

Create a copy of `test` called `test_female`, initialize a `Survived` column to 0, and set `Survived` to 1 where `Sex` equals “female”.

Create a data frame with two columns, `PassengerId` and `Survived`, and write the solution to a CSV file.

That’s our first submission to Kaggle, and it’s good for a score of 0.76555. That’s not bad, but we want more! 🙂
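These steps could look roughly like this. The `test` data frame is a made-up stand-in, and the output file name `all_female.csv` is my own choice; I write it to a temporary directory here so the sketch does not clutter your working directory.

```r
# Toy stand-in for the Kaggle test set (made-up rows).
test <- data.frame(PassengerId = c(892, 893, 894),
                   Sex = c("male", "female", "male"),
                   stringsAsFactors = FALSE)

# Copy the test set and predict that nobody survives ...
test_female <- test
test_female$Survived <- 0

# ... except the female passengers.
test_female$Survived[test_female$Sex == "female"] <- 1

# Keep only the two columns Kaggle expects.
solution <- data.frame(PassengerId = test_female$PassengerId,
                       Survived = test_female$Survived)

# Hypothetical file name; row.names = FALSE avoids an extra index column.
out_file <- file.path(tempdir(), "all_female.csv")
write.csv(solution, file = out_file, row.names = FALSE)
```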

### Clean up the dataset

Now we need to clean the dataset before we can create our models. Note that it is important to explore the data first, so that we understand which elements need to be cleaned.
For example, we have noticed that there are missing values in the data set, especially in the `Age` column of the training set. Let’s show which columns have missing values in the training and test sets.

As we can see, `Age` has missing values in the training set, and both `Age` and `Fare` have missing values in the test set.
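A compact way to count missing values per column is `colSums(is.na(...))`. Again a sketch on made-up data:

```r
# Toy stand-ins with a few NAs (made-up values).
train <- data.frame(Age = c(22, NA, 26), Fare = c(7.25, 71.28, 7.92))
test <- data.frame(Age = c(NA, 47), Fare = c(NA, 7))

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_train <- colSums(is.na(train))
na_test <- colSums(is.na(test))

print(na_train)
print(na_test)
```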

To tackle the missing values, I’m going to predict them using the full data set. First we need to combine the training and test sets.

We tackle the missing `Fare` first, because only one value is missing. Let’s see in which row it’s missing.

As we can see, the passenger on row 1044 has an `NA` value for `Fare`. Let’s replace it with the median fare.
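Here is a sketch of combining the sets and imputing the fare. The name `all_data` is an assumption of mine, and the rows are made up, so the missing value sits in row 4 here rather than row 1044.

```r
# Toy stand-ins (made-up rows).
train <- data.frame(PassengerId = 1:3,
                    Survived = c(0, 1, 1),
                    Fare = c(7.25, 71.28, 8.05))
test <- data.frame(PassengerId = 4:6,
                   Fare = c(NA, 13.00, 26.55))

# The test set has no Survived column, so add one filled with NA
# to make the columns line up before stacking the two sets.
test$Survived <- NA
all_data <- rbind(train, test[, names(train)])

# Find the row with the missing fare ...
na_row <- which(is.na(all_data$Fare))
print(na_row)

# ... and replace it with the median of the known fares.
all_data$Fare[na_row] <- median(all_data$Fare, na.rm = TRUE)
```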

How do we fill in the missing `Age` values? We predict a passenger’s age from the other variables with a decision tree model.
This time we pass `method = "anova"`, since we are predicting a continuous variable.

We know that the training set has 891 observations and the test set 418, so we can split the combined data back into a training set and a test set.
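A sketch of the age imputation with `rpart`. The data is randomly generated and the predictor list is shortened to three columns, both purely for illustration; on the real data you would use whichever of the other variables you find useful.

```r
library(rpart)

# Made-up data: Age loosely depends on Pclass, with 30 values knocked out.
set.seed(1)
n <- 200
all_data <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare = round(runif(n, 5, 100), 2)
)
all_data$Age <- round(50 - 8 * all_data$Pclass + rnorm(n, sd = 5))
all_data$Age[sample(n, 30)] <- NA

# Fit a regression tree on the rows where Age is known.
missing_age <- is.na(all_data$Age)
age_fit <- rpart(Age ~ Pclass + Sex + Fare,
                 data = all_data[!missing_age, ],
                 method = "anova")

# Fill the gaps with the tree's predictions.
all_data$Age[missing_age] <- predict(age_fit, newdata = all_data[missing_age, ])
```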
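Because the training rows were stacked first, the split is just a matter of row indices. In this sketch `all_data` is a placeholder with only an ID column, and `train2` is a name I assume to match `test2`.

```r
# Placeholder for the combined data: 891 training rows followed by 418 test rows.
all_data <- data.frame(PassengerId = 1:1309)

# drop = FALSE keeps the result a data frame even with a single column.
train2 <- all_data[1:891, , drop = FALSE]
test2 <- all_data[892:1309, , drop = FALSE]

print(dim(train2))
print(dim(test2))
```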

### Build a Decision Tree with rpart

Build the decision tree with rpart to predict `Survived` with the variables `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare` and `Embarked`.
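A sketch of the model call. The training data is randomly generated so the snippet runs on its own, and the object name `my_dt1` is an assumption. `method = "class"` tells `rpart` to build a classification tree.

```r
library(rpart)

# Made-up training data with the seven predictors used in the formula.
set.seed(42)
n <- 300
train2 <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex = factor(sample(c("male", "female"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70)),
  SibSp = sample(0:3, n, replace = TRUE),
  Parch = sample(0:2, n, replace = TRUE),
  Fare = round(runif(n, 5, 100), 2),
  Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
)
# Survival mostly driven by Sex in this toy example.
train2$Survived <- rbinom(n, 1, ifelse(train2$Sex == "female", 0.75, 0.2))

# Classification tree on the seven predictors.
my_dt1 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                data = train2,
                method = "class")
print(my_dt1)
```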

Load the packages needed to create a fancier visualization of the tree.

Visualize the decision tree as a fancy tree plot.

At the top, the root node votes 0, so at this level everyone is predicted to die. Below that, we see that 62% of the passengers died while 38% survived; because the majority died, the node votes that everyone dies. One level down, the tree splits on sex: 81% of the males and 26% of the females died, while 19% and 74%, respectively, survived. These proportions exactly match the ones we found earlier. Let’s look at those proportions again, rounded to two decimals.
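A possible way to draw such a plot is `fancyRpartPlot()` from the `rattle` package, which also relies on `rpart.plot` and `RColorBrewer`. That package choice is my assumption, so the sketch only plots when all three are installed. The tree is fitted on made-up data.

```r
library(rpart)

# Made-up data with a strong Sex effect so the tree has at least one split.
toy <- data.frame(
  Sex = factor(rep(c("male", "female"), each = 100)),
  Survived = c(rep(0, 80), rep(1, 20), rep(1, 75), rep(0, 25))
)
fit <- rpart(Survived ~ Sex, data = toy, method = "class")

# Assumed packages: rattle, rpart.plot and RColorBrewer.
needed <- c("rattle", "rpart.plot", "RColorBrewer")
if (all(sapply(needed, requireNamespace, quietly = TRUE))) {
  rattle::fancyRpartPlot(fit)
} else {
  message("Install rattle, rpart.plot and RColorBrewer to draw the fancy tree.")
}
```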

Those are the same numbers 🙂

Make the prediction using the `test2` set.

Create a data frame with two columns: `PassengerId` and `Survived`. `Survived` contains the predictions.

Check that your data frame has 418 entries.

Write the solution to a CSV file named `my_dt1.csv`.

This gives a score of 0.77512, which is a little better than our first submission.

Create a new decision tree, `my_dt2`, with some control parameters: `cp`, the complexity parameter that determines when splitting stops (a split must improve the fit by at least this amount), and `minsplit`, the minimum number of observations a node must contain before a split is attempted.

Make the prediction using `test2`, create the two-column data frame, check the number of rows, and save it to `my_dt2.csv`.
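Put together, the prediction and submission steps might look like this. The data and the one-variable tree are made up, the names `my_dt1`, `my_prediction` and `my_solution` are assumptions, and the file goes to a temporary directory in this sketch.

```r
library(rpart)

# Made-up training data and a 418-row stand-in for test2.
train2 <- data.frame(
  Sex = factor(rep(c("male", "female"), each = 100)),
  Survived = c(rep(0, 80), rep(1, 20), rep(1, 75), rep(0, 25))
)
test2 <- data.frame(
  PassengerId = 892:1309,
  Sex = factor(rep(c("male", "female"), length.out = 418),
               levels = levels(train2$Sex))
)

my_dt1 <- rpart(Survived ~ Sex, data = train2, method = "class")

# type = "class" returns the predicted label instead of class probabilities.
my_prediction <- predict(my_dt1, newdata = test2, type = "class")

# The prediction is a factor; convert it to plain 0/1 integers.
my_solution <- data.frame(
  PassengerId = test2$PassengerId,
  Survived = as.integer(as.character(my_prediction))
)

# Sanity check: one row per test passenger.
print(nrow(my_solution))

write.csv(my_solution, file = file.path(tempdir(), "my_dt1.csv"), row.names = FALSE)
```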
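The control parameters are passed through `rpart.control()`. The values `minsplit = 50` and `cp = 0` below are assumptions picked for illustration, and the data is made up.

```r
library(rpart)

# Made-up training data.
train2 <- data.frame(
  Sex = factor(rep(c("male", "female"), each = 100)),
  Pclass = rep(1:3, length.out = 200),
  Survived = c(rep(0, 80), rep(1, 20), rep(1, 75), rep(0, 25))
)

# cp = 0: no minimum improvement is required for a split.
# minsplit = 50: a node needs at least 50 observations before a split is tried.
my_dt2 <- rpart(Survived ~ Sex + Pclass,
                data = train2,
                method = "class",
                control = rpart.control(minsplit = 50, cp = 0))
print(my_dt2)
```

Setting `cp = 0` lets the tree grow as far as `minsplit` allows, which can overfit, so it is worth comparing the Kaggle score against the default tree.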