Titanic — Machine Learning from Disaster!

Mayank Jha
4 min read · Jan 29, 2021

Predicting survival on the Titanic and getting familiar with ML basics

This is the legendary Titanic ML competition, the best first challenge for diving into ML competitions and familiarising yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

The first step was to import the dataset and then move on to data visualization, which I achieved using pandas_profiling. The report can also be exported as HTML using the report.to_file command, which makes it easier to analyse the report later. Next comes the pre-processing part, which is arguably the most important part of the entire process.
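As a minimal sketch of that step (the file path and report title are assumptions, not taken from the notebook), the report can be generated and exported like this:

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Load the Kaggle Titanic training data (path is an assumption)
train = pd.read_csv("train.csv")

# Build the exploratory report and export it as HTML for later analysis
report = ProfileReport(train, title="Titanic Dataset Report")
report.to_file("titanic_report.html")
```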

The pre-processing can be divided into two parts: in the first part we take care of missing values, and in the second we encode categorical data. After visualizing the entire dataset, check the describe output above or the Sample section in our dataframe report, and you will see certain data points labeled NaN. These denote missing values. Different datasets encode missing values in different ways. Sometimes it may be a 9999, other times a 0, because real-world data can be very messy!

Important: Missing data is information absent from a dataset that could be important for the result of an analysis. Working with a dataset that has missing values is a problem of great relevance in data analysis, and the gaps can originate from different sources, such as failures in the collection system or problems integrating different sources. The point is: we must be careful to avoid bias in the results we seek.

The goal here is to figure out how best to process the data so our machine learning model can learn from it.
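As a hedged sketch of that processing, assuming the train DataFrame loaded above (median/mode imputation and dropping Cabin are common choices for this dataset, not necessarily the notebook's exact steps):

```python
# Count missing values per column
print(train.isnull().sum())

# Impute Age with the median and Embarked with the most frequent port
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

# Cabin is mostly missing, so one simple option is to drop it entirely
train = train.drop("Cabin", axis=1)
```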

Look at numeric and categorical values separately:

Numerical Features: Age, Fare, SibSp, Parch.

Categorical Features: Survived, Sex, Embarked, Pclass.

Alphanumeric Features (but categorical): Ticket, Cabin.

In our overview report, click on the tab “Warnings”:

  • Ticket and Cabin are high-cardinality features with many distinct values.
  • Age and Cabin have many missing values.
  • Name and ID have unique values.
  • SibSp, Parch and Fare have many zeros.

Feature Encoding

Documentation: pandas.get_dummies
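A minimal sketch of this encoding, continuing from the imputed train DataFrame above (dropping the identifier-like columns and using drop_first=True are illustrative choices, not necessarily the notebook's):

```python
import pandas as pd

# Drop identifier-like columns the models cannot use directly
train = train.drop(["Name", "Ticket", "PassengerId"], axis=1)

# One-hot encode the categorical columns into numeric indicator columns
train_encoded = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
print(train_encoded.head())
```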

Split the Train & Test datasets

Splitting a dataset into training and testing sets is very common, and you will do it on countless occasions. Even though in this problem we have separate training and test CSVs, we will apply this technique to our training dataset so we can get used to it.

train_test_split: the first argument is the feature data and the second is the target (labels). The test_size keyword argument specifies what proportion of the original data is used for the test set. Lastly, the random_state keyword argument sets a seed for the random number generator that splits the data into train and test sets.

When splitting the training data, we will use part of it (30% in this case) to test the accuracy of our different models, as in the sketch below.
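Putting that together (the feature and label names follow from the encoded frame above; test_size=0.3 matches the 30% split, and random_state=42 is an arbitrary seed, an assumption):

```python
from sklearn.model_selection import train_test_split

# Features are everything except the target; Survived is the label
X = train_encoded.drop("Survived", axis=1)
y = train_encoded["Survived"]

# Hold out 30% of the training data for validation, with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```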

Building our Machine Learning Models

Logistic Regression

Logistic regression measures the relationship between a categorical dependent variable (the label) and one or more independent variables (the features) by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.
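As a sketch of fitting this model with scikit-learn on the split from above (max_iter=1000 is just a convergence safeguard, an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit logistic regression on the 70% training split
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Evaluate on the 30% held-out split
print("Logistic Regression accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```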

Decision Tree

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.
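A matching sketch for a classification tree on the same split (max_depth=5 is an illustrative cap to limit overfitting, not the notebook's setting):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fit a classification tree on the same train/validation split
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print("Decision Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```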

The following results were obtained using the two models above and a few additional ones:

  • Logistic Regression: 82.4
  • Decision Trees: 79.03
  • Random Forest: 82.4
  • K-Nearest Neighbors: 76.78
  • Support Vector Machine: 67.04

The source code is available on GitHub: https://github.com/mayankjha-purdue/data_science/blob/master/titanic_kaggle.ipynb


Mayank Jha

Hi, I am a Data Analytics student at Purdue University. I intend to use this platform to showcase and learn from people in the data science community.