Logistic Regression

Titanic Survival

The sinking of the Titanic on April 15, 1912 is one of the most tragic events in history. The ship sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. So few people survived largely because there were not enough lifeboats for everyone on board. Some groups were more likely to survive than others, such as women, children, and upper-class passengers. This case study analyzes which kinds of people were likely to survive. The dataset includes the following columns:

  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex: Sex
  • Age: Age in years
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • Target class: Survived: Survival (0 = No, 1 = Yes)

Download Dataset

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Dataset

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/Train_Titanic.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Checking Missing Values

- Cabin is mostly missing and Embarked is not used by the model -> drop both columns after data visualisation
In [3]:
df.isnull().sum()
Out[3]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Checking Missing Values (Heatmap)

In [4]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

plt.show()

Handling Missing Values (Age)

In [5]:
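def Fill_Age(data):
    # data is one row holding the 'Age' and 'Sex' columns
    age = data['Age']
    sex = data['Sex']

    if pd.isnull(age):
        # Compare strings with ==; `is` tests object identity and is unreliable here
        if sex == 'male':
            return 29
        else:
            return 27
    else:
        return age


df['Age'] = df[['Age','Sex']].apply(Fill_Age,axis=1)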

df.isnull().sum()
Out[5]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
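
The fill values 29 and 27 are hard-coded in Fill_Age. As a minimal sketch, assuming one would rather derive a per-sex fill value from the data, the same idea can be written with groupby and fillna. The toy frame and its values below are made up for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, np.nan, 25.0, np.nan, 35.0],
})

# Median age within each sex, broadcast back to every row.
median_by_sex = toy.groupby("Sex")["Age"].transform("median")

# Fill only the missing ages with the median of the matching sex.
toy["Age"] = toy["Age"].fillna(median_by_sex)

print(toy)
```

This avoids a row-wise apply and keeps the fill values in sync with the data if the file changes.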

Explore the Dataset

Percentage of passengers Survived / Not Survived

In [6]:
survived = df[df['Survived'] == 1]

not_survived = df[df['Survived'] == 0]

print("Total =", len(df))

print("\nNumber of Survived passengers =", len(survived))
print("Percentage Survived = {:.2f}%".format(len(survived)*100/len(df)))
 
print("\nDid not Survive =", len(not_survived))
print("Percentage who did not survive = {:.2f}%".format(len(not_survived)*100/len(df)))
Total = 891

Number of Survived passengers = 342
Percentage Survived = 38.38%

Did not Survive = 549
Percentage who did not survive = 61.62%
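
The same counts and percentages can also be obtained without filtering the frame twice. The sketch below uses a toy target column (made-up values) and value_counts with normalize=True; the name summary is only illustrative.

```python
import pandas as pd

# Toy target column (made-up values): 1 = survived, 0 = did not survive.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# Absolute counts per class.
counts = toy["Survived"].value_counts()

# normalize=True returns fractions, so multiply by 100 for percentages.
shares = toy["Survived"].value_counts(normalize=True) * 100

summary = pd.DataFrame({"count": counts, "percent": shares.round(2)})
print(summary)
```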

Source

In [7]:
!pip install cufflinks
In [8]:
import cufflinks as cf

cf.go_offline()

Number of People Survived / Not Survived

In [9]:
survived = df[df['Survived']==1]['Survived'].value_counts()

dead = df[df['Survived']==0]['Survived'].value_counts()

df1 = pd.DataFrame([survived ,dead])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Number of Survived & Dead')

Number of People Survived based on Sex

- If you are female,

    - you have a higher chance of survival
In [10]:
survived_sex = df[df['Survived']==1]['Sex'].value_counts()

dead_sex = df[df['Survived']==0]['Sex'].value_counts()

df1 = pd.DataFrame([survived_sex,dead_sex])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Survival by Sex')
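
Stacked counts show how many people are in each group, but a "higher chance of survival" is a rate. Because Survived is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the dataset.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column within each group = share of survivors in that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern should work for the other grouping columns used in this section (Pclass, SibSp, Parch, Embarked) by changing the groupby key.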

Number of People Survived based on Class

- If you are a first-class passenger

     - you have a higher chance of survival
In [11]:
survived_pclass = df[df['Survived']==1]['Pclass'].value_counts()

dead_pclass = df[df['Survived']==0]['Pclass'].value_counts()

df1 = pd.DataFrame([survived_pclass, dead_pclass])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Survival by Pclass')

Number of People Survived based on Siblings Status

- If you have 1 sibling or spouse aboard (SibSp = 1)

    - you have a higher chance of survival compared to being alone (SibSp = 0)
In [12]:
survived_SibSp = df[df['Survived']==1]['SibSp'].value_counts()

dead_SibSp = df[df['Survived']==0]['SibSp'].value_counts()

df1 = pd.DataFrame([survived_SibSp, dead_SibSp])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Survival by Number of siblings / spouses aboard the Titanic')

Number of People Survived based on Parch Status (Parents/Children onboard)

- If you have 1 parent or child aboard (Parch = 1)

    - you have a higher chance of survival compared to being alone (Parch = 0)
In [13]:
survived_Parch = df[df['Survived']==1]['Parch'].value_counts()

dead_Parch = df[df['Survived']==0]['Parch'].value_counts()

df1 = pd.DataFrame([survived_Parch, dead_Parch])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Survival by Number of parents / children aboard the Titanic')

Number of People Survived based on the port they embarked from

- Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

- If you embarked from port "C"

    - you have a higher chance of survival compared to other ports!
In [14]:
survived_Embarked = df[df['Survived']==1]['Embarked'].value_counts()

dead_Embarked = df[df['Survived']==0]['Embarked'].value_counts()

df1 = pd.DataFrame([survived_Embarked, dead_Embarked])

df1.index = ['Survived','Dead']

df1.iplot(kind='bar',barmode='stack', title='Survival by Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton')

Number of People Survived based on Age

- If you are a baby

    - you have a higher chance of survival

Grouping Age

In [15]:
df['Age_Group'] = pd.cut(df['Age'], bins=[0,5,10,20,30,40,50,60,70,80])

df.head()
Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_Group
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S (20, 30]
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C (30, 40]
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S (20, 30]
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S (30, 40]
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S (30, 40]

Survival by Age Group

In [16]:
survived_Age_Group = df[df['Survived']==1]['Age_Group'].value_counts()

dead_Age_Group = df[df['Survived']==0]['Age_Group'].value_counts()

df1 = pd.DataFrame([survived_Age_Group, dead_Age_Group])

df1.index = ['Survived','Dead']

df['Age'].iplot(kind='hist',bins=30, xTitle='Age',color='skyblue')

df['Age'].iplot(kind='box', xTitle='Age',color='lightgreen')

df1.iplot(kind='bar',barmode='stack', title='Survival by Age Group')

Number of People Survived based on Fare

- If you pay a higher fare

    - you have a higher chance of survival

Grouping Fare

In [17]:
df['Fare_Group'] = pd.cut(df['Fare'], bins=[0, 50, 100, 200, 300, 600])

df.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_Group Fare_Group
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S (20, 30] (0, 50]
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C (30, 40] (50, 100]
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S (20, 30] (0, 50]
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S (30, 40] (50, 100]
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S (30, 40] (0, 50]
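
One detail of pd.cut is worth keeping in mind: by default the bins are open on the left and closed on the right, so the first bin here is (0, 50] and a value of exactly 0 falls outside every bin and becomes NaN. The sketch below uses a few made-up fares to show the effect and the include_lowest=True option; whether this matters depends on whether the file contains any zero fares.

```python
import pandas as pd

# Made-up fares, including one that is exactly 0.
fares = pd.Series([0.0, 7.25, 50.0, 71.28])
bins = [0, 50, 100, 200, 300, 600]

# Default: intervals are (left, right], so 0.0 is not in (0, 50].
default_groups = pd.cut(fares, bins=bins)

# include_lowest=True makes the first interval include its left edge.
inclusive_groups = pd.cut(fares, bins=bins, include_lowest=True)

print(default_groups.isna().tolist())
print(inclusive_groups.isna().tolist())
```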

Survival by Fare Group

In [18]:
survived_Fare_Group = df[df['Survived']==1]['Fare_Group'].value_counts()

dead_Fare_Group = df[df['Survived']==0]['Fare_Group'].value_counts()

df1 = pd.DataFrame([survived_Fare_Group, dead_Fare_Group])

df1.index = ['Survived','Dead']

df['Fare'].iplot(kind='hist',bins=30, xTitle='Fare', color='lightgreen')

df['Fare'].iplot(kind='box', xTitle='Fare',color='lightgreen')

df1.iplot(kind='bar',barmode='stack', title='Survival by Fare Group')

Handling Categorical Data - Dummy Variable (Sex)

- male: 1
- female: 0 
In [19]:
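# drop_first=True drops 'female' and keeps only the 'male' indicator; dtype=int gives 0/1
df['Male'] = pd.get_dummies(df['Sex'], drop_first = True, dtype = int)['male']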

df.head()
Out[19]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_Group Fare_Group Male
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S (20, 30] (0, 50] 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C (30, 40] (50, 100] 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S (20, 30] (0, 50] 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S (30, 40] (50, 100] 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S (30, 40] (0, 50] 1
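
To see what drop_first does, it helps to compare the full and the reduced encoding on a tiny made-up series. With two categories, one indicator column carries all the information, so the first (alphabetically, 'female') is dropped. This is only an illustrative sketch.

```python
import pandas as pd

# Tiny made-up series with the same two categories as the Sex column.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One indicator column per category.
full = pd.get_dummies(sex, dtype=int)

# drop_first=True removes the first category and keeps the rest.
reduced = pd.get_dummies(sex, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced["male"].tolist())
```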

Drop Unnecessary Columns

In [20]:
df.drop(['PassengerId','Name', 'Sex','Ticket','Cabin', 'Embarked', 'Age_Group', 'Fare_Group' ], axis = 1 , inplace = True)
        
df.head()        
Out[20]:
Survived Pclass Age SibSp Parch Fare Male
0 0 3 22.0 1 0 7.2500 1
1 1 1 38.0 1 0 71.2833 0
2 1 3 26.0 0 0 7.9250 0
3 1 1 35.0 1 0 53.1000 0
4 0 3 35.0 0 0 8.0500 1

Rearrange Columns

In [21]:
df = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Male', 'Survived']]

df.head()
Out[21]:
Pclass Age SibSp Parch Fare Male Survived
0 3 22.0 1 0 7.2500 1 0
1 1 38.0 1 0 71.2833 0 1
2 3 26.0 0 0 7.9250 0 1
3 1 35.0 1 0 53.1000 0 1
4 3 35.0 0 0 8.0500 1 0

Declaring the Dependent & the Independent Variables

In [22]:
X = df.iloc[:,:-1].values

y = df.iloc[:,-1].values

Splitting the Dataset into the Training Set and Test Set

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11)
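
With test_size = 0.2 and 891 rows, the split yields 712 training rows and 179 test rows. The sketch below reproduces those sizes on synthetic features and, as an optional variation that the cell above does not use, passes stratify so that both parts keep roughly the same share of survivors. X_demo and y_demo are stand-ins, not the real data; only the class counts (342 and 549) come from the exploration above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows, features and class counts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(891, 6))
y_demo = np.array([1] * 342 + [0] * 549)

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```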

Feature Scaling

In [24]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
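
Note the difference between fit_transform on the training set and transform on the test set: the mean and standard deviation are learned from the training data only and then reused. A minimal sketch with made-up numbers that checks this by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean 20 and std of about 8.165
test_scaled = scaler.transform(test)        # reuses the training statistics

# Same computation by hand, using only the training mean and std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Fitting the scaler on the test set as well would let information from the test data leak into preprocessing.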

Training the Logistic Regression Model

In [25]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state = 0)

model.fit(X_train, y_train)
Out[25]:
LogisticRegression(random_state=0)
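
Under the hood, a fitted binary logistic regression computes a linear score from the features and passes it through the sigmoid function to get a probability for class 1. The sketch below checks this against predict_proba on synthetic data from make_classification; X_demo, y_demo and clf are illustrative names, not part of the case study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 6 features, like the feature matrix above.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Linear score z = w . x + b, then sigmoid(z) = 1 / (1 + exp(-z)).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba holds the probability of class 1.
p_sklearn = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# predict() corresponds to thresholding that probability at 0.5.
labels_manual = (p_manual > 0.5).astype(int)
print(np.array_equal(labels_manual, clf.predict(X_demo)))
```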

Predicting the Test Set Results

In [26]:
y_pred = model.predict(X_test)

Confusion Matrix

In [27]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f}%".format(accuracy*100))

sns.heatmap(cm, annot = True, fmt="d")

plt.show()
Accuracy is: 84.36%

Classification Report

In [28]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.89      0.87      0.88       118
           1       0.76      0.79      0.77        61

    accuracy                           0.84       179
   macro avg       0.82      0.83      0.83       179
weighted avg       0.84      0.84      0.84       179
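
Every figure in the report can be derived from the four cells of the confusion matrix. The sketch below does this for class 1 on ten made-up labels (not the test set above) and cross-checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # of predicted survivors, how many survived
recall = tp / (tp + fn)                     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```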

k-Fold Cross Validation

In [29]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 77.67 %
Standard Deviation: 5.49 %
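
In the cell above, X_train was standardized once before being split into folds, so each validation fold has already contributed to the scaling statistics. A common alternative, sketched here on synthetic data and not the approach used in this case study, is to wrap the scaler and the model in a pipeline so that scaling is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in for the training features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The pipeline refits StandardScaler on the training part of each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)

print("Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
```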

Making Prediction (New Dataset)

Importing New Dataset

In [30]:
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/Test_Titanic.csv"

new_data = pd.read_csv(url)

new_data.head()
Out[30]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

Dropping unnecessary columns from New Dataset

In [31]:
new_data.drop(['PassengerId','Name', 'Ticket','Cabin', 'Embarked' ], axis = 1 , inplace = True)
        
new_data.head()       
Out[31]:
Pclass Sex Age SibSp Parch Fare
0 3 male 34.5 0 0 7.8292
1 3 female 47.0 1 0 7.0000
2 2 male 62.0 0 0 9.6875
3 3 male 27.0 0 0 8.6625
4 3 female 22.0 1 1 12.2875

Checking Missing Values

In [32]:
new_data.isnull().sum()
Out[32]:
Pclass     0
Sex        0
Age       86
SibSp      0
Parch      0
Fare       1
dtype: int64

Handling Missing Values (Age & Fare)

In [33]:
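def Fill_Age(data):
    # data is one row holding the 'Age' and 'Sex' columns
    age = data['Age']
    sex = data['Sex']

    if pd.isnull(age):
        # Compare strings with ==; `is` tests object identity and is unreliable here
        if sex == 'male':
            return 29
        else:
            return 27
    else:
        return age


new_data['Age'] = new_data[['Age','Sex']].apply(Fill_Age,axis=1)

# Drop the remaining row with a missing Fare
new_data = new_data.dropna(axis=0) 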

new_data.isnull().sum()
Out[33]:
Pclass    0
Sex       0
Age       0
SibSp     0
Parch     0
Fare      0
dtype: int64
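
Dropping the row with the missing Fare means that passenger gets no prediction. As a hedged alternative, the gap could be filled with a statistic taken from the training data, for example the median fare. The sketch below uses made-up numbers; train_fares and new_rows are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up fares standing in for the training column.
train_fares = pd.Series([7.25, 71.28, 8.05, 53.10])

# Made-up new rows, one of which has no fare.
new_rows = pd.DataFrame({"Pclass": [3, 3, 1], "Fare": [7.83, np.nan, 30.0]})

# Use the training median so no information from the new data is needed.
fare_median = train_fares.median()
new_rows["Fare"] = new_rows["Fare"].fillna(fare_median)

print(new_rows)
```

All three rows are kept, so every passenger in the new data can still receive a prediction.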

Handling Categorical Data - Dummy Variable (Sex)

In [34]:
# Same encoding as for the training data: 1 = male, 0 = female
new_data['Male'] = pd.get_dummies(new_data['Sex'], drop_first = True, dtype = int)['male']

new_data.drop(['Sex'], axis = 1, inplace = True)
         
new_data.head()
Out[34]:
Pclass Age SibSp Parch Fare Male
0 3 34.5 0 0 7.8292 1
1 3 47.0 1 0 7.0000 0
2 2 62.0 0 0 9.6875 1
3 3 27.0 0 0 8.6625 1
4 3 22.0 1 1 12.2875 0

Declaring Independent Variables

In [35]:
new_data_X = new_data.iloc[:,:].values

Feature Scaling (New Data)

In [36]:
new_data_X = sc.transform(new_data_X)

Predicting Dependent Variable (Survived)

In [37]:
new_data_y_pred = model.predict(new_data_X)

new_data['predicted_Survive'] = new_data_y_pred

new_data.head()
Out[37]:
Pclass Age SibSp Parch Fare Male predicted_Survive
0 3 34.5 0 0 7.8292 1 0
1 3 47.0 1 0 7.0000 0 0
2 2 62.0 0 0 9.6875 1 0
3 3 27.0 0 0 8.6625 1 0
4 3 22.0 1 1 12.2875 0 1

Number of Predicted Survive / Not Survive

In [38]:
survive = new_data[new_data['predicted_Survive']==1]['predicted_Survive'].value_counts()

not_survive = new_data[new_data['predicted_Survive']==0]['predicted_Survive'].value_counts()

df1 = pd.DataFrame([survive , not_survive ])

df1.index = ['Survive','Not Survive']

df1.iplot(kind='bar',barmode='stack', title='Number of Predicted Survive & Not Survive')