Decision Tree Classification

Kyphosis Disease

- Kyphosis is an abnormally excessive convex curvature of the spine. The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery. The dataset contains 3 inputs and 1 output.

INPUTS: 
- Age: in months
- Number: the number of vertebrae involved
- Start: the number of the first (topmost) vertebra operated on.

OUTPUTS:
- Kyphosis: a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation.

Download Dataset

Source 1: John M. Chambers and Trevor J. Hastie, eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA.

Source 2: Dr. Ryan @STEMplicity

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Dataset

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/kyphosis.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15

Rearranging Columns

In [3]:
df = df[['Age', 'Number', 'Start', 'Kyphosis']]

df.head()
Out[3]:
Age Number Start Kyphosis
0 71 3 5 absent
1 158 3 14 absent
2 128 4 5 present
3 2 5 1 absent
4 1 4 15 absent

Label Encoding the Dependent Variable

In [4]:
from sklearn.preprocessing import LabelEncoder

LabelEncoder_y = LabelEncoder()

df['Kyphosis'] = LabelEncoder_y.fit_transform(df['Kyphosis'])
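The later cells treat 1 as "present" and 0 as "absent". A minimal sketch of why that holds, assuming the labels are exactly the strings shown in the preview above: LabelEncoder assigns codes in sorted order of the class names. The names `encoder`, `sample_labels` and `mapping` are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels copied from the first five rows of the preview
sample_labels = ['absent', 'absent', 'present', 'absent', 'absent']

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)            # {'absent': 0, 'present': 1}
print(encoded.tolist())   # [0, 0, 1, 0, 0]
```

Calling `encoder.inverse_transform` on the codes recovers the original strings, which can help when labelling plots.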

Percentage of Disease present/absent after operation

In [5]:
Kyphosis_True = df[df['Kyphosis']== 1]

Kyphosis_False = df[df['Kyphosis']== 0]

print('Total:', len(df))

print('\nKyphosis Present:', len(Kyphosis_True))
print( 'Disease present after operation percentage = {:.2f} %'.format((len(Kyphosis_True) / len(df))*100))

print('\nKyphosis Absent:', len(Kyphosis_False))
print( 'Disease absent after operation percentage = {:.2f} %'.format((len(Kyphosis_False) / len(df))*100))

sns.countplot(x='Kyphosis', data=df, palette='Set1')

plt.show()
Total: 81

Kyphosis Present: 17
Disease present after operation percentage = 20.99 %

Kyphosis Absent: 64
Disease absent after operation percentage = 79.01 %

Heatmap (The Relationship between Variables)

In [6]:
corr= df.corr()

matrix = np.triu(corr)

sns.heatmap(corr, annot=True, mask=matrix)

plt.show()

Declaring the Dependent & the Independent Variables

In [7]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the Dataset into the Training Set and Test Set

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 8)
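With only 17 positive cases out of 81, a purely random split can leave very few positives in the test set. One possible variation, sketched here on stand-in arrays with the same class counts (64 absent, 17 present) rather than the real data, is to pass `stratify` so both subsets keep roughly the original class ratio. The names `X_demo`, `y_demo` and the `_tr`/`_te` suffixes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the notebook's class counts: 64 absent (0), 17 present (1)
y_demo = np.array([0] * 64 + [1] * 17)
X_demo = np.arange(len(y_demo) * 3).reshape(-1, 3)

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=8, stratify=y_demo
)

print("train size:", len(y_tr), "positives:", int(y_tr.sum()))
print("test size: ", len(y_te), "positives:", int(y_te.sum()))
```

Stratifying does not add data; it only makes the class ratio in the test set less dependent on the random seed.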

Feature Scaling

In [9]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Training the Decision Tree Classification Model

In [10]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train, y_train)
Out[10]:
DecisionTreeClassifier()

Feature Importances

In [11]:
df_feature = df.drop('Kyphosis', axis=1)

feature_importances = pd.DataFrame(data = df_feature.columns.values, columns = ['Features'])

feature_importances['Importance'] =  model.feature_importances_

feature_importances.sort_values('Importance',ascending=False)
Out[11]:
Features Importance
2 Start 0.424120
0 Age 0.369385
1 Number 0.206494

Predicting the Test Set Results

In [12]:
y_pred = model.predict(X_test)

Confusion Matrix

In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f} %".format(accuracy*100))

sns.heatmap(cm, annot=True, fmt='d')

plt.show()
Accuracy is: 64.71 %

Classification Report

In [14]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.90      0.64      0.75        14
           1       0.29      0.67      0.40         3

    accuracy                           0.65        17
   macro avg       0.59      0.65      0.57        17
weighted avg       0.79      0.65      0.69        17
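The figures in the report can be tied back to confusion-matrix counts. The sketch below uses counts inferred from the report's support and recall values (9 of 14 absent cases and 2 of 3 present cases classified correctly); they are not read from the notebook's heatmap, so treat `cm_demo` as an assumption.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn convention).
# Counts inferred from the classification report, not taken from the heatmap.
cm_demo = np.array([[9, 5],
                    [1, 2]])

tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision_1 = tp / (tp + fp)   # of the cases predicted present, how many were present
recall_1 = tp / (tp + fn)      # of the cases truly present, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# accuracy=0.65 precision=0.29 recall=0.67 f1=0.40
```

The low precision for class 1 reflects the 5 absent cases predicted as present, set against only 2 true positives.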

K-Fold Cross Validation

In [15]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 9)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 74.80 %
Standard Deviation: 14.88 %
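The large standard deviation is plausible given how small each fold is: splitting about 64 training rows into 9 folds leaves only a handful of rows per validation fold. The sketch below uses synthetic stand-in data (not the kyphosis set) to show that, for a classifier, an integer `cv` behaves like an unshuffled `StratifiedKFold`, and to list the fold sizes. All `_demo` names and the fixed `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 64 rows, 3 features, roughly 80/20 class balance
X_demo, y_demo = make_classification(
    n_samples=64, n_features=3, n_informative=3, n_redundant=0,
    n_repeated=0, weights=[0.8, 0.2], random_state=0
)

clf = DecisionTreeClassifier(random_state=0)

# For classifiers, an integer cv is resolved to StratifiedKFold without shuffling
scores_int = cross_val_score(clf, X_demo, y_demo, cv=9)
scores_explicit = cross_val_score(clf, X_demo, y_demo, cv=StratifiedKFold(n_splits=9))

# Number of validation rows in each of the 9 folds
fold_sizes = [len(test_idx)
              for _, test_idx in StratifiedKFold(n_splits=9).split(X_demo, y_demo)]

print("fold sizes:", fold_sizes)
print("same scores:", np.allclose(scores_int, scores_explicit))
print("mean: {:.2f} %, std: {:.2f} %".format(scores_int.mean() * 100,
                                              scores_int.std() * 100))
```

With so few rows per fold, one misclassified sample shifts that fold's accuracy by more than ten percentage points, so a wide spread across folds is expected.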

Improving the Model

- Try a Random Forest classification algorithm, which may give higher accuracy

- Random forests

    - are a strong modeling technique

    - are typically more robust than a single decision tree

    - aggregate many decision trees, which limits overfitting (error due to variance) while keeping bias roughly at the level of the individual trees
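The random forest suggestion can be sketched as follows. This is a minimal comparison on synthetic stand-in data with a similar size and class balance, not a result for the kyphosis set. The `_demo` names, `n_estimators=200` and the fixed seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 81 rows, 3 features, roughly 79/21 class balance
X_demo, y_demo = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0,
    n_repeated=0, weights=[0.79, 0.21], random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Same 5-fold stratified splits for both models
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print("Tree   accuracy: {:.2f} % (std {:.2f} %)".format(
    tree_scores.mean() * 100, tree_scores.std() * 100))
print("Forest accuracy: {:.2f} % (std {:.2f} %)".format(
    forest_scores.mean() * 100, forest_scores.std() * 100))
```

On the real data the same pattern would apply with `X_train` and `y_train`. Whether the forest actually scores higher there has to be checked rather than assumed.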
