Decision Tree Classification¶

Kyphosis Disease¶

- Kyphosis is an abnormally excessive convex curvature of the spine. The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery. The dataset contains 3 inputs and 1 output.

INPUTS: 
- Age: in months
- Number: the number of vertebrae involved
- Start: the number of the first (topmost) vertebra operated on.

OUTPUTS:
- Kyphosis: a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation.


Source 1: John M. Chambers and Trevor J. Hastie, eds. (1992). Statistical Models in S. Wadsworth and Brooks/Cole, Pacific Grove, CA.

Source 2: Dr. Ryan @STEMplicity

Importing the Relevant Libraries¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Importing the Dataset¶

url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/kyphosis.csv"

df = pd.read_csv(url)

df.head()

Rearranging Columns¶

df = df[['Age', 'Number', 'Start', 'Kyphosis']]

df.head()

Label Encoding the Dependent Variable¶

from sklearn.preprocessing import LabelEncoder

LabelEncoder_y = LabelEncoder()

df['Kyphosis'] = LabelEncoder_y.fit_transform(df['Kyphosis'])
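The encoding step above replaces the text labels with integers. As a minimal sketch on a toy series (the values below are illustrative, not the real column), LabelEncoder assigns codes in sorted label order, so 'absent' should map to 0 and 'present' to 1. The names `labels`, `encoder`, and `mapping` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Kyphosis' column (illustrative values only)
labels = pd.Series(['absent', 'present', 'absent', 'absent', 'present'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position of each
# label in classes_ is the integer code assigned to it
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original text labels from the codes
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```

Checking the mapping this way makes it explicit which class the later filters `== 1` and `== 0` refer to.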

Percentage of Disease present/absent after operation¶

Kyphosis_True = df[df['Kyphosis']== 1]

Kyphosis_False = df[df['Kyphosis']== 0]

print('Total:', len(df))

print('\nKyphosis Present:', len(Kyphosis_True))
print( 'Disease present after operation percentage = {:.2f} %'.format((len(Kyphosis_True) / len(df))*100))

print('\nKyphosis Absent:', len(Kyphosis_False))
print( 'Disease absent after operation percentage = {:.2f} %'.format((len(Kyphosis_False) / len(df))*100))

sns.countplot(x='Kyphosis', data=df, palette='Set1')

plt.show()

Total: 81

Kyphosis Present: 17
Disease present after operation percentage = 20.99 %

Kyphosis Absent: 64
Disease absent after operation percentage = 79.01 %
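The same counts and percentages can also be obtained in one step with `value_counts`. The sketch below rebuilds a toy encoded column with the class counts printed above (64 absent, 17 present) rather than loading the CSV; `kyphosis` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy encoded column with the same class counts as the printed output
kyphosis = pd.Series([0] * 64 + [1] * 17, name='Kyphosis')

# Absolute counts per class
counts = kyphosis.value_counts()

# Relative frequencies, converted to percentages
percent = kyphosis.value_counts(normalize=True).mul(100).round(2)

summary = pd.DataFrame({'count': counts, 'percent': percent})
print(summary)
```

Roughly one row in five is a positive case, so accuracy alone can be misleading on this dataset.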

Heatmap (The Relationship between Variables)¶

corr= df.corr()

matrix = np.triu(corr)

sns.heatmap(corr, annot=True, mask=matrix)

plt.show()
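The heatmap cell hides the upper triangle by passing `np.triu(corr)` as the mask, which masks a cell only when its value is non-zero. A slightly more explicit variant, sketched here on random stand-in data (not the kyphosis values), builds a boolean mask from the shape alone. The names `demo` and `mask` are hypothetical.

```python
import numpy as np
import pandas as pd

# Random stand-in data with the same column names as the dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(81, 4)),
                    columns=['Age', 'Number', 'Start', 'Kyphosis'])

corr = demo.corr()

# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# The mask could then be passed as sns.heatmap(corr, annot=True, mask=mask)
```

Because the mask depends only on position, a correlation that happens to be exactly zero in the upper triangle is still hidden.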

Declaring the Dependent & the Independent Variables¶

X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the Dataset into the Training Set and Test Set¶

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 8)
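With only 17 positive cases in 81 rows, a purely random split can leave the test set with very few positives. One option, shown here as a sketch on synthetic data with the same class counts, is the `stratify` argument of `train_test_split`, which keeps the class ratio similar in both parts. The arrays `X_demo` and `y_demo` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 81 rows, 3 features, 64 negatives and 17 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(81, 3))
y_demo = np.array([0] * 64 + [1] * 17)

# stratify=y_demo asks for roughly the same class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=8, stratify=y_demo
)

print('Test size:', len(y_te))
print('Positive share in train: {:.2f}'.format(y_tr.mean()))
print('Positive share in test: {:.2f}'.format(y_te.mean()))
```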

Feature Scaling¶

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
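The scaler is fitted on the training set only and then reused on the test set, so the test rows are standardized with the training mean and standard deviation. A minimal sketch on synthetic data (the feature means and spreads below are made up) checks this against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three features on different scales (made-up values)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[80, 4, 11], scale=[50, 1.5, 5], size=(64, 3))
X_te = rng.normal(loc=[80, 4, 11], scale=[50, 1.5, 5], size=(17, 3))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learns mean and std from train
X_te_scaled = scaler.transform(X_te)       # reuses the training statistics

# Manual equivalent: subtract the training mean, divide by the training std
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(X_te_scaled, manual))
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Decision trees split on thresholds, so they generally do not need scaled inputs; the step is harmless here and keeps the pipeline comparable with scale-sensitive models.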

Training the Decision Tree Classification Model¶

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

DecisionTreeClassifier()

Feature Importances¶

df_feature = df.drop('Kyphosis', axis=1)

feature_importances = pd.DataFrame(data = df_feature.columns.values, columns = ['Features'])

feature_importances['Importance'] =  model.feature_importances_

feature_importances.sort_values('Importance',ascending=False)
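An equivalent, more compact way to view the importances is a pandas Series indexed by feature name. The sketch below fits a tree on synthetic data, so the numbers it prints are not the notebook's results. `X_demo`, `y_demo`, and `tree` are hypothetical names, and `random_state=0` is an added assumption to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_names = ['Age', 'Number', 'Start']

# Synthetic stand-in data: positives are shifted so the tree has signal to find
rng = np.random.default_rng(0)
y_demo = np.array([0] * 64 + [1] * 17)
X_demo = rng.normal(size=(81, 3))
X_demo[y_demo == 1] += 1.0

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Importances are normalized, so they sum to 1 across the features
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```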

Predicting the Test Set Results¶

y_pred = model.predict(X_test)

Confusion Matrix¶

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f} %".format(accuracy*100))

sns.heatmap(cm, annot=True, fmt='d')

plt.show()

Accuracy is: 64.71 %

Classification Report¶

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.64      0.75        14
           1       0.29      0.67      0.40         3

    accuracy                           0.65        17
   macro avg       0.59      0.65      0.57        17
weighted avg       0.79      0.65      0.69        17
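The figures in the report follow directly from the confusion matrix. The counts below are not printed anywhere in the notebook; they are reconstructed from the report (support of 14 and 3, recall of 0.64 and 0.67, precision of 0.90 and 0.29), so treat them as an inference. The sketch recomputes the metrics for class 1 by hand.

```python
import numpy as np

# Confusion matrix reconstructed from the report (rows: actual, cols: predicted)
cm = np.array([[9, 5],
               [1, 2]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions / all predictions
precision_1 = tp / (tp + fp)             # of predicted positives, how many are real
recall_1 = tp / (tp + fn)                # of real positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print('Accuracy: {:.2f} %'.format(accuracy * 100))
print('Class 1 precision: {:.2f}'.format(precision_1))
print('Class 1 recall: {:.2f}'.format(recall_1))
print('Class 1 f1-score: {:.2f}'.format(f1_1))
```

With only 3 positive cases in the test set, a single changed prediction moves the class 1 recall by a third, so these numbers are very noisy.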

K-Fold Cross Validation¶

from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 9)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 74.80 %
Standard Deviation: 14.88 %
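When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below makes the splitter explicit and adds shuffling with a fixed seed, using synthetic data in place of the training set. `X_demo`, `y_demo`, and `folds` are hypothetical names, and the printed scores are not the notebook's results.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 64-row training set with a minority class
rng = np.random.default_rng(0)
y_demo = np.array([0] * 51 + [1] * 13)
X_demo = rng.normal(size=(64, 3))
X_demo[y_demo == 1] += 1.0

# Explicit splitter: 9 stratified folds, shuffled with a fixed seed
folds = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(tree, X_demo, y_demo, cv=folds)

print('Fold accuracies:', scores.round(2))
print('Accuracy: {:.2f} %'.format(scores.mean() * 100))
print('Standard Deviation: {:.2f} %'.format(scores.std() * 100))
```

Each fold holds only about seven rows here, which is one reason the spread across folds tends to be large on a dataset this small.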

Improving the Model¶

- A Random Forest classifier may give higher accuracy than a single decision tree

- Random forests are

    - a strong modeling technique

    - much more robust than a single decision tree

    - ensembles that aggregate many decision trees to limit overfitting as well as error due to bias, and therefore yield useful results
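The suggestion above can be tried with scikit-learn's `RandomForestClassifier`. This is only a sketch on synthetic data with the same shape and class counts as the dataset, so it says nothing about the accuracy achievable on the real kyphosis data. `n_estimators=200`, `cv=5`, and all variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 81 rows, 3 features, 64 negatives and 17 positives
rng = np.random.default_rng(0)
y_demo = np.array([0] * 64 + [1] * 17)
X_demo = rng.normal(size=(81, 3))
X_demo[y_demo == 1] += 1.0

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Compare both models with the same cross-validation setup
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print('Tree:   {:.2f} % (+/- {:.2f})'.format(tree_scores.mean() * 100,
                                             tree_scores.std() * 100))
print('Forest: {:.2f} % (+/- {:.2f})'.format(forest_scores.mean() * 100,
                                             forest_scores.std() * 100))
```

Comparing both models under the same folds, rather than on a single 17-row test set, gives a fairer picture of whether the forest actually helps.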


Feature importances (output of the Feature Importances cell):

  Features  Importance
2    Start    0.424120
0      Age    0.369385
1   Number    0.206494