- Kyphosis is an abnormally excessive convex curvature of the spine. The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery. The dataset contains 3 inputs and 1 output.
INPUTS:
- Age: in months
- Number: the number of vertebrae involved
- Start: the number of the first (topmost) vertebra operated on.
OUTPUT:
- Kyphosis: a factor with levels "absent" and "present", indicating whether a kyphosis (a type of deformation) was present after the operation.
Source 1: John M. Chambers and Trevor J. Hastie, eds. (1992), Statistical Models in S, Wadsworth & Brooks/Cole, Pacific Grove, CA.
Source 2: Dr. Ryan @STEMplicity
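The schema described above can be checked before modeling. The following is a minimal sketch, not part of the original notebook: `check_kyphosis_frame` and `EXPECTED_COLUMNS` are hypothetical names, and the toy frame is made up for illustration and does not contain rows from the dataset.

```python
import pandas as pd

# Column names taken from the dataset description; the constant name is hypothetical.
EXPECTED_COLUMNS = ["Age", "Number", "Start", "Kyphosis"]


def check_kyphosis_frame(df):
    """Validate the described schema and return the columns with the target last."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    levels = set(df["Kyphosis"].unique())
    if not levels <= {"absent", "present"}:
        raise ValueError(f"unexpected Kyphosis levels: {sorted(levels)}")
    return df[EXPECTED_COLUMNS]


# Tiny made-up frame for illustration only (not real measurements).
toy = pd.DataFrame({
    "Kyphosis": ["absent", "present"],
    "Age": [12, 80],
    "Number": [3, 5],
    "Start": [10, 4],
})
checked = check_kyphosis_frame(toy)
print(checked.columns.tolist())
```

Running the same check on the loaded data frame would catch a renamed column or an unexpected label before the encoding step.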
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/Machine_Learning/Classification_Models_CaseStudies/kyphosis.csv"
df = pd.read_csv(url)
df.head()
df = df[['Age', 'Number', 'Start', 'Kyphosis']]
df.head()
from sklearn.preprocessing import LabelEncoder
# LabelEncoder sorts the class labels, so 'absent' -> 0 and 'present' -> 1
LabelEncoder_y = LabelEncoder()
df['Kyphosis'] = LabelEncoder_y.fit_transform(df['Kyphosis'])
Kyphosis_True = df[df['Kyphosis']== 1]
Kyphosis_False = df[df['Kyphosis']== 0]
print('Total:', len(df))
print('\nKyphosis Present:', len(Kyphosis_True))
print( 'Disease present after operation percentage = {:.2f} %'.format((len(Kyphosis_True) / len(df))*100))
print('\nKyphosis Absent:', len(Kyphosis_False))
print( 'Disease absent after operation percentage = {:.2f} %'.format((len(Kyphosis_False) / len(df))*100))
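# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x='Kyphosis', data=df, palette='Set1')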
plt.show()
corr= df.corr()
matrix = np.triu(corr)
sns.heatmap(corr, annot=True, mask=matrix)
plt.show()
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 8)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
df_feature = df.drop('Kyphosis', axis=1)
feature_importances = pd.DataFrame(data = df_feature.columns.values, columns = ['Features'])
feature_importances['Importance'] = model.feature_importances_
feature_importances.sort_values('Importance',ascending=False)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy is: {:.2f} %".format(accuracy*100))
sns.heatmap(cm, annot=True, fmt='d')
plt.show()
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 9)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
- A Random Forest classifier is worth trying next, since it often achieves higher accuracy than a single decision tree
- Random forests are
  - a strong modeling technique
  - much more robust than a single decision tree
- They aggregate many decision trees, each trained on a bootstrap sample of the data, to limit overfitting. Averaging the trees mainly reduces the error due to variance, so the results tend to be more stable
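The comparison suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: `compare_tree_and_forest` is a hypothetical helper, and the demo uses synthetic stand-in data from `make_classification` (81 rows, 3 features) rather than the kyphosis measurements, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def compare_tree_and_forest(X, y, cv=5, seed=8):
    """Return the mean cross-validated accuracy of a single tree and a forest."""
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT the kyphosis data): same shape, imbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=8,
)
scores = compare_tree_and_forest(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score * 100:.2f} %")
```

In the notebook, the same helper could be called with `X_train` and `y_train`. Whether the forest actually beats the single tree depends on the data, so the two scores should be compared rather than assumed.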