XGBoost Classification

Breast Cancer Wisconsin

Installing XGBoost

In [1]:
!pip install xgboost
Requirement already satisfied: xgboost in /home/bahar/anaconda3/lib/python3.7/site-packages (1.2.1)
Requirement already satisfied: scipy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.5.0)
Requirement already satisfied: numpy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.18.5)
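
If you manage packages with conda instead of pip, XGBoost is also available from the conda-forge channel:

conda install -c conda-forge xgboost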

Attribute Information:

1) Sample code number: ID number
2) Clump Thickness: 1 - 10
3) Uniformity of Cell Size: 1 - 10
4) Uniformity of Cell Shape: 1 - 10
5) Marginal Adhesion: 1 - 10
6) Single Epithelial Cell Size: 1 - 10
7) Bare Nuclei: 1 - 10
8) Bland Chromatin: 1 - 10
9) Normal Nucleoli: 1 - 10
10) Mitoses: 1 - 10
11) Class: 2 = benign, 4 = malignant

Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd

Importing the Dataset

In [3]:
url = "https://DataScienceSchools.github.io/Machine_Learning/Sklearn/Case_Study/Classification/BreastCancerWisconsin/BreastCancer.csv"

# Load the CSV directly from the URL and preview the first five rows
dataset = pd.read_csv(url)

dataset.head()
Out[3]:
   Sample code number  Clump Thickness  Uniformity of Cell Size  \
0             1000025                5                        1
1             1002945                5                        4
2             1015425                3                        1
3             1016277                6                        8
4             1017023                4                        1

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
0                         1                  1                            2
1                         4                  5                            7
2                         1                  1                            2
3                         8                  1                            3
4                         1                  3                            2

   Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0            1                3                1        1      2
1           10                3                2        1      2
2            2                3                1        1      2
3            4                3                7        1      2
4            1                3                1        1      2
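
In the original UCI release of this dataset, the Bare Nuclei column contains a few missing values encoded as "?"; whether the hosted CSV has already been cleaned is an assumption worth checking. A quick sanity check:

# Column dtypes, non-null counts, and the class balance
dataset.info()

dataset['Class'].value_counts()   # 2 = benign, 4 = malignant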

Declaring the Dependent and Independent Variables

In [4]:
# Features: every column except the last (note this keeps the Sample code number column)
X = dataset.iloc[:, :-1].values

# Target: the Class column (2 = benign, 4 = malignant)
y = dataset.iloc[:, -1].values
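
Because Sample code number is an identifier rather than a measurement, a common alternative drops it and maps the 2/4 labels to 0/1 (which XGBoost releases from 1.6 onwards require in any case). This is a sketch; the results below were produced with the slicing above:

# Drop the ID column and encode Class as 0/1 (1 = malignant)
X = dataset.iloc[:, 1:-1].values
y = (dataset.iloc[:, -1].values == 4).astype(int)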

Splitting the Dataset into the Training Set and Test Set

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
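
Benign cases outnumber malignant ones roughly two to one in this dataset, so passing stratify=y keeps the class proportions the same in both splits. The results below use the unstratified split above:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 0, stratify = y)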

Training the XGBoost Classification Model

In [6]:
from xgboost import XGBClassifier

# Instantiate the classifier with its default hyperparameters (echoed in Out[6] below)
model = XGBClassifier()

# Fit on the training set; XGBoost 1.2.1 encodes the 2/4 labels internally
model.fit(X_train, y_train)
Out[6]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
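
The internal label encoding used here was removed in later releases: from XGBoost 1.6 onwards, fitting raw 2/4 labels raises a ValueError, so on a current version the target must be mapped to 0/1 first (see the sketch after the variable declarations above):

# On XGBoost >= 1.6, binary labels must already be 0/1
model = XGBClassifier(random_state = 0)
model.fit(X_train, (y_train == 4).astype(int))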

Predicting the Test Set Results

In [7]:
# Hard class predictions (2 = benign, 4 = malignant) for the test set
y_pred = model.predict(X_test)
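
Beyond hard labels, the model can also return class-membership probabilities, which are useful for ranking cases or choosing a custom decision threshold:

# Probability of each class per test sample; column order follows model.classes_
y_proba = model.predict_proba(X_test)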

Confusion Matrix

In [8]:
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows of the confusion matrix are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: ", accuracy, "\n\n Confusion Matrix:\n\n ", cm)
Accuracy is:  0.9473684210526315 

 Confusion Matrix:

  [[103   4]
 [  5  59]]
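
Accuracy alone hides how the two error types (missed malignancies vs. false alarms) are distributed; per-class precision and recall make them explicit. For example:

from sklearn.metrics import classification_report

# Classes are reported in sorted order, so 2 (benign) precedes 4 (malignant)
print(classification_report(y_test, y_pred, target_names = ['benign', 'malignant']))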

K-Fold Cross Validation

In [9]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation on the training set gives a more stable accuracy estimate
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 96.29 %
Standard Deviation: 2.84 %
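
The cross-validated score above uses the default hyperparameters; a natural next step is tuning a few of them. A minimal sketch (the grid values are illustrative, not recommendations tuned for this dataset):

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 6],
              'learning_rate': [0.1, 0.3],
              'n_estimators': [100, 200]}

grid = GridSearchCV(estimator = XGBClassifier(), param_grid = param_grid,
                    cv = 10, scoring = 'accuracy')

grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)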