XGBoost Classification

Breast Cancer Wisconsin

Installing XGBoost

In [1]:
!pip install xgboost
Requirement already satisfied: xgboost in /home/bahar/anaconda3/lib/python3.7/site-packages (1.2.1)
Requirement already satisfied: scipy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.5.0)
Requirement already satisfied: numpy in /home/bahar/anaconda3/lib/python3.7/site-packages (from xgboost) (1.18.5)
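
If you manage packages with conda instead of pip, XGBoost is also available from the conda-forge channel:

conda install -c conda-forge xgboost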

Attribute Information:

1) Sample code number: ID number
2) Clump Thickness: 1 - 10
3) Uniformity of Cell Size: 1 - 10
4) Uniformity of Cell Shape: 1 - 10
5) Marginal Adhesion: 1 - 10
6) Single Epithelial Cell Size: 1 - 10
7) Bare Nuclei: 1 - 10
8) Bland Chromatin: 1 - 10
9) Normal Nucleoli: 1 - 10
10) Mitoses: 1 - 10
11) Class: 2 = benign, 4 = malignant

Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd

Importing the Dataset

In [3]:
url = "https://DataScienceSchools.github.io/Machine_Learning/Sklearn/Case_Study/Classification/BreastCancerWisconsin/BreastCancer.csv"

# Load the CSV directly from the URL and preview the first five rows
dataset = pd.read_csv(url)

dataset.head()
Out[3]:
   Sample code number  Clump Thickness  Uniformity of Cell Size  \
0             1000025                5                        1
1             1002945                5                        4
2             1015425                3                        1
3             1016277                6                        8
4             1017023                4                        1

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
0                         1                  1                            2
1                         4                  5                            7
2                         1                  1                            2
3                         8                  1                            3
4                         1                  3                            2

   Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0            1                3                1        1      2
1           10                3                2        1      2
2            2                3                1        1      2
3            4                3                7        1      2
4            1                3                1        1      2
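
In the original UCI release of this dataset, the Bare Nuclei column contains a few missing values encoded as "?"; whether the hosted CSV has already been cleaned is an assumption worth checking. A quick sanity check:

# Column dtypes, non-null counts, and the class balance
dataset.info()

dataset['Class'].value_counts()   # 2 = benign, 4 = malignant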

Declaring the Dependent and Independent Variables

In [4]:
# Features: every column except the last (note this keeps the Sample code number column)
X = dataset.iloc[:, :-1].values

# Target: the Class column (2 = benign, 4 = malignant)
y = dataset.iloc[:, -1].values
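
Because Sample code number is an identifier rather than a measurement, a common alternative drops it and maps the 2/4 labels to 0/1 (which XGBoost releases from 1.6 onwards require in any case). This is a sketch; the results below were produced with the slicing above:

# Drop the ID column and encode Class as 0/1 (1 = malignant)
X = dataset.iloc[:, 1:-1].values
y = (dataset.iloc[:, -1].values == 4).astype(int)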

Splitting the Dataset into the Training Set and Test Set

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
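
Benign cases outnumber malignant ones roughly two to one in this dataset, so passing stratify=y keeps the class proportions the same in both splits. The results below use the unstratified split above:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 0, stratify = y)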

Training the XGBoost Classification Model

In [6]:
from xgboost import XGBClassifier

# Instantiate the classifier with its default hyperparameters (echoed in Out[6] below)
model = XGBClassifier()

# Fit on the training set; XGBoost 1.2.1 encodes the 2/4 labels internally
model.fit(X_train, y_train)
Out[6]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
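
The internal label encoding used here was removed in later releases: from XGBoost 1.6 onwards, fitting raw 2/4 labels raises a ValueError, so on a current version the target must be mapped to 0/1 first (see the sketch after the variable declarations above):

# On XGBoost >= 1.6, binary labels must already be 0/1
model = XGBClassifier(random_state = 0)
model.fit(X_train, (y_train == 4).astype(int))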

Predicting the Test Set Results

In [7]:
# Hard class predictions (2 = benign, 4 = malignant) for the test set
y_pred = model.predict(X_test)
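
Beyond hard labels, the model can also return class-membership probabilities, which are useful for ranking cases or choosing a custom decision threshold:

# Probability of each class per test sample; column order follows model.classes_
y_proba = model.predict_proba(X_test)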

Confusion Matrix

In [8]:
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows of the confusion matrix are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: ", accuracy, "\n\n Confusion Matrix:\n\n ", cm)
Accuracy is:  0.9473684210526315 

 Confusion Matrix:

  [[103   4]
 [  5  59]]
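
Accuracy alone hides how the two error types (missed malignancies vs. false alarms) are distributed; per-class precision and recall make them explicit. For example:

from sklearn.metrics import classification_report

# Classes are reported in sorted order, so 2 (benign) precedes 4 (malignant)
print(classification_report(y_test, y_pred, target_names = ['benign', 'malignant']))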

K-Fold Cross Validation

In [9]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation on the training set gives a more stable accuracy estimate
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 96.29 %
Standard Deviation: 2.84 %
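
The cross-validated score above uses the default hyperparameters; a natural next step is tuning a few of them. A minimal sketch (the grid values are illustrative, not recommendations tuned for this dataset):

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 6],
              'learning_rate': [0.1, 0.3],
              'n_estimators': [100, 200]}

grid = GridSearchCV(estimator = XGBClassifier(), param_grid = param_grid,
                    cv = 10, scoring = 'accuracy')

grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)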