CatBoost Classification

Breast Cancer Wisconsin

Source

Installing CatBoost

In [1]:
!pip install catboost

Attribute Information:

1) Sample code number (ID number)

2-10)

Nine cytological features are recorded for each sample, each graded on an integer scale from 1 to 10:

a) Clump Thickness
b) Uniformity of Cell Size
c) Uniformity of Cell Shape
d) Marginal Adhesion
e) Single Epithelial Cell Size
f) Bare Nuclei
g) Bland Chromatin
h) Normal Nucleoli
i) Mitoses

11) Class (2 = benign, 4 = malignant)

Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd

Importing the Dataset

In [3]:
url = "https://DataScienceSchools.github.io/Machine_Learning/Sklearn/Case_Study/Classification/BreastCancerWisconsin/BreastCancer.csv"

dataset = pd.read_csv(url)

dataset.head()
Out[3]:
   Sample code number  Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0             1000025                5                        1                         1                  1                            2            1                3                1        1      2
1             1002945                5                        4                         4                  5                            7           10                3                2        1      2
2             1015425                3                        1                         1                  1                            2            2                3                1        1      2
3             1016277                6                        8                         8                  1                            3            4                3                7        1      2
4             1017023                4                        1                         1                  3                            2            1                3                1        1      2

Declaring the Dependent & the Independent Variables

In [4]:
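# Features: every column except the last. Note that ":-1" also keeps the
# Sample code number (an ID) as a feature; "1:-1" would drop it.
X = dataset.iloc[:, :-1].values

# Target: the last column (Class).
y = dataset.iloc[:, -1].values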
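
The slice `[:, :-1]` keeps every column except the last, so the Sample code number (a patient identifier) also ends up in X. Below is a minimal sketch of the alternative selection. It uses only the five rows shown in Out[3] and the standard library, so it runs without downloading the CSV. The names `sample_csv`, `feature_names`, `features` and `target` are illustrative, and dropping the ID column is a suggestion, not what the notebook does.

```python
import csv
import io

# The five rows printed by dataset.head(), re-typed as CSV text.
sample_csv = """Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
"""

reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)
rows = [[int(value) for value in row] for row in reader]

# Equivalent of dataset.iloc[:, 1:-1]: skip the ID column and the label column.
feature_names = header[1:-1]
features = [row[1:-1] for row in rows]

# Equivalent of dataset.iloc[:, -1]: the Class column.
target = [row[-1] for row in rows]

print(feature_names)
print(features[0], target[0])
```

With pandas the same selection would be `dataset.iloc[:, 1:-1].values`. The results reported in this notebook were obtained with the ID column included.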

Splitting the Dataset into the Training Set and Test Set

In [5]:
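from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)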
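
To see how `test_size = 0.25` turns into concrete set sizes, here is a small standard-library sketch. It assumes the test count is rounded up and the remaining rows go to training, which is how scikit-learn appears to handle a fractional `test_size`. The row count of 683 is also an assumption, since the notebook never prints `dataset.shape`. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: the test count is rounded up and the
    remaining rows go to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 683 rows is an assumed dataset size, not a value shown in the notebook.
n_train, n_test = split_sizes(683, test_size=0.25)
print(n_train, n_test)
```

A test set of 171 rows agrees with the confusion matrix reported further down, whose four cells sum to 171.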

Training the CatBoost Classification Model

In [ ]:
from catboost import CatBoostClassifier

# Train with CatBoost's default hyperparameters.
model = CatBoostClassifier()

model.fit(X_train, y_train)

Predicting the Test Set Results

In [7]:
y_pred = model.predict(X_test)

Confusion Matrix

In [8]:
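from sklearn.metrics import confusion_matrix, accuracy_score

# Rows of the matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Fraction of test samples whose predicted class matches the true class.
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: ", accuracy, "\n\n Confusion Matrix:\n\n ", cm)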
Accuracy is:  0.9473684210526315 

 Confusion Matrix:

  [[103   4]
 [  5  59]]
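
The confusion matrix carries more information than the accuracy alone. The sketch below recomputes the accuracy from the four cells and derives sensitivity, specificity and precision. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels, labels sorted in ascending order) and that class 4 is the malignant, positive class, as in the UCI description of this dataset. The variable names are illustrative.

```python
# Cells copied from the confusion matrix printed above.
cm = [[103, 4],
      [5, 59]]

# Assumed layout: row = true label, column = predicted label,
# label order [2 (benign), 4 (malignant)], malignant = positive class.
tn, fp = cm[0]
fn, tp = cm[1]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of malignant samples that were detected
specificity = tn / (tn + fp)   # share of benign samples predicted as benign
precision = tp / (tp + fp)     # share of malignant predictions that were correct

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

Under these assumptions, the 5 in the lower-left cell counts malignant samples that were predicted as benign, and the 4 in the upper-right cell counts benign samples predicted as malignant.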

K-Fold Cross Validation

In [10]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation on the training set: the model is refit on 9 folds
# and scored on the held-out fold each time.
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 97.26 %
Standard Deviation: 2.18 %
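
For intuition about what `cv = 10` does, here is a standard-library sketch of plain k-fold splitting and of the mean and standard deviation summary. It is a simplification: for a classifier, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not reproduce. The functions `kfold_indices` and `summarize` are hypothetical, and the fold scores passed to `summarize` are made-up illustrative numbers, not results from this notebook.

```python
import statistics

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def summarize(scores):
    """Return the mean and the population standard deviation of the scores."""
    return statistics.mean(scores), statistics.pstdev(scores)

# Small example: 10 samples split into 3 folds.
folds = list(kfold_indices(n_samples=10, n_splits=3))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)

# Illustrative scores only; not taken from the notebook's run.
mean_acc, std_acc = summarize([0.9, 1.0, 0.8])
print("Accuracy: {:.2f} %".format(mean_acc * 100))
print("Standard Deviation: {:.2f} %".format(std_acc * 100))
```

Each sample lands in exactly one test fold, so every row of the training set is used for validation once. The population standard deviation is used here because that is what NumPy's `std()` computes by default.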