SKLearn (Logistic Regression)¶

- Logistic Regression (Logistic Model or Logit Model):

    -> a statistical model

    -> uses a logistic function to model a binary dependent variable

    -> models the probability of a certain class or event existing

        - such as pass/fail, win/lose, alive/dead or healthy/sick


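    -> The log-odds (logit) are a linear function of x: log(p / (1 - p)) = b0 + b1 * x

        - Applying the sigmoid function to the linear equation gives the logistic function:
          p = 1 / (1 + e^-(b0 + b1 * x))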


    -> measures the relationship between

        - the categorical dependent variable and independent variables 
        - by estimating probabilities using a logistic function
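
The link between the log-odds equation and the sigmoid can be checked numerically. The following is a minimal sketch; the coefficients `b0` and `b1` and the inputs `x` are made-up numbers for illustration, not values fitted on the ads dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

z = b0 + b1 * x   # linear equation
p = sigmoid(z)    # predicted probabilities, all strictly between 0 and 1

print(p)
print(np.allclose(logit(p), z))  # applying the logit recovers the linear term
```

A linear term of exactly 0 maps to a probability of 0.5, which is why 0.5 is the natural default threshold.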


- Analyze customer behavior 

   -> by predicting which customers will click on the ads based on customer features

           - such as salary, country, time spent on social media, ...

   -> Output or the predicted probability (ŷ) ranges from 0 to 1 

   -> Specify a threshold -> 0.5      
   -> If the predicted probability (ŷ) > 0.5 => Customer will click (1/Yes)
   -> If the predicted probability (ŷ) <= 0.5 => Customer will not click (0/No)
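
The thresholding rule can be sketched in a few lines. The probabilities below are hypothetical values for five customers, and treating a probability of exactly 0.5 as "will not click" is a choice made in this sketch.

```python
import numpy as np

# Hypothetical predicted probabilities for five customers
y_hat = np.array([0.10, 0.49, 0.50, 0.51, 0.93])

threshold = 0.5
# 1 = will click, 0 = will not click
labels = (y_hat > threshold).astype(int)

print(labels)  # [0 0 0 1 1]
```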

Overview¶

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- Splitting the dataset into the Training set and Test set

- Feature Scaling

- Training the Logistic Regression Model

- Predicting the Test Set Results

- Confusion Matrix

- Classification Report

- k-Fold Cross Validation

- Visualising the Training Set Results

- Visualising the Test Set Results

Importing the Relevant Libraries¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading the Data¶

url = "https://DataScienceSchools.github.io/Machine_Learning/Classification_Models_Intuition/Social_Network_Ads.csv"

df = pd.read_csv(url)

df.head()

Declaring the Dependent & the Independent Variables¶

X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values

Splitting the Dataset into the Training Set and Test Set¶

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

Feature Scaling¶

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
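
Standardisation subtracts each column's mean and divides by its standard deviation. The statistics come from the training set only, so the test set is transformed with the training mean and standard deviation. A minimal sketch on made-up numbers (the small arrays and the `_demo` names are illustrative, not the ads data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up [age, salary] rows for illustration
X_train_demo = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 90000.0]])
X_test_demo = np.array([[25.0, 50000.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean/std from the training rows
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses the training mean/std

# The same computation by hand (StandardScaler uses the population std, ddof=0)
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)
manual_test = (X_test_demo - train_mean) / train_std

print(np.allclose(X_test_scaled, manual_test))
```

Calling `fit_transform` on the test set instead would let test-set statistics leak into the preprocessing, which is why only `transform` is used there.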

Training the Logistic Regression Model¶

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state = 0)

model.fit(X_train, y_train)

LogisticRegression(random_state=0)

Predicting the Test Set Results¶

y_pred = model.predict(X_test)
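
`predict` returns class labels; the probabilities behind them are available from `predict_proba`. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the ads data (an assumption made so the block runs on its own) and checks that, for a binary model, the labels agree with thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature binary dataset, a stand-in for the ads data
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

proba_click = clf.predict_proba(X_demo)[:, 1]  # P(class 1) for every row
labels_demo = clf.predict(X_demo)              # hard 0/1 labels

# Thresholding the probability at 0.5 reproduces the predicted labels
print(np.array_equal(labels_demo, (proba_click > 0.5).astype(int)))
```

Working with `predict_proba` directly makes it possible to pick a threshold other than 0.5 when one kind of error is more costly than the other.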

Confusion Matrix¶

   - A confusion matrix is used to describe the performance of a classification model

       - TP (True Positive): Model predicted True and it is actually True
       - TN (True Negative): Model predicted False and it is actually False
       - FP (False Positive): Model predicted True but it is actually False

           - Type I Error
           - Predicting people have cancer, but actually they do not have cancer
           - Predicting an earthquake will happen, but it actually does not happen

       - FN (False Negative): Model predicted False but it is actually True 

           - Type II Error -> in cases like these it can be life-threatening, so it is usually the costlier error
           - Predicting people do not have cancer, but actually they have it
           - Predicting an earthquake will not happen, but it actually happens

   - Accuracy = Correct/Total = (TP + TN) / (TP + TN + FP + FN)
   - Error Rate = Wrong/Total = (FP + FN) / (TP + TN + FP + FN)
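
The four counts and the two ratios can be read straight off a confusion matrix. A minimal sketch on made-up labels (the `_demo` arrays are illustrative, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary 0/1 labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

acc_demo = (tp + tn) / (tp + tn + fp + fn)
err_demo = (fp + fn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)      # 3 1 1 3
print(acc_demo, err_demo)  # 0.75 0.25
print(np.isclose(acc_demo, accuracy_score(y_true_demo, y_pred_demo)))
```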


from IPython.display import Image
Image(url= "https://DataScienceSchools.github.io/Machine_Learning/Classification_Models_Intuition/confusionmatrix.jpg", width=400)

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy is: {:.2f} %".format(accuracy*100))

sns.heatmap(cm, annot=True, fmt="d")

plt.show()

Accuracy is: 92.50 %

Classification Report¶

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95        58
           1       0.94      0.77      0.85        22

    accuracy                           0.93        80
   macro avg       0.93      0.88      0.90        80
weighted avg       0.93      0.93      0.92        80
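
The numbers in the report follow from the confusion-matrix counts. The counts used below are not printed by the notebook; they are inferred here from the supports, precisions and recalls in the report (TN = 57, FP = 1, FN = 5, TP = 17), so treat them as an assumption. A minimal sketch of the formulas for class 1:

```python
# Counts inferred from the report above (an assumption, not notebook output)
tn, fp, fn, tp = 57, 1, 5, 17

precision = tp / (tp + fp)  # of the predicted positives, how many were real
recall = tp / (tp + fn)     # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
# precision=0.94 recall=0.77 f1=0.85 accuracy=0.925
```

The recall of 0.77 for class 1 means that about a quarter of the actual positives were missed, something the overall accuracy alone does not show.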

k-Fold Cross Validation¶

   - Accuracy on a single test set can be misleading, because it depends on which rows happen to land in the test split

   - A solution to this problem is a procedure called cross-validation

   - k-fold cross-validation is used to evaluate machine learning models

   - How is the performance measure calculated by k-fold cross-validation? 

        1. The training set is split into k smaller sets (folds)

        2. For each fold, the model is trained on the other k - 1 folds

        3. The held-out fold is used as a validation set to compute the accuracy 

        4. Then the average of all k accuracies is calculated & reported
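
The procedure can be sketched by hand: each fold is held out once while the model trains on the remaining folds. The block below uses a synthetic dataset from `make_classification` as a stand-in for the training set (an assumption so the code runs on its own) and `StratifiedKFold`, which is the splitter `cross_val_score` uses by default for a classifier when `cv` is an integer. The manual loop should therefore reproduce the scores from `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the (scaled) training set
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
clf = LogisticRegression(random_state=0)

k = 10
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=k).split(X_demo, y_demo):
    fold_model = clone(clf)                                # fresh, unfitted copy per fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])   # train on the other k - 1 folds
    # accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(clf, X_demo, y_demo, cv=k)

print("Mean accuracy: {:.2f} %".format(manual_scores.mean() * 100))
print(np.allclose(manual_scores, library_scores))
```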

from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)

print("Accuracy: {:.2f} %".format(accuracies.mean()*100))

print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 82.50 %
Standard Deviation: 10.29 %

Visualising the Training Set Results¶

from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

plt.contourf(X1, X2, model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('magenta', 'blue')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('magenta', 'blue'))(i), label = j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Visualising the Test Set Results¶

from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test

X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

plt.contourf(X1, X2, model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('magenta', 'blue')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('magenta', 'blue'))(i), label = j)
    
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

	Age	EstimatedSalary
0	19	19000
1	35	20000
2	26	43000
3	27	57000
4	19	76000

Case Study (Social Network Ads) :¶
