Case Study (University Admission):

StatsModels (Logistic Regression) - Logit Regression Results

Overview

- Importing the relevant libraries

- Loading data

- Dummy Variables

- Declaring the dependent and independent variables

- Adding a Constant

- Creating a Logit Regression 

- Fitting the Model

- Logit Regression Summary

- Finding the odds 

Importing the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading data

In [2]:
url = 'https://datascienceschools.github.io/Machine_Learning/StatsModel/admission.csv'

df = pd.read_csv(url)

df.head()
Out[2]:
SAT Admitted Gender
0 1363 No Male
1 1792 Yes Female
2 1954 Yes Female
3 1653 No Male
4 1593 No Male

Dummy Variables

- Replace No entries with 0 and Yes entries with 1 in Admitted; replace Female with 1 and Male with 0 in Gender
In [3]:
data = df.copy()

data['Admitted'] = data['Admitted'].map({'Yes': 1, 'No': 0})
data['Gender'] = data['Gender'].map({'Female': 1, 'Male': 0})

data.head()
Out[3]:
SAT Admitted Gender
0 1363 0 0
1 1792 1 1
2 1954 1 1
3 1653 0 0
4 1593 0 0
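As a side note, the same 0/1 encoding can be obtained without map() by comparing against the positive label and casting to int. A small sketch on a made-up frame with the same labels as the admission data:

```python
import pandas as pd

# Hypothetical frame with the same labels as the admission data
df = pd.DataFrame({'Admitted': ['No', 'Yes', 'No'],
                   'Gender': ['Male', 'Female', 'Male']})

# Comparing against the positive label and casting to int gives the same 0/1 codes
admitted = (df['Admitted'] == 'Yes').astype(int)
gender = (df['Gender'] == 'Female').astype(int)

print(admitted.tolist())  # [0, 1, 0]
print(gender.tolist())    # [0, 1, 0]
```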

Declaring the dependent and independent variables

In [4]:
y = data['Admitted']

x = data[['SAT','Gender']]

Adding a Constant

In [5]:
x_constant = sm.add_constant(x)

Creating a Logit Regression

In [6]:
model = sm.Logit(y,x_constant)

Fitting the Model

In [7]:
results = model.fit()
Optimization terminated successfully.
         Current function value: 0.120117
         Iterations 10
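The reported "Current function value" is the average negative log-likelihood per observation. A quick sanity check: multiplying it by the 168 observations (shown in the summary below) and negating recovers the total Log-Likelihood reported there.

```python
# "Current function value" is the average negative log-likelihood per observation.
# Multiplying by the number of observations (168, per the summary) and negating
# recovers the total Log-Likelihood.
avg_neg_ll = 0.120117  # "Current function value" from the fit output
n_obs = 168            # No. Observations from the summary

log_likelihood = -avg_neg_ll * n_obs
print(round(log_likelihood, 3))  # -20.18
```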

Logit Regression Summary

- MLE (Maximum Likelihood Estimation): the method used to estimate the coefficients

- Log-Likelihood: the value of the log likelihood is almost always negative

- The bigger the log likelihood, the better the model fits the data

- LL-Null: the log likelihood of a model with no independent variables (only a constant)

- LLR p-value: compares the log likelihood of our model with LL-Null to test whether the model has any explanatory power, i.e. whether it is statistically different from a model with no predictors

- Here the LLR p-value is essentially 0.00, so our model is significant

- Pseudo R-squ.: McFadden's pseudo R-squared; a good value is typically somewhere between 0.2 and 0.4
In [8]:
results.summary()
Out[8]:
Logit Regression Results
Dep. Variable: Admitted No. Observations: 168
Model: Logit Df Residuals: 165
Method: MLE Df Model: 2
Date: Mon, 05 Oct 2020 Pseudo R-squ.: 0.8249
Time: 04:52:03 Log-Likelihood: -20.180
converged: True LL-Null: -115.26
Covariance Type: nonrobust LLR p-value: 5.118e-42
coef std err z P>|z| [0.025 0.975]
const -68.3489 16.454 -4.154 0.000 -100.598 -36.100
SAT 0.0406 0.010 4.129 0.000 0.021 0.060
Gender 1.9449 0.846 2.299 0.022 0.287 3.603


Possibly complete quasi-separation: A fraction 0.27 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
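The pseudo R-squared in the table is McFadden's: 1 − LL / LL-Null. Recomputing it from the two log-likelihoods in the summary confirms the reported value (the fitted results object also exposes these directly as results.llf, results.llnull, and results.prsquared):

```python
# McFadden's pseudo R-squared: 1 - LL / LL-Null, using the two
# log-likelihoods from the summary table above
ll = -20.180       # Log-Likelihood
ll_null = -115.26  # LL-Null

pseudo_r2 = 1 - ll / ll_null
print(round(pseudo_r2, 4))  # 0.8249
```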

Finding the odds

- π : the probability of the event occurring

- 1 − π : the probability of the event not occurring

- odds = π / (1 − π)

- coef
    - SAT       0.0406
    - Gender    1.9449

- Gender coding: Female = 1, Male = 0

- Given the same SAT score,

        -> the odds of admission for a female are about 7 times the odds for a male (the exponentiated coefficient is an odds ratio, not "7 times more likely")
In [16]:
np.exp(1.9449)
Out[16]:
6.992932526814459
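The same exponentiation applies to every coefficient; np.exp(results.params) would do this directly on the fitted model. A small sketch using the coefficients from the summary above:

```python
import numpy as np

# Coefficients copied from the summary table; np.exp(results.params)
# would exponentiate all of them at once on the fitted model
coefs = {'const': -68.3489, 'SAT': 0.0406, 'Gender': 1.9449}
odds_ratios = {name: np.exp(b) for name, b in coefs.items()}

print(round(odds_ratios['Gender'], 2))  # 6.99   -> odds of admission ~7x for females
print(round(odds_ratios['SAT'], 4))     # 1.0414 -> each extra SAT point scales the odds by ~4%
```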