- Importing the relevant libraries
- Loading data
- Dummy Variables
- Declaring the dependent and independent variables
- Adding a Constant
- Creating a Logit Regression
- Fitting the Model
- Logit Regression Summary
- Finding the odds
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = 'https://datascienceschools.github.io/Machine_Learning/StatsModel/admission.csv'
df = pd.read_csv(url)
df.head()
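- Before recoding, it helps to confirm the column names and raw category labels; a quick check, assuming the file loads with the SAT, Admitted and Gender columns used below
print(df.shape)
print(df.dtypes)
print(df['Admitted'].unique(), df['Gender'].unique())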
- Replace all No entries with 0 and all Yes entries with 1 in Admitted; likewise map Female to 1 and Male to 0 in Gender
data = df.copy()
data['Admitted'] = data['Admitted'].map({'Yes': 1, 'No': 0})
data['Gender'] = data['Gender'].map({'Female': 1, 'Male': 0})
data.head()
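- .map() silently turns any value not listed in the dictionary into NaN, so it is worth checking that nothing was lost after recoding; a minimal sanity check
print(data[['Admitted', 'Gender']].isna().sum())
print(data['Admitted'].value_counts())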
y = data['Admitted']
x = data[['SAT','Gender']]
x_constant = sm.add_constant(x)
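- add_constant inserts a column of ones named const, which lets the model estimate an intercept; a quick look at the resulting design matrix
x_constant.head()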
model = sm.Logit(y,x_constant)
results = model.fit()
- MLE (Maximum Likelihood Estimation)
- log likelihood: the value of the log likelihood is almost always negative
- The bigger the log likelihood (i.e. the closer it is to zero), the better the model fits the data
- LL-Null is the log likelihood of a model which has no independent variables
- LLR p-value
- To check whether your model has any explanatory power, you compare its log likelihood with LL-Null, i.e. you test whether the model is significant.
- The LLR p-value does exactly that: it measures whether our model is statistically different from LL-Null, a "useless" model, without us having to carry out the test ourselves.
- The summary gives us this p-value, and that is all we need: here it is very low (around 0.000), so our model is significant.
- Pseudo R-squ.: A good Pseudo R-Squared is somewhere between 0.2 and 0.4
results.summary()
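- The statistics discussed above can also be read directly off the fitted results object; a minimal sketch using attributes of the statsmodels Logit results
print('Log-Likelihood:', results.llf)
print('LL-Null:       ', results.llnull)
print('LLR p-value:   ', results.llr_pvalue)
print('Pseudo R-squ.: ', results.prsquared)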
- π : The probability of an event occurring
- 1-π : the probability of the event not occurring
- π / (1 - π) -> odds
- coef (from the summary table)
  - SAT: 0.0406
  - Gender: 1.9449
- Gender is coded Female: 1, Male: 0
- Given the same SAT score, a female has roughly 7 times higher odds of being admitted than a male (exp(1.9449) ≈ 7)
np.exp(1.9449)
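- As a cross-check, exponentiating every coefficient gives its odds ratio, and the same factor of roughly 7 appears when comparing the predicted odds for a female and a male with an identical SAT score; a minimal sketch (the score 1700 is a hypothetical value chosen only for illustration, and the column order must match x_constant)
print(np.exp(results.params))                                                 # odds ratio for each coefficient; Gender is ~7
new_data = pd.DataFrame({'const': 1, 'SAT': [1700, 1700], 'Gender': [1, 0]})  # female vs male at the same hypothetical SAT score
probs = results.predict(new_data)                                             # predicted admission probabilities
odds = probs / (1 - probs)                                                    # convert probabilities to odds
print(odds[0] / odds[1])                                                      # female odds / male odds ≈ exp(1.9449) ≈ 7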