- Importing the relevant libraries
- Loading data
- Dummy Variables (drop_first = True)
- VIF (All Variables)
- VIF (dropping 'log_price' & 'Registration_yes')
- Creating Dummy Variables (default: drop_first = False)
- VIF (All Variables)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted2.csv"
df = pd.read_csv(url)
df.head()
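- Before creating the dummies, it can help to check which columns are categorical (a minimal sketch; apart from 'Brand' and 'Registration', which are implied by the dummy names later on, the exact column names depend on the dataset)

# Object-dtype columns are the ones pd.get_dummies will encode;
# the numeric columns (including the target 'log_price') are left untouched
print(df.dtypes)
print(df.select_dtypes(include='object').nunique())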
- It is extremely important that we drop one of the dummies for each categorical variable -> drop_first=True
- With drop_first=True, the first category of each variable is removed -> 'Brand_Audi' will be removed
df1 = pd.get_dummies(df, drop_first=True)
df1.head()
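- To confirm which dummies were removed, we can compare against the default encoding (a minimal sketch; the dropped set should include 'Brand_Audi')

# Columns produced by the default encoding but absent after drop_first=True;
# for each categorical variable the first category is dropped, e.g. 'Brand_Audi'
all_dummies = pd.get_dummies(df)
print(set(all_dummies.columns) - set(df1.columns))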
- Obviously, 'log_price' has a very high VIF
- It is definitely **linearly correlated** with all the other variables
- This is exactly what we expect: we are using a linear regression to determine 'log_price' from the independent variables, so a linear relationship is built in
- However, to actually assess multicollinearity among the predictors, we have to drop 'log_price'
- The no-multicollinearity assumption refers only to the idea that the **independent variables** should not be collinear with each other
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every column in df1, including the target 'log_price'
variables = df1
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
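- As an intermediate step (sketched below; it corresponds to the commented-out line in the next cell), we can recompute the VIFs with only the target 'log_price' removed

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF among the predictors only: drop the target 'log_price'
# (.astype(float) guards against boolean dummy columns in newer pandas versions)
predictors = df1.drop(['log_price'], axis=1).astype(float)
vif_pred = pd.DataFrame()
vif_pred["Features"] = predictors.columns
vif_pred["VIF"] = [variance_inflation_factor(predictors.values, i)
                   for i in range(predictors.shape[1])]
vif_pred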
- The VIFs that are particularly high are those of 'EngineV' and 'Registration_yes'
- In the case of registration, the main issue is that most values are 'yes' (see the quick check after this list)
- All independent variables are pretty good at determining 'log_price'
- If 'registration' is almost always 'yes', then when we predict 'log_price' we are effectively predicting registration, too (it is going to be 'yes')
- Whenever a single category is so predominant, we may simply drop the variable
- Note that it would most probably be insignificant anyway
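- A quick check of how predominant the 'yes' category is (a minimal sketch; the original column is assumed to be named 'Registration', consistent with the 'Registration_yes' dummy)

# Share of each registration category; if 'yes' accounts for the vast majority
# of rows, the dummy carries almost no information and can safely be dropped
df['Registration'].value_counts(normalize=True)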
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Drop the target and the near-constant registration dummy before recomputing the VIFs
#variables = df1.drop(['log_price'], axis=1)
variables = df1.drop(['log_price', 'Registration_yes'], axis=1)
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
- It is extremely important that we drop one of the dummies -> drop_first=True
- Let's see what happens if we do not drop the first dummy
df2 = pd.get_dummies(df)
df2.head()
- Most VIFs are equal to inf, i.e. plus infinity
- We even got a warning:
-> RuntimeWarning: divide by zero encountered in double_scalars
- Reason:
-> When a car is an 'Audi', all the other brand dummies are 0
-> When a car is not an 'Audi', at least one of them will be 1
-> Including all the dummies therefore results in perfect multicollinearity (demonstrated below)
- If we ran a regression including all the dummies, the coefficients would be inflated and completely off the mark
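- The perfect multicollinearity is easy to demonstrate (a minimal sketch, assuming the brand dummies share the 'Brand_' prefix): the brand dummies sum to exactly 1 in every row, so each one is a perfect linear combination of the others

# For every car exactly one brand dummy is 1, so the row sums are all 1;
# any brand dummy is therefore perfectly determined by the remaining ones
brand_cols = [c for c in df2.columns if c.startswith('Brand_')]
print(df2[brand_cols].sum(axis=1).unique())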
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every column in df2, i.e. with all dummies kept
variables = df2
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif