Case Study (Car Price):

Linear Regression - Multicollinearity - Dummy Variables

Overview

- Importing the relevant libraries

- Loading data

- Creating Dummy Variables (drop_first = True)

    - VIF (All Variables)

    - VIF (dropping 'log_price' & 'Registration_yes')

- Creating Dummy Variables (default: drop_first = False)

    - VIF (All Variables)

Importing the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

Loading data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted2.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
Brand Body Mileage EngineV Engine Type Registration log_price
0 BMW sedan 277 2.0 Petrol yes 8.342840
1 Mercedes-Benz van 427 2.9 Diesel yes 8.974618
2 Mercedes-Benz sedan 358 5.0 Gas yes 9.495519
3 Audi crossover 240 4.2 Petrol yes 10.043249
4 Toyota crossover 120 2.0 Petrol yes 9.814656
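Before creating dummies, it can help to confirm which columns are non-numeric (a quick check, not part of the original notebook) - these are the columns pd.get_dummies will encode:

In [ ]:
# Sketch: list the column dtypes
# object-dtype columns (Brand, Body, Engine Type, Registration) are the
# ones that get_dummies will turn into dummy variables
print(df.dtypes)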

Creating Dummy Variables (drop_first = True)

- It is extremely important that we drop one of the dummies for each categorical variable -> drop_first=True

- drop_first=True -> the first category of each variable is removed, e.g. Brand_Audi (a quick check after the table below lists all the dropped columns)
In [14]:
df1 = pd.get_dummies(df, drop_first=True)

df1.head()
Out[14]:
Mileage EngineV log_price Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen Body_hatch Body_other Body_sedan Body_vagon Body_van Engine Type_Gas Engine Type_Other Engine Type_Petrol Registration_yes
0 277 2.0 8.342840 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 427 2.9 8.974618 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1
2 358 5.0 9.495519 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1
3 240 4.2 10.043249 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
4 120 2.0 9.814656 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1
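As a quick sanity check (a sketch, not part of the original notebook), we can compare the full dummy set against df1 to see exactly which column was dropped from each categorical:

In [ ]:
# Sketch: list the dummy columns removed by drop_first=True
# (the alphabetically first level of each categorical variable)
print(set(pd.get_dummies(df).columns) - set(df1.columns))
# expected: {'Brand_Audi', 'Body_crossover', 'Engine Type_Diesel', 'Registration_no'}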

VIF (All Variables)

- Obviously, 'log_price' has a very high VIF

- It is, by construction, **linearly correlated** with all the other variables

- We are using a linear regression to determine 'log_price' given the values of the independent variables - a linear relationship is exactly what we expect!

- However, to actually assess multicollinearity among the predictors, we have to drop 'log_price'

- The multicollinearity assumption refers only to the idea that the **independent variables** should not be collinear.
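For reference, VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing variable i on all the other variables. Below is a minimal sketch of what statsmodels' variance_inflation_factor computes, checked by hand for 'Mileage' (the auxiliary regression is fit without a constant, matching the statsmodels implementation):

In [ ]:
# Sketch: reproduce the VIF of 'Mileage' by hand
# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i
# on the remaining columns (no constant added, as in statsmodels)
others = df1.drop('Mileage', axis=1).astype(float)  # cast in case get_dummies returned bools
r_squared = sm.OLS(df1['Mileage'].astype(float), others).fit().rsquared
print(1 / (1 - r_squared))  # should match the 'Mileage' VIF in the table below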
In [17]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = df1

vif = pd.DataFrame()

vif["Features"] = variables.columns

vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif
Out[17]:
Features VIF
0 log_price 41.981260
1 Mileage 4.460434
2 EngineV 13.445639
3 Brand_BMW 2.603990
4 Brand_Mercedes-Benz 3.084356
5 Brand_Mitsubishi 1.830297
6 Brand_Renault 2.281498
7 Brand_Toyota 2.406546
8 Brand_Volkswagen 3.312814
9 Body_hatch 1.583516
10 Body_other 1.597487
11 Body_sedan 3.455354
12 Body_vagon 1.810633
13 Body_van 2.579105
14 Engine Type_Gas 1.711589
15 Engine Type_Other 1.082223
16 Engine Type_Petrol 2.506715
17 Registration_yes 15.167906

VIF (dropping 'log_price' & 'Registration_yes')

    - The VIFs that are particularly high are those of 'EngineV' and 'Registration_yes'

    - In the case of registration, the main issue is that most values are 'yes' (see the quick check below)

    - All independent variables are pretty good at determining 'log_price'

    - If 'Registration' is almost always 'yes', then when we predict 'log_price' we are effectively predicting registration, too (it is going to be 'yes')

    - Whenever a single category is so predominant, we may simply drop the variable

    - Note that it would most probably be insignificant anyway
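A quick way to see how predominant the 'yes' category is (a sketch; the exact proportions depend on the data and are not shown in the original notebook):

In [ ]:
# Sketch: inspect the class balance of 'Registration'
# a heavily predominant category carries almost no information
print(df['Registration'].value_counts(normalize=True))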
In [18]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

#variables = df1.drop(['log_price'],axis=1)

variables = df1.drop(['log_price', 'Registration_yes'],axis=1)

vif = pd.DataFrame()

vif["Features"] = variables.columns

vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif
Out[18]:
Features VIF
0 Mileage 4.419925
1 EngineV 6.215019
2 Brand_BMW 2.173765
3 Brand_Mercedes-Benz 2.674018
4 Brand_Mitsubishi 1.466501
5 Brand_Renault 1.795902
6 Brand_Toyota 1.943698
7 Brand_Volkswagen 2.427961
8 Body_hatch 1.434468
9 Body_other 1.494911
10 Body_sedan 3.019978
11 Body_vagon 1.554847
12 Body_van 2.316951
13 Engine Type_Gas 1.656714
14 Engine Type_Other 1.080967
15 Engine Type_Petrol 2.415590

Creating Dummy Variables (default: drop_first = False)

- It is extremely important that we drop one of the dummies -> drop_first=True

- Let's see what happens if we do not drop the first dummy of each variable
In [21]:
df2 = pd.get_dummies(df)

df2.head()
Out[21]:
Mileage EngineV log_price Brand_Audi Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen ... Body_other Body_sedan Body_vagon Body_van Engine Type_Diesel Engine Type_Gas Engine Type_Other Engine Type_Petrol Registration_no Registration_yes
0 277 2.0 8.342840 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 1
1 427 2.9 8.974618 0 0 1 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 1
2 358 5.0 9.495519 0 0 1 0 0 0 0 ... 0 1 0 0 0 1 0 0 0 1
3 240 4.2 10.043249 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
4 120 2.0 9.814656 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 1

5 rows × 22 columns

VIF (All Variables)

- Most VIFs are equal to inf (positive infinity)

- We even got a warning: 

        -> RuntimeWarning: divide by zero encountered in double_scalars

- Reason:

    -> When a car is an 'Audi', all other brand dummies are 0

    -> When a car is not an 'Audi', at least one of them will be 1

    -> Including all dummies therefore produces perfect multicollinearity: the dummies of each category always sum to 1 (see the sketch below)

    -> If we ran a regression including all dummies

            - the coefficients would not be uniquely determined and would be completely off the mark
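We can verify the perfect multicollinearity directly (a sketch, not in the original notebook): with drop_first=False, the dummies of each categorical variable sum to exactly 1 in every row, so each dummy is a perfect linear combination of the others in its group.

In [ ]:
# Sketch: within each categorical, the full set of dummies sums to 1
# in every row -> perfect multicollinearity
brand_cols = [c for c in df2.columns if c.startswith('Brand_')]
print(df2[brand_cols].sum(axis=1).unique())  # expected: [1]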
In [20]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = df2

vif = pd.DataFrame()

vif["Features"] = variables.columns

vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif
/home/bahar/anaconda3/lib/python3.7/site-packages/statsmodels/stats/outliers_influence.py:193: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)
Out[20]:
Features VIF
0 Mileage 2.365473
1 EngineV 1.812970
2 log_price 4.018878
3 Brand_Audi inf
4 Brand_BMW inf
5 Brand_Mercedes-Benz inf
6 Brand_Mitsubishi inf
7 Brand_Renault inf
8 Brand_Toyota inf
9 Brand_Volkswagen inf
10 Body_crossover inf
11 Body_hatch inf
12 Body_other inf
13 Body_sedan inf
14 Body_vagon inf
15 Body_van inf
16 Engine Type_Diesel inf
17 Engine Type_Gas inf
18 Engine Type_Other inf
19 Engine Type_Petrol inf
20 Registration_no inf
21 Registration_yes inf