- Importing the relevant libraries
- Loading data
- Dummy Variables (drop_first = True)
- VIF (All Variables)
- VIF (dropping 'log_price' & 'Registration_yes')
- Creating Dummy Variables (default: drop_first = False)
- VIF (All Variables)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted2.csv"
df = pd.read_csv(url)
df.head()
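- Before creating the dummies, it can help to check which columns are categorical (a minimal sketch; apart from 'Brand' and 'Registration', which are implied by the dummy names later on, the exact column names depend on the dataset)

# Object-dtype columns are the ones pd.get_dummies will encode;
# the numeric columns (including the target 'log_price') are left untouched
print(df.dtypes)
print(df.select_dtypes(include='object').nunique())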
- It is extremely important that we drop one of the dummies for each categorical variable -> drop_first=True
- With drop_first=True, the first category of each variable is removed -> 'Brand_Audi' will be removed
df1 = pd.get_dummies(df, drop_first=True)
df1.head()
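- To confirm which dummies were removed, we can compare against the default encoding (a minimal sketch; the dropped set should include 'Brand_Audi')

# Columns produced by the default encoding but absent after drop_first=True;
# for each categorical variable the first category is dropped, e.g. 'Brand_Audi'
all_dummies = pd.get_dummies(df)
print(set(all_dummies.columns) - set(df1.columns))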
- Obviously, 'log_price' has a very high VIF
- It is definitely **linearly correlated** with all the other variables
- This is exactly what we expect: we are using a linear regression to determine 'log_price' from the independent variables, so a linear relationship is built in
- However, to actually assess multicollinearity among the predictors, we have to drop 'log_price'
- The no-multicollinearity assumption refers only to the idea that the **independent variables** should not be collinear with each other
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every column in df1, including the target 'log_price'
variables = df1
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
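- As an intermediate step (sketched below; it corresponds to the commented-out line in the next cell), we can recompute the VIFs with only the target 'log_price' removed

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF among the predictors only: drop the target 'log_price'
# (.astype(float) guards against boolean dummy columns in newer pandas versions)
predictors = df1.drop(['log_price'], axis=1).astype(float)
vif_pred = pd.DataFrame()
vif_pred["Features"] = predictors.columns
vif_pred["VIF"] = [variance_inflation_factor(predictors.values, i)
                   for i in range(predictors.shape[1])]
vif_pred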
- The VIFs that are particularly high are those of 'EngineV' and 'Registration_yes'
- In the case of registration, the main issue is that most values are 'yes' (see the quick check after this list)
- All independent variables are pretty good at determining 'log_price'
- If 'registration' is almost always 'yes', then when we predict 'log_price' we are effectively predicting registration, too (it is going to be 'yes')
- Whenever a single category is so predominant, we may simply drop the variable
- Note that it would most probably be insignificant anyway
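- A quick check of how predominant the 'yes' category is (a minimal sketch; the original column is assumed to be named 'Registration', consistent with the 'Registration_yes' dummy)

# Share of each registration category; if 'yes' accounts for the vast majority
# of rows, the dummy carries almost no information and can safely be dropped
df['Registration'].value_counts(normalize=True)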
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Drop the target and the near-constant registration dummy before recomputing the VIFs
#variables = df1.drop(['log_price'], axis=1)
variables = df1.drop(['log_price', 'Registration_yes'], axis=1)
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
- It is extremely important that we drop one of the dummies -> drop_first=True
- Let's see what happens if we do not drop the first dummy
df2 = pd.get_dummies(df)
df2.head()
- Most VIFs are equal to inf, i.e. plus infinity
- We even got a warning:
-> RuntimeWarning: divide by zero encountered in double_scalars
- Reason:
-> When a car is an 'Audi', all the other brand dummies are 0
-> When a car is not an 'Audi', at least one of them will be 1
-> Including all the dummies therefore results in perfect multicollinearity (demonstrated below)
- If we ran a regression including all the dummies, the coefficients would be inflated and completely off the mark
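- The perfect multicollinearity is easy to demonstrate (a minimal sketch, assuming the brand dummies share the 'Brand_' prefix): the brand dummies sum to exactly 1 in every row, so each one is a perfect linear combination of the others

# For every car exactly one brand dummy is 1, so the row sums are all 1;
# any brand dummy is therefore perfectly determined by the remaining ones
brand_cols = [c for c in df2.columns if c.startswith('Brand_')]
print(df2[brand_cols].sum(axis=1).unique())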
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every column in df2, i.e. with all dummies kept
variables = df2
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif