- Importing the relevant libraries
- Loading data
- Checking the OLS assumptions
1. Linearity
- Checking Linearity using a Scatter Plot
- Solution: Log Transformation
- Scatter Plot after log transformation
- Dropping Price Column
2. No Endogeneity
3. Homoscedasticity
4. No Autocorrelation
5. Multicollinearity
- Multicollinearity ('Mileage', 'Year', 'EngineV')
- Dropping Year Column
- Multicollinearity ('Mileage', 'EngineV')
- Saving Cleaned Data
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()
url = "https://DataScienceSchools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted.csv"
df = pd.read_csv(url)
df.head()
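- Optionally (a quick exploratory step, not part of the original flow), descriptive statistics give an overview of the variables before checking the assumptions

df.describe(include='all')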
- The continuous variables are Price, Year, EngineV (engine volume), and Mileage
- Their patterns against Price are exponential rather than linear
- Log transformation is a common way to deal with this issue
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))  # sharey -> the three plots share 'Price' as the y-axis
ax1.scatter(df['Year'], df['Price'])
ax1.set_title('Price and Year')
ax2.scatter(df['EngineV'], df['Price'])
ax2.set_title('Price and EngineV')
ax3.scatter(df['Mileage'], df['Price'])
ax3.set_title('Price and Mileage')
plt.show()
- The log transformation is the most popular type of transformation
- It is used to transform skewed data to approximately conform to normality
- np.log(x) -> returns the natural logarithm of a number or array of numbers
df['log_price'] = np.log(df['Price'])
df.head()
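- A quick visual check (an optional sketch, not in the original flow): histograms of 'Price' and 'log_price' show the right-skewed prices becoming roughly bell-shaped after the transformation

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.hist(df['Price'], bins=50)      # heavily right-skewed
ax1.set_title('Price')
ax2.hist(df['log_price'], bins=50)  # roughly bell-shaped after the log transform
ax2.set_title('log_price')
plt.show()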
- Result: a linear pattern in all plots
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))
ax1.scatter(df['Year'], df['log_price'])
ax1.set_title('Log Price and Year')
ax2.scatter(df['EngineV'], df['log_price'])
ax2.set_title('Log Price and EngineV')
ax3.scatter(df['Mileage'], df['log_price'])
ax3.set_title('Log Price and Mileage')
plt.show()
- Price column is no longer needed
df = df.drop('Price', axis=1)
- No endogeneity: the assumption is not violated
- Homoscedasticity: the assumption is not violated (the log transformation of the target typically helps stabilize the residual variance)
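- One common visual check (a minimal sketch, not part of the original notebook; it assumes a quick OLS fit on the continuous predictors): plot residuals against fitted values and look for a roughly constant spread

X = sm.add_constant(df[['Mileage', 'Year', 'EngineV']])
results = sm.OLS(df['log_price'], X, missing='drop').fit()  # quick illustrative fit
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()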
- No autocorrelation: the assumption is not violated
- The observations do not come from time-series or panel data
- They are simply a snapshot of the current situation on a second-hand car sales website
- Each row comes from a different customer who is willing to sell their car through the platform
- Logically, there is no reason for the observations to be dependent on each other
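- For time-series data, where this assumption does matter, one could compute the Durbin-Watson statistic on the residuals (a minimal sketch reusing the illustrative OLS fit above; values near 2 indicate no autocorrelation)

from statsmodels.stats.stattools import durbin_watson

durbin_watson(results.resid)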
- Multicollinearity is a situation in which two or more explanatory variables in a multiple regression model are highly linearly related
- Variance Inflation Factor (VIF) - StatsModels
- VIF = 1 -> no multicollinearity
- 1 < VIF < 5 -> perfectly acceptable
- VIF > 5 (some use 6 or even 10 as the cutoff) -> unacceptable; there is no firm consensus on the exact threshold
- Since 'Year' has the highest VIF, it affects the VIFs of the other variables
- Let's remove 'Year' from the model
- 'EngineV' also has a high VIF, but once 'Year' is removed, that will no longer be the case
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of each continuous predictor
variables = df[['Mileage', 'Year', 'EngineV']]
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
df = df.drop(['Year'], axis=1)
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Recompute the VIFs without 'Year'
variables = df[['Mileage', 'EngineV']]
vif = pd.DataFrame(data=variables.columns.values, columns=['Features'])
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
df.to_csv('carprice_editted2.csv', index=False)
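- As an optional sanity check, reload the saved file to confirm it was written correctly

pd.read_csv('carprice_editted2.csv').head()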