Case Study (Car Price):

Linear Regression - OLS Assumptions

Overview

- Importing the relevant libraries

- Loading data

- Checking the OLS assumptions

    1. Linearity

        - Checking Linearity using a Scatter Plot
        - Solution: Log Transformation
        - Scatter Plot after log transformation
        - Dropping Price Column

    2. No Endogeneity 

    3. Normality & Homoscedasticity

    4. No Autocorrelation

    5. No Multicollinearity

        - Multicollinearity ('Mileage', 'Year', 'EngineV')
        - Dropping Year Column
        - Multicollinearity ('Mileage', 'EngineV')

- Saving Cleaned Data

Importing the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

Loading data

In [4]:
url = "https://DataScienceSchools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted.csv"

df = pd.read_csv(url)

df.head()
Out[4]:
|   | Brand         | Price   | Body      | Mileage | EngineV | Engine Type | Registration | Year |
|---|---------------|---------|-----------|---------|---------|-------------|--------------|------|
| 0 | BMW           | 4200.0  | sedan     | 277     | 2.0     | Petrol      | yes          | 1991 |
| 1 | Mercedes-Benz | 7900.0  | van       | 427     | 2.9     | Diesel      | yes          | 1999 |
| 2 | Mercedes-Benz | 13300.0 | sedan     | 358     | 5.0     | Gas         | yes          | 2003 |
| 3 | Audi          | 23000.0 | crossover | 240     | 4.2     | Petrol      | yes          | 2007 |
| 4 | Toyota        | 18300.0 | crossover | 120     | 2.0     | Petrol      | yes          | 2011 |

Checking the OLS assumptions

- The continuous variables to check are Price, Year, EngineV (engine volume), and Mileage

1. Linearity

Checking Linearity using a Scatter Plot

- The patterns look exponential rather than linear

- Log transformation is a common way to deal with this issue
In [21]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))  # sharey -> the three plots share 'Price' as the y-axis

ax1.scatter(df['Year'],df['Price'])
ax1.set_title('Price and Year')

ax2.scatter(df['EngineV'],df['Price'])
ax2.set_title('Price and EngineV')

ax3.scatter(df['Mileage'],df['Price'])
ax3.set_title('Price and Mileage')


plt.show()

Solution: Log Transformation

- The log transformation is the most popular of these transformations

- It is used to transform skewed data so that it approximately conforms to normality

- np.log(x) returns the natural logarithm of a number or an array of numbers
In [22]:
df['log_price'] = np.log(df['Price'])

df.head()
Out[22]:
|   | Brand         | Price   | Body      | Mileage | EngineV | Engine Type | Registration | Year | log_price |
|---|---------------|---------|-----------|---------|---------|-------------|--------------|------|-----------|
| 0 | BMW           | 4200.0  | sedan     | 277     | 2.0     | Petrol      | yes          | 1991 | 8.342840  |
| 1 | Mercedes-Benz | 7900.0  | van       | 427     | 2.9     | Diesel      | yes          | 1999 | 8.974618  |
| 2 | Mercedes-Benz | 13300.0 | sedan     | 358     | 5.0     | Gas         | yes          | 2003 | 9.495519  |
| 3 | Audi          | 23000.0 | crossover | 240     | 4.2     | Petrol      | yes          | 2007 | 10.043249 |
| 4 | Toyota        | 18300.0 | crossover | 120     | 2.0     | Petrol      | yes          | 2011 | 9.814656  |

Scatter Plot after log transformation

- Result: a linear pattern in all plots
In [23]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))

ax1.scatter(df['Year'],df['log_price'])
ax1.set_title('Log Price and Year')

ax2.scatter(df['EngineV'],df['log_price'])
ax2.set_title('Log Price and EngineV')

ax3.scatter(df['Mileage'],df['log_price'])
ax3.set_title('Log Price and Mileage')


plt.show()

Dropping Price Column

- The Price column is no longer needed, since log_price carries the same information (and can be mapped back with np.exp, as sketched below)
In [24]:
df = df.drop('Price', axis=1)
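
The transformation is reversible: a prediction made on the log scale maps back to a price through the exponential function. A minimal sketch (log_price_hat is a hypothetical model prediction, not a value produced in this notebook):

log_price_hat = 9.5                # hypothetical prediction on the log scale
price_hat = np.exp(log_price_hat)  # inverse of np.log -> price in currency units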

2. No Endogeneity

- The assumption is not violated

- Endogeneity typically comes from omitted-variable bias, and no relevant variable that is correlated with both the predictors and the error term appears to be missing from this model

3. Normality & Homoscedasticity

- The assumption is not violated

- The log transformation applied above is also the usual remedy for heteroscedasticity, and for a reasonably large sample the central limit theorem makes normality of the residuals less of a concern (a quick visual check is sketched below)
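
A minimal diagnostic sketch (not part of the original notebook): fit the numeric predictors available at this point and inspect the residuals.

# Sketch: residual checks for normality and homoscedasticity.
# 'Year' is still in df at this point (it is only dropped later).
X = sm.add_constant(df[['Mileage', 'Year', 'EngineV']])
results = sm.OLS(df['log_price'], X).fit()

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.hist(results.resid, bins=50)                  # roughly bell-shaped -> normality
ax1.set_title('Residual distribution')
ax2.scatter(results.fittedvalues, results.resid)  # no funnel shape -> homoscedasticity
ax2.set_title('Residuals vs fitted values')
plt.show()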

4. No Autocorrelation

- The assumption is not violated

- The observations do not come from time series or panel data

- They are simply a snapshot of the current listings on a second-hand car sales website

- Each row comes from a different customer who is willing to sell their car through the platform

- So there is no logical reason for the observations to depend on one another (a numerical check is sketched below for completeness)
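
The Durbin-Watson statistic offers a quick numerical check; values near 2 indicate no first-order autocorrelation. A minimal sketch (not part of the original notebook), reusing the numeric predictors from the sketch above:

from statsmodels.stats.stattools import durbin_watson

# Sketch: Durbin-Watson statistic on OLS residuals (close to 2 -> no autocorrelation).
X = sm.add_constant(df[['Mileage', 'Year', 'EngineV']])
residuals = sm.OLS(df['log_price'], X).fit().resid
print(durbin_watson(residuals))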

5. No Multicollinearity

- Multicollinearity is a situation in which two or more explanatory variables in a multiple regression model are highly linearly related

- It is checked here with the Variance Inflation Factor (VIF) from StatsModels (what the VIF computes is sketched below)

- VIF = 1 -> no multicollinearity
- 1 < VIF < 5 -> generally considered acceptable
- VIF above 5 (or 6, or 10; there is no firm consensus on the exact cutoff) -> problematic
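
Conceptually, VIF_i = 1 / (1 - R²_i), where R²_i is the R-squared from regressing feature i on the remaining features. A hand-rolled sketch of that formula, using the LinearRegression imported earlier (note that statsmodels' variance_inflation_factor does not add an intercept, so its numbers match this sketch only when a constant column is included in the design matrix):

# Sketch: VIF_i = 1 / (1 - R^2_i), with R^2_i obtained by regressing
# feature i on the other features (intercept included here).
def vif_sketch(X):
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return vifs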

Multicollinearity ('Mileage', 'Year', 'EngineV')

  - Year has the highest VIF, and it inflates the VIFs of the other variables

  - Let's remove Year from the model

  - EngineV also shows a high VIF, but once Year is removed that will no longer be the case
In [25]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = df[['Mileage','Year','EngineV']]

vif = pd.DataFrame()

vif["Features"] = variables.columns

vif["VIF"] = [variance_inflation_factor(variables.values, i)
                        for i in range(variables.shape[1])]

vif
Out[25]:
|   | Features | VIF       |
|---|----------|-----------|
| 0 | Mileage  | 3.791584  |
| 1 | Year     | 10.354854 |
| 2 | EngineV  | 7.662068  |

Dropping Year Column

In [26]:
df = df.drop(['Year'],axis=1)

Multicollinearity ('Mileage', 'EngineV')

In [27]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = df[['Mileage','EngineV']]

vif = pd.DataFrame(data=variables.columns.values, columns=['Features'])

vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif
Out[27]:
|   | Features | VIF      |
|---|----------|----------|
| 0 | Mileage  | 2.805214 |
| 1 | EngineV  | 2.805214 |

- With only two predictors, each one is regressed on the other, so the two R² values, and hence the VIFs, coincide

Saving Cleaned Data

In [28]:
df.to_csv('carprice_editted2.csv', index=False)