- Importing the relevant libraries
- Loading data
- Checking the OLS assumptions
1. Linearity
- Checking Linearity using a Scatter Plot
- Solution: Log Transformation
- Scatter Plot after log transformation
- Dropping Price Column
2. No Endogeneity
3. Homoscedasticity
4. No Autocorrelation
5. Multicollinearity
- Multicollinearity ('Mileage', 'Year', 'EngineV')
- Dropping Year Column
- Multicollinearity ('Mileage', 'EngineV')
- Saving Cleaned Data
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()
url = "https://DataScienceSchools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted.csv"
df = pd.read_csv(url)
df.head()
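- Optionally (a quick exploratory step, not part of the original flow), descriptive statistics give an overview of the variables before checking the assumptions

df.describe(include='all')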
- The continuous variables are Price, Year, EngineV (engine volume), and Mileage
- Their patterns against Price are exponential rather than linear
- Log transformation is a common way to deal with this issue
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))  # sharey -> the three plots share 'Price' as the y-axis
ax1.scatter(df['Year'], df['Price'])
ax1.set_title('Price and Year')
ax2.scatter(df['EngineV'], df['Price'])
ax2.set_title('Price and EngineV')
ax3.scatter(df['Mileage'], df['Price'])
ax3.set_title('Price and Mileage')
plt.show()
- The log transformation is the most popular type of transformation
- It is used to transform skewed data to approximately conform to normality
- np.log(x) -> returns the natural logarithm of a number or array of numbers
df['log_price'] = np.log(df['Price'])
df.head()
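- A quick visual check (an optional sketch, not in the original flow): histograms of 'Price' and 'log_price' show the right-skewed prices becoming roughly bell-shaped after the transformation

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.hist(df['Price'], bins=50)      # heavily right-skewed
ax1.set_title('Price')
ax2.hist(df['log_price'], bins=50)  # roughly bell-shaped after the log transform
ax2.set_title('log_price')
plt.show()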
- Result: a linear pattern in all plots
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15, 3))
ax1.scatter(df['Year'], df['log_price'])
ax1.set_title('Log Price and Year')
ax2.scatter(df['EngineV'], df['log_price'])
ax2.set_title('Log Price and EngineV')
ax3.scatter(df['Mileage'], df['log_price'])
ax3.set_title('Log Price and Mileage')
plt.show()
- Price column is no longer needed
df = df.drop('Price', axis=1)
- No endogeneity: the assumption is not violated
- Homoscedasticity: the assumption is not violated (the log transformation of the target typically helps stabilize the residual variance)
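- One common visual check (a minimal sketch, not part of the original notebook; it assumes a quick OLS fit on the continuous predictors): plot residuals against fitted values and look for a roughly constant spread

X = sm.add_constant(df[['Mileage', 'Year', 'EngineV']])
results = sm.OLS(df['log_price'], X, missing='drop').fit()  # quick illustrative fit
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()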
- No autocorrelation: the assumption is not violated
- The observations do not come from time-series or panel data
- They are simply a snapshot of the current situation on a second-hand car sales website
- Each row comes from a different customer who is willing to sell their car through the platform
- Logically, there is no reason for the observations to be dependent on each other
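- For time-series data, where this assumption does matter, one could compute the Durbin-Watson statistic on the residuals (a minimal sketch reusing the illustrative OLS fit above; values near 2 indicate no autocorrelation)

from statsmodels.stats.stattools import durbin_watson

durbin_watson(results.resid)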
- Multicollinearity is a situation in which two or more explanatory variables in a multiple regression model are highly linearly related
- Variance Inflation Factor (VIF) - StatsModels
- VIF = 1 -> no multicollinearity
- 1 < VIF < 5 -> perfectly acceptable
- VIF > 5 (some use 6 or even 10 as the cutoff) -> unacceptable; there is no firm consensus on the exact threshold
- Since 'Year' has the highest VIF, it affects the VIFs of the other variables
- Let's remove 'Year' from the model
- 'EngineV' also has a high VIF, but once 'Year' is removed, that will no longer be the case
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of each continuous predictor
variables = df[['Mileage', 'Year', 'EngineV']]
vif = pd.DataFrame()
vif["Features"] = variables.columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
df = df.drop(['Year'], axis=1)
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Recompute the VIFs without 'Year'
variables = df[['Mileage', 'EngineV']]
vif = pd.DataFrame(data=variables.columns.values, columns=['Features'])
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif
df.to_csv('carprice_editted2.csv', index=False)
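- As an optional sanity check, reload the saved file to confirm it was written correctly

pd.read_csv('carprice_editted2.csv').head()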