Case Study (Car Price):

-  Predicting the price of a used car based on its specifications

Data Preprocessing - Dealing with Outliers

Overview

- Importing the relevant libraries

- Loading data

- Exploring the Descriptive Statistics 

- Checking Missing Values

- Removing Missing Values

- Dropping 'Model'

- Exploring Descriptive Statistics without the Missing Values

- Exploring Probability Distribution Function (PDF)

- Seaborn distplot ('Price')

- Removing the Outliers ('Price')

- Seaborn distplot ('Price') after removing outliers

- Seaborn distplot ('Mileage')

- Removing the Outliers ('Mileage')

- Seaborn distplot ('Mileage') after removing outliers

- Seaborn distplot ('EngineV')

- Removing the Outliers ('EngineV')

- Seaborn distplot ('EngineV') after removing outliers

- Seaborn distplot ('Year')

- Removing the Outliers ('Year')

- Seaborn distplot ('Year') after removing outliers

- Reset the Index (after removing observations)

- Descriptive Statistics after removing all outliers

- Saving Cleaned Data

Importing the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading data

-  The first potential regressor is brand, 
    as it is well known that a BMW is generally more expensive than a Toyota

- The second relevant variable is mileage, 
    since the more a car has been driven, the cheaper it should be

- Third is engine volume: 
    sports cars have larger engines and economy cars have smaller ones

- The final variable is year of production:
    the older the car, the cheaper it is, with the exception of vintage vehicles

- The rest are categorical variables, which we'll deal with on a case-by-case basis
In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/cars_price.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
   Brand          Price    Body       Mileage  EngineV  Engine Type  Registration  Year  Model
0  BMW            4200.0   sedan      277      2.0      Petrol       yes           1991  320
1  Mercedes-Benz  7900.0   van        427      2.9      Diesel       yes           1999  Sprinter 212
2  Mercedes-Benz  13300.0  sedan      358      5.0      Gas          yes           2003  S 500
3  Audi           23000.0  crossover  240      4.2      Petrol       yes           2007  Q7
4  Toyota         18300.0  crossover  120      2.0      Petrol       yes           2011  Rav 4

Exploring the Descriptive Statistics

- Descriptive Statistics: an easy way to check the data and spot problems 

- Descriptive statistics: by default, only descriptives for the numerical variables are shown

- To include the categorical ones, you should specify include='all'

- Categorical variables don't have some types of numerical descriptives

- Numerical variables don't have some types of categorical descriptives  

- Count: each variable has a different number of observations 

    -> implying there are some missing values

- Unique: unique entries for categorical variables 

    -> 312 unique models = 312 dummies -> really hard to implement regression (see the quick check below)
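
To see just how unwieldy that is, here is a quick sketch (assuming df is loaded as above; pd.get_dummies creates one indicator column per unique value):

pd.get_dummies(df['Model']).shape
# -> (4345, 312): one dummy column per model, far too many for a clean regression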
In [3]:
df.describe(include='all')
Out[3]:
        Brand       Price          Body   Mileage      EngineV      Engine Type  Registration  Year         Model
count   4345        4173.000000    4345   4345.000000  4195.000000  4345         4345          4345.000000  4345
unique  7           NaN            6      NaN          NaN          4            2             NaN          312
top     Volkswagen  NaN            sedan  NaN          NaN          Diesel       yes           NaN          E-Class
freq    936         NaN            1649   NaN          NaN          2019         3947          NaN          199
mean    NaN         19418.746935   NaN    161.237284   2.790734     NaN          NaN           2006.550058  NaN
std     NaN         25584.242620   NaN    105.705797   5.066437     NaN          NaN           6.719097     NaN
min     NaN         600.000000     NaN    0.000000     0.600000     NaN          NaN           1969.000000  NaN
25%     NaN         6999.000000    NaN    86.000000    1.800000     NaN          NaN           2003.000000  NaN
50%     NaN         11500.000000   NaN    155.000000   2.200000     NaN          NaN           2008.000000  NaN
75%     NaN         21700.000000   NaN    230.000000   3.000000     NaN          NaN           2012.000000  NaN
max     NaN         300000.000000  NaN    980.000000   99.990000    NaN          NaN           2016.000000  NaN

Checking Missing Values

In [4]:
df.isnull().sum()
Out[4]:
Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
Model             0
dtype: int64
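
Counts alone can be hard to judge; as an optional side check, the same information can be viewed as a share of all rows (a small sketch):

(df.isnull().mean() * 100).round(2)
# Price ~4% and EngineV ~3.5% missing; a common rule of thumb is that
# dropping rows is acceptable when well under 5% of observations are affected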

Removing Missing Values

- axis=0 means rows while axis=1 stands for columns
In [5]:
df = df.dropna(axis=0)

df.isnull().sum()
Out[5]:
Brand           0
Price           0
Body            0
Mileage         0
EngineV         0
Engine Type     0
Registration    0
Year            0
Model           0
dtype: int64

Dropping 'Model'

- Model is a categorical variable with 306 unique values
- 306 unique values -> 306 dummy variables, making it hard to implement regression
- Let's remove it
- axis=0 means rows while axis=1 stands for columns
In [6]:
df = df.drop('Model', axis=1)

Exploring Descriptive Statistics without the Missing Values

    - Price:

        mean      19552.308065
        min         600.000000
        25%        6999.000000
        50%       11500.000000
        75%       21900.000000
        max      300000.000000

    - Obviously we have a few outliers in the price variable

    - Outliers: observations that lie at an abnormal distance from the other observations 

    - Outliers will affect the regression dramatically and may inflate the coefficients, as the regression will try to place the line closer to those values (a toy demonstration follows the table below)

    - A similar issue can be seen in mileage, engine volume and year
In [7]:
df.describe()
Out[7]:
       Price          Mileage      EngineV      Year
count  4025.000000    4025.000000  4025.000000  4025.000000
mean   19552.308065   163.572174   2.764586     2006.379627
std    25815.734988   103.394703   4.935941     6.695595
min    600.000000     0.000000     0.600000     1969.000000
25%    6999.000000    90.000000    1.800000     2003.000000
50%    11500.000000   158.000000   2.200000     2007.000000
75%    21900.000000   230.000000   3.000000     2012.000000
max    300000.000000  980.000000   99.990000    2016.000000
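
To make the inflated-coefficients point concrete, here is a minimal synthetic sketch (toy data, not the car dataset) comparing the fitted slope with and without a single extreme observation:

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 1, 50)        # true slope is 2

slope_clean = np.polyfit(x, y, 1)[0]    # close to 2

# add one extreme point, analogous to a 300,000 price in this data
x_out, y_out = np.append(x, 10), np.append(y, 200)
slope_outlier = np.polyfit(x_out, y_out, 1)[0]

print(slope_clean, slope_outlier)       # the single point drags the slope upwards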

Exploring Probability Distribution Function (PDF)

- Displaying PDF of a variable

- The PDF will show us how that variable is distributed 

- Easy way to spot anomalies, such as outliers

- PDF is often the basis on which we decide whether we want to transform a feature

- seaborn.distplot(): this function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions; it can also fit scipy.stats distributions and plot the estimated PDF over the data
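
Note: sns.distplot() was deprecated in seaborn 0.11 and removed in later releases; if the calls below fail on a recent install, the closest modern equivalent (assuming a recent seaborn) is:

sns.histplot(df['Price'], kde=True)   # histogram with a KDE overlay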

Seaborn distplot ('Price')

- Obviously there are some outliers 

- Outliers are situated around the higher prices (right side of the graph)

- Dealing with the problem -> easily done by removing the top 1% of observations
In [8]:
sns.distplot(df['Price'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c340c110>

Removing the Outliers ('Price')

   - Eliminate Outliers by keeping the data below the 99th percentile
In [9]:
df = df[df['Price'] < df['Price'].quantile(0.99)]
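
Since the same percentile filter is applied again to 'Mileage' below, the pattern could optionally be wrapped in a small helper (a hypothetical convenience, not what this notebook uses):

def keep_below_quantile(data, column, q=0.99):
    """Keep only the rows where `column` is below its q-th quantile."""
    return data[data[column] < data[column].quantile(q)]

# e.g. df = keep_below_quantile(df, 'Price')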

Seaborn distplot ('Price') after removing outliers

- While the maximum value is still far away from the mean, it is acceptably closer
In [10]:
sns.distplot(df['Price'])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2b09410>

Seaborn distplot ('Mileage')

- Obviously there are some outliers 

- Outliers are situated around the higher mileages (right side of the graph)

- Dealing with the problem -> easily done by removing the top 1% of samples
In [11]:
sns.distplot(df['Mileage'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2a19550>

Removing the Outliers ('Mileage')

    - Eliminate Outliers by keeping the data below the 99th percentile
In [12]:
df = df[df['Mileage'] < df['Mileage'].quantile(0.99)]

Seaborn distplot ('Mileage') after removing outliers

In [13]:
sns.distplot(df['Mileage'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c299eb90>

Seaborn distplot ('EngineV')

- Obviously there are some outliers 

- Outliers are situated around the higher engine volumes (right side of the graph)

- Car engine volumes are usually below 6.5

- Dealing with the problem -> easily done by removing the samples with values above 6.5
In [14]:
sns.distplot(df['EngineV'])
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c28e1cd0>

Removing the Outliers ('EngineV')

- The issue comes from the fact that most missing values are indicated with 99.99 or 99 (see the quick check below)

- There are also some incorrect entries like 75

- Car engine volumes are usually below 6.5

- Eliminate Outliers by keeping the data below 6.5
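
That claim is easy to verify before filtering (a quick sketch; run it before the cell below):

df[df['EngineV'] > 6.5]['EngineV'].value_counts()
# per the notes above, 99.99 and 99 should dominate, along with a few
# clearly wrong entries such as 75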
In [15]:
df = df[df['EngineV'] < 6.5]

Seaborn distplot ('EngineV') after removing outliers

In [16]:
sns.distplot(df['EngineV'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2786710>

Seaborn distplot ('Year')

- Obviously there are some outliers 

- Outliers are situated on the left side of the graph (the oldest cars)

- Dealing with the problem -> easily done by removing the bottom 1% of samples
In [17]:
sns.distplot(df['Year'])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c26ea850>

Removing the Outliers ('Year')

- Eliminate Outliers by keeping the data above the 1st percentile
In [18]:
df = df[df['Year'] > df['Year'].quantile(0.01)]

Seaborn distplot 'Year' after removing outliers

In [19]:
sns.distplot(df['Year'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c264e850>

Reset the Index (after removing observations)

- The original indexes are preserved after removing observations

- If we remove the observations with indexes 2 and 3, the remaining indexes will be 0, 1, 4, 5, 6 (see the toy illustration below)

- Once we reset the index, a new column will be created containing the old index 

- To drop that column, add -> 'drop=True'
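
A toy illustration of this behaviour (a hypothetical 5-row frame, not the car data):

pd.DataFrame({'a': [10, 20, 30, 40, 50]}).drop([2, 3]).reset_index(drop=True)
# without reset_index the remaining labels would be 0, 1, 4;
# with drop=True they become 0, 1, 2 and the old index is discarded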
In [20]:
df = df.reset_index(drop=True)

Descriptive Statistics after removing all outliers

In [21]:
df.describe()
Out[21]:
       Price          Mileage      EngineV      Year
count  3867.000000    3867.000000  3867.000000  3867.000000
mean   18194.455679   160.542539   2.450440     2006.709853
std    19085.855165   95.633291    0.949366     6.103870
min    800.000000     0.000000     0.600000     1988.000000
25%    7200.000000    91.000000    1.800000     2003.000000
50%    11700.000000   157.000000   2.200000     2008.000000
75%    21700.000000   225.000000   3.000000     2012.000000
max    129222.000000  435.000000   6.300000     2016.000000

Saving Cleaned Data

In [22]:
df.to_csv('carprice_editted.csv', index=False)
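
As an optional sanity check, the saved file can be read back to confirm the cleaned shape:

pd.read_csv('carprice_editted.csv').shape   # expect (3867, 8)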