Case Study (Car Price):

Data Preprocessing in Machine Learning - Outliers

Importing the relevant libraries

In [53]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading data

In [54]:
url = "https://datascienceschools.github.io/cars_price.csv"

df = pd.read_csv(url)

df.head()
Out[54]:
           Brand    Price       Body  Mileage  EngineV Engine Type Registration  Year         Model
0            BMW   4200.0      sedan      277      2.0      Petrol          yes  1991           320
1  Mercedes-Benz   7900.0        van      427      2.9      Diesel          yes  1999  Sprinter 212
2  Mercedes-Benz  13300.0      sedan      358      5.0         Gas          yes  2003         S 500
3           Audi  23000.0  crossover      240      4.2      Petrol          yes  2007            Q7
4         Toyota  18300.0  crossover      120      2.0      Petrol          yes  2011         Rav 4

Exploring the descriptive statistics of the variables

- By default, describe() shows descriptives for the numerical variables only

- To include the categorical ones, pass include='all'

- Categorical variables have no numerical descriptives (mean, std, min, quartiles, max), so those cells show NaN

- Numerical variables have no categorical descriptives (unique, top, freq), so those cells show NaN
In [55]:
df.describe(include='all')
Out[55]:
             Brand          Price   Body      Mileage      EngineV  Engine Type  Registration         Year    Model
count         4345    4173.000000   4345  4345.000000  4195.000000         4345          4345  4345.000000     4345
unique           7            NaN      6          NaN          NaN            4             2          NaN      312
top     Volkswagen            NaN  sedan          NaN          NaN       Diesel           yes          NaN  E-Class
freq           936            NaN   1649          NaN          NaN         2019          3947          NaN      199
mean           NaN   19418.746935    NaN   161.237284     2.790734          NaN           NaN  2006.550058      NaN
std            NaN   25584.242620    NaN   105.705797     5.066437          NaN           NaN     6.719097      NaN
min            NaN     600.000000    NaN     0.000000     0.600000          NaN           NaN  1969.000000      NaN
25%            NaN    6999.000000    NaN    86.000000     1.800000          NaN           NaN  2003.000000      NaN
50%            NaN   11500.000000    NaN   155.000000     2.200000          NaN           NaN  2008.000000      NaN
75%            NaN   21700.000000    NaN   230.000000     3.000000          NaN           NaN  2012.000000      NaN
max            NaN  300000.000000    NaN   980.000000    99.990000          NaN           NaN  2016.000000      NaN
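
The difference between describe() and describe(include='all') can be seen on a tiny example. The sketch below uses a made-up three-row frame (the names toy, numeric_only and everything are illustrative, not part of the notebook) and only shows which columns and which rows appear in each case.

```python
import pandas as pd

# Hypothetical toy frame: one categorical and one numerical column.
toy = pd.DataFrame({
    "Brand": ["BMW", "Audi", "BMW"],
    "Price": [4200.0, 23000.0, 18300.0],
})

# Default: only the numerical column is summarized.
numeric_only = toy.describe()

# include='all': both columns appear; descriptives that do not apply
# to a column's type are filled with NaN.
everything = toy.describe(include="all")

print(list(numeric_only.columns))
print(everything)
```

In the second table, the 'Brand' column has values for count, unique, top and freq but NaN for mean, while 'Price' has the opposite pattern.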

Dropping 'Model'

- 'Model' is a categorical variable and has 312 unique values (see the 'unique' row above)
- 312 unique values -> roughly 312 dummy variables (311 if one category is used as the baseline), which makes a regression hard to fit and interpret
- Let's remove it
In [56]:
df = df.drop('Model', axis=1)
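
To illustrate why a high-cardinality column is a problem, here is a minimal sketch of how dummy variables grow with the number of categories. It uses a hypothetical four-row frame rather than the car data, and pd.get_dummies is just one common way to create dummies; the notebook itself does not say which encoder would be used.

```python
import pandas as pd

# Hypothetical toy data: a categorical column with three distinct values.
toy = pd.DataFrame({"Model": ["320", "Q7", "Rav 4", "Q7"]})

# One indicator column is created per distinct category.
dummies = pd.get_dummies(toy["Model"])
n_dummy_columns = dummies.shape[1]

# drop_first=True leaves one category out as the baseline, which avoids
# perfect collinearity with the intercept in a regression.
dummies_baseline = pd.get_dummies(toy["Model"], drop_first=True)
n_baseline_columns = dummies_baseline.shape[1]

print(n_dummy_columns, n_baseline_columns)
```

The same rule applied to a column with hundreds of distinct values would add hundreds of sparse columns to the design matrix.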

Checking Missing Values

In [57]:
df.isnull().sum()
Out[57]:
Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
dtype: int64

Removing Missing Values

In [58]:
df = df.dropna(axis=0)

df.isnull().sum()
Out[58]:
Brand           0
Price           0
Body            0
Mileage         0
EngineV         0
Engine Type     0
Registration    0
Year            0
dtype: int64
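
Before dropping rows it can help to check how much data is lost. The counts reported in this notebook (4345 rows before, 4025 after) imply that 320 rows are removed. The sketch below shows one way to compute such a share on a hypothetical four-row frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two columns.
toy = pd.DataFrame({
    "Price": [4200.0, np.nan, 13300.0, 23000.0],
    "EngineV": [2.0, 2.9, np.nan, 4.2],
})

rows_before = len(toy)

# A row is dropped by dropna(axis=0) if ANY of its columns is missing.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

cleaned = toy.dropna(axis=0)
rows_after = len(cleaned)

share_dropped = rows_with_missing / rows_before
print(rows_before, rows_after, share_dropped)
```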

Checking descriptive statistics without the missing values

In [59]:
df.describe()
Out[59]:
               Price      Mileage      EngineV         Year
count    4025.000000  4025.000000  4025.000000  4025.000000
mean    19552.308065   163.572174     2.764586  2006.379627
std     25815.734988   103.394703     4.935941     6.695595
min       600.000000     0.000000     0.600000  1969.000000
25%      6999.000000    90.000000     1.800000  2003.000000
50%     11500.000000   158.000000     2.200000  2007.000000
75%     21900.000000   230.000000     3.000000  2012.000000
max    300000.000000   980.000000    99.990000  2016.000000

Exploring the Probability Density Function (PDF)

- Displaying PDF of a variable

- The PDF will show us how that variable is distributed 

- Easy way to spot anomalies, such as outliers

- PDF is often the basis on which we decide whether we want to transform a feature

- seaborn.distplot() -> combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data

- Note: distplot() is deprecated since seaborn 0.11; histplot() and displot() are its replacements

Seaborn distplot ('Price')

- The distribution has some clear outliers

- They are situated around the higher prices (right side of the graph)

- A simple way to deal with them -> remove the top 1% of observations by price
In [60]:
sns.distplot(df['Price'])
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92e67f10>

Removing the Outliers ('Price')

In [61]:
df = df[df['Price'] < df['Price'].quantile(0.99)]
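
How the 0.99 quantile cutoff behaves can be checked on a small example. The sketch below uses hypothetical data (the values 1 to 99 plus one extreme value) and shows that the cutoff is interpolated between neighboring observations, so it need not be a value that occurs in the data.

```python
import pandas as pd

# Hypothetical toy data: 99 ordinary values and one extreme outlier.
toy = pd.DataFrame({"Price": [float(p) for p in range(1, 100)] + [10000.0]})

# pandas interpolates linearly between the two nearest observations,
# so the cutoff falls between 99 and 10000.
cutoff = toy["Price"].quantile(0.99)

# Same filter as in the notebook: keep rows strictly below the cutoff.
trimmed = toy[toy["Price"] < cutoff]

print(cutoff, len(toy), len(trimmed), trimmed["Price"].max())
```

Because the comparison is strict, any row exactly equal to the cutoff would also be removed.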

Seaborn distplot ('Price') after removing outliers

In [62]:
sns.distplot(df['Price'])
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92ba3510>

Seaborn distplot 'Mileage'

- The distribution has some clear outliers

- They are situated around the higher mileages (right side of the graph)

- A simple way to deal with them -> remove the top 1% of observations by mileage
In [63]:
sns.distplot(df['Mileage'])
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92aaacd0>

Removing the Outliers ('Mileage')

In [64]:
df = df[df['Mileage']< df['Mileage'].quantile(0.99)]

Seaborn distplot 'Mileage' after removing outliers

In [65]:
sns.distplot(df['Mileage'])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b929a1310>

Seaborn distplot 'EngineV'

- The distribution has some clear outliers

- They are situated around the higher engine volumes (right side of the graph)

- But car engine volumes are usually below 6.5

- A simple way to deal with them -> keep only the samples with EngineV below 6.5
In [66]:
sns.distplot(df['EngineV'])
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b946c93d0>

Removing the Outliers ('EngineV')

- The issue comes from missing values that were recorded as placeholder numbers such as 99.99 or 99

- There are also some incorrect entries like 75

- Car engine volumes are usually below 6.5, so anything at or above that is treated as invalid
In [67]:
df = df[df['EngineV']<6.5]
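
Before applying a fixed threshold it can be useful to look at which values it removes. The following sketch does this on a hypothetical seven-row column; the 6.5 threshold comes from the text above, while the data and the name max_plausible are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column containing placeholder and mistyped values.
toy = pd.DataFrame({"EngineV": [1.8, 2.0, 99.99, 3.0, 75.0, 99.0, 6.3]})

max_plausible = 6.5  # threshold taken from the text

# Inspect the values that the filter is about to remove.
suspicious = toy.loc[toy["EngineV"] >= max_plausible, "EngineV"]
suspicious_counts = suspicious.value_counts()

# Same filter as in the notebook: keep rows strictly below the threshold.
filtered = toy[toy["EngineV"] < max_plausible]

print(suspicious_counts)
print(len(filtered), filtered["EngineV"].max())
```

Repeated values such as 99.99 in the removed set are a hint that they encode missing data rather than real measurements.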

Seaborn distplot 'EngineV' after removing outliers

In [68]:
sns.distplot(df['EngineV'])
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b9282fd10>

Seaborn distplot 'Year'

- The distribution has some clear outliers

- They are situated on the left side of the graph (the earliest years)

- A simple way to deal with them -> remove the bottom 1% of observations by year
In [69]:
sns.distplot(df['Year'])
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92796d50>

Removing the Outliers ('Year')

In [70]:
df = df[df['Year']> df['Year'].quantile(0.01)]
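
One caveat with quantile filters on a discrete column such as 'Year': many rows can share the cutoff value, and a strict comparison removes all of them. The sketch below demonstrates this on hypothetical data, where a nominal 1% cut removes 5% of the rows.

```python
import pandas as pd

# Hypothetical toy data: 5 rows from 1990 and 95 rows from 2000.
toy = pd.DataFrame({"Year": [1990] * 5 + [2000] * 95})

# The 1% quantile falls inside the block of 1990 values.
cutoff = toy["Year"].quantile(0.01)

# Same filter as in the notebook: keep rows strictly above the cutoff.
kept = toy[toy["Year"] > cutoff]

share_removed = 1 - len(kept) / len(toy)
print(cutoff, len(kept), share_removed)
```

Comparing the row count before and after the filter is a quick way to see how many observations were actually dropped.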

Seaborn distplot 'Year' after removing outliers

In [71]:
sns.distplot(df['Year'])
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b926e4250>

Reset the Index

- Removing observations does not renumber the rows: the original indexes are preserved

- If we remove the observations with indexes 2 and 3, the remaining indexes will be 0, 1, 4, 5, 6

- By default, reset_index() creates a new column containing the old index

- To drop that column, pass drop=True
In [72]:
df = df.reset_index(drop=True)
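
The index behavior described above can be reproduced on a small frame. This is a minimal sketch with hypothetical data that removes the rows with indexes 2 and 3 and compares reset_index() with and without drop=True.

```python
import pandas as pd

# Hypothetical toy frame with the default index 0..6.
toy = pd.DataFrame({"Price": [10, 20, 30, 40, 50, 60, 70]})

# Removing rows leaves gaps in the index.
filtered = toy.drop(index=[2, 3])
print(list(filtered.index))

# Without drop=True the old index is kept as a column named 'index'.
with_old_index = filtered.reset_index()
print(list(with_old_index.columns))

# With drop=True the old index is discarded and rows are renumbered.
clean = filtered.reset_index(drop=True)
print(list(clean.index), list(clean.columns))
```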

Descriptive Statistics after removing all outliers

In [73]:
df.describe()
Out[73]:
               Price      Mileage      EngineV         Year
count    3867.000000  3867.000000  3867.000000  3867.000000
mean    18194.455679   160.542539     2.450440  2006.709853
std     19085.855165    95.633291     0.949366     6.103870
min       800.000000     0.000000     0.600000  1988.000000
25%      7200.000000    91.000000     1.800000  2003.000000
50%     11700.000000   157.000000     2.200000  2008.000000
75%     21700.000000   225.000000     3.000000  2012.000000
max    129222.000000   435.000000     6.300000  2016.000000