Case Study (Car Price) :¶

Data Preprocessing in Machine Learning - Outliers¶

Importing the relevant libraries¶

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading data¶

url = "https://datascienceschools.github.io/cars_price.csv"

df = pd.read_csv(url)

df.head()

Exploring the descriptive statistics of the variables¶

- Descriptive statistics: only descriptives for the numerical variables are shown

- To include the categorical ones, you should specify include='all'

- Categorical variables don't have some types of numerical descriptives

- Numerical variables don't have some types of categorical descriptives

df.describe(include='all')

Dropping 'Model'¶

- 'Model' is a categorical variable and has 306 unique values
- 306 unique values -> 306 dummy variables, making it hard to implement regression
- Let's remove it

df = df.drop('Model', axis=1)

Checking Missing Values¶

df.isnull().sum()

Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
dtype: int64

Removing Missing Values¶

df = df.dropna(axis=0)

df.isnull().sum()

Brand           0
Price           0
Body            0
Mileage         0
EngineV         0
Engine Type     0
Registration    0
Year            0
dtype: int64

Checking descriptives statistics without the missing values¶

df.describe()

Exploring Probability Distribution Function (PDF)¶

- Displaying PDF of a variable

- The PDF will show us how that variable is distributed 

- Easy way to spot anomalies, such as outliers

- PDF is often the basis on which we decide whether we want to transform a feature

- seaborn.distplot() -> 

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data

Seaborn distplot ('Price')¶

- Obviously there are some outliers 

- Outliers are situated around the higher prices (right side of the graph)

- Dealing with the problem -> easily by removing  1% of the problematic samples

sns.distplot(df['Price'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92e67f10>

Removing the Outliers ('Price')¶

df = df[df['Price'] < df['Price'].quantile(0.99)]

Seaborn distplot ('Price') after removing outliers¶

sns.distplot(df['Price'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92ba3510>

Seaborn distplot 'Mileage'¶

- Obviously there are some outliers 

- Outliers are situated around the higher mileage(right side of the graph)

- Dealing with the problem -> easily by removing  1% of the problematic samples

sns.distplot(df['Mileage'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92aaacd0>

Removing the Outliers ('Mileage')¶

df = df[df['Mileage']< df['Mileage'].quantile(0.99)]

Seaborn distplot 'Mileage' after removing outliers¶

sns.distplot(df['Mileage'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b929a1310>

Seaborn distplot 'EngineV'¶

- Obviously there are some outliers 

- Outliers are situated around the higher engine volumes(right side of the graph)

- But, car engine volumes are usually below 6.5

- Dealing with the problem -> easily by removing the problematic samples > 6.5

sns.distplot(df['EngineV'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b946c93d0>

Removing the Outliers ('EngineV')¶

- Issue comes from the fact that most missing values are indicated with 99.99 or 99

- There are also some incorrect entries like 75

- Car engine volumes are usually below 6.5

df = df[df['EngineV']<6.5]

Seaborn distplot 'EngineV' after removing outliers¶

sns.distplot(df['EngineV'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b9282fd10>

Seaborn distplot 'Year'¶

- Obviously there are some outliers 

- Outliers are situated left side of the graph

- Dealing with the problem -> easily by removing  1% of the problematic samples

sns.distplot(df['Year'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b92796d50>

Removing the Outliers ('Year')¶

df = df[df['Year']> df['Year'].quantile(0.01)]

Seaborn distplot 'Year' after removing outliers¶

sns.distplot(df['Year'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b926e4250>

Reset the Index¶

- The original indexes are preserved after removing observations

- If we remove observations with indexes 2 and 3, the indexes will be as: 0,1,4,5,6

- Once we reset the index, a new column will be created containing the old index 

- To drop that column, add -> 'drop=True'

df = df.reset_index(drop=True)

Descriptive Statistics after removing all outliers¶

df.describe()

	Brand	Price	Body	Mileage	EngineV	Engine Type	Registration	Year	Model
0	BMW	4200.0	sedan	277	2.0	Petrol	yes	1991	320
1	Mercedes-Benz	7900.0	van	427	2.9	Diesel	yes	1999	Sprinter 212
2	Mercedes-Benz	13300.0	sedan	358	5.0	Gas	yes	2003	S 500
3	Audi	23000.0	crossover	240	4.2	Petrol	yes	2007	Q7
4	Toyota	18300.0	crossover	120	2.0	Petrol	yes	2011	Rav 4

	Brand	Price	Body	Mileage	EngineV	Engine Type	Registration	Year	Model
count	4345	4173.000000	4345	4345.000000	4195.000000	4345	4345	4345.000000	4345
unique	7	NaN	6	NaN	NaN	4	2	NaN	312
top	Volkswagen	NaN	sedan	NaN	NaN	Diesel	yes	NaN	E-Class
freq	936	NaN	1649	NaN	NaN	2019	3947	NaN	199
mean	NaN	19418.746935	NaN	161.237284	2.790734	NaN	NaN	2006.550058	NaN
std	NaN	25584.242620	NaN	105.705797	5.066437	NaN	NaN	6.719097	NaN
min	NaN	600.000000	NaN	0.000000	0.600000	NaN	NaN	1969.000000	NaN
25%	NaN	6999.000000	NaN	86.000000	1.800000	NaN	NaN	2003.000000	NaN
50%	NaN	11500.000000	NaN	155.000000	2.200000	NaN	NaN	2008.000000	NaN
75%	NaN	21700.000000	NaN	230.000000	3.000000	NaN	NaN	2012.000000	NaN
max	NaN	300000.000000	NaN	980.000000	99.990000	NaN	NaN	2016.000000	NaN

	Price	Mileage	EngineV	Year
count	4025.000000	4025.000000	4025.000000	4025.000000
mean	19552.308065	163.572174	2.764586	2006.379627
std	25815.734988	103.394703	4.935941	6.695595
min	600.000000	0.000000	0.600000	1969.000000
25%	6999.000000	90.000000	1.800000	2003.000000
50%	11500.000000	158.000000	2.200000	2007.000000
75%	21900.000000	230.000000	3.000000	2012.000000
max	300000.000000	980.000000	99.990000	2016.000000

	Price	Mileage	EngineV	Year
count	3867.000000	3867.000000	3867.000000	3867.000000
mean	18194.455679	160.542539	2.450440	2006.709853
std	19085.855165	95.633291	0.949366	6.103870
min	800.000000	0.000000	0.600000	1988.000000
25%	7200.000000	91.000000	1.800000	2003.000000
50%	11700.000000	157.000000	2.200000	2008.000000
75%	21700.000000	225.000000	3.000000	2012.000000
max	129222.000000	435.000000	6.300000	2016.000000