Case Study (Car Price):

-  Predicting the price of a used car based on its specifications

Data Preprocessing - Dealing with Outliers

Overview

- Importing the relevant libraries

- Loading data

- Exploring the Descriptive Statistics 

- Checking Missing Values

- Removing Missing Values

- Dropping 'Model'

- Exploring Descriptive Statistics without the Missing Values

- Exploring Probability Distribution Function (PDF)

- Seaborn distplot ('Price')

- Removing the Outliers ('Price')

- Seaborn distplot ('Price') after removing outliers

- Seaborn distplot ('Mileage')

- Removing the Outliers ('Mileage')

- Seaborn distplot ('Mileage') after removing outliers

- Seaborn distplot ('EngineV')

- Removing the Outliers ('EngineV')

- Seaborn distplot ('EngineV') after removing outliers

- Seaborn distplot ('Year')

- Removing the Outliers ('Year')

- Seaborn distplot ('Year') after removing outliers

- Reset the Index (after removing observations)

- Descriptive Statistics after removing all outliers

- Saving Cleaned Data

Importing the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Loading data

-  The first potential regressor is brand, 
    as it is well known that a BMW is generally more expensive than a Toyota

- The second relevant variable is mileage, 
    since the more a car has been driven, the cheaper it should be

- Third is engine volume: 
    sports cars have larger engines and economy cars have smaller ones

- The final variable is year of production:
    the older the car, the cheaper it is, with the exception of vintage vehicles

- The rest are categorical variables, which we'll deal with on a case-by-case basis
In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/cars_price.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
   Brand          Price    Body       Mileage  EngineV  Engine Type  Registration  Year  Model
0  BMW            4200.0   sedan      277      2.0      Petrol       yes           1991  320
1  Mercedes-Benz  7900.0   van        427      2.9      Diesel       yes           1999  Sprinter 212
2  Mercedes-Benz  13300.0  sedan      358      5.0      Gas          yes           2003  S 500
3  Audi           23000.0  crossover  240      4.2      Petrol       yes           2007  Q7
4  Toyota         18300.0  crossover  120      2.0      Petrol       yes           2011  Rav 4

Exploring the Descriptive Statistics

- Descriptive Statistics: an easy way to check the data and spot problems 

- Descriptive statistics: by default, only descriptives for the numerical variables are shown

- To include the categorical ones, you should specify include='all'

- Categorical variables don't have some types of numerical descriptives

- Numerical variables don't have some types of categorical descriptives  

- Count: each variable has a different number of observations 

    -> implying there are some missing values

- Unique: unique entries for categorical variables 

    -> 312 unique models = 312 dummies -> really hard to implement regression (see the quick check below)
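
To see just how unwieldy that is, here is a quick sketch (assuming df is loaded as above; pd.get_dummies creates one indicator column per unique value):

pd.get_dummies(df['Model']).shape
# -> (4345, 312): one dummy column per model, far too many for a clean regression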
In [3]:
df.describe(include='all')
Out[3]:
        Brand       Price          Body   Mileage      EngineV      Engine Type  Registration  Year         Model
count   4345        4173.000000    4345   4345.000000  4195.000000  4345         4345          4345.000000  4345
unique  7           NaN            6      NaN          NaN          4            2             NaN          312
top     Volkswagen  NaN            sedan  NaN          NaN          Diesel       yes           NaN          E-Class
freq    936         NaN            1649   NaN          NaN          2019         3947          NaN          199
mean    NaN         19418.746935   NaN    161.237284   2.790734     NaN          NaN           2006.550058  NaN
std     NaN         25584.242620   NaN    105.705797   5.066437     NaN          NaN           6.719097     NaN
min     NaN         600.000000     NaN    0.000000     0.600000     NaN          NaN           1969.000000  NaN
25%     NaN         6999.000000    NaN    86.000000    1.800000     NaN          NaN           2003.000000  NaN
50%     NaN         11500.000000   NaN    155.000000   2.200000     NaN          NaN           2008.000000  NaN
75%     NaN         21700.000000   NaN    230.000000   3.000000     NaN          NaN           2012.000000  NaN
max     NaN         300000.000000  NaN    980.000000   99.990000    NaN          NaN           2016.000000  NaN

Checking Missing Values

In [4]:
df.isnull().sum()
Out[4]:
Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
Model             0
dtype: int64
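
Counts alone can be hard to judge; as an optional side check, the same information can be viewed as a share of all rows (a small sketch):

(df.isnull().mean() * 100).round(2)
# Price ~4% and EngineV ~3.5% missing; a common rule of thumb is that
# dropping rows is acceptable when well under 5% of observations are affected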

Removing Missing Values

- axis=0 means rows while axis=1 stands for columns
In [5]:
df = df.dropna(axis=0)

df.isnull().sum()
Out[5]:
Brand           0
Price           0
Body            0
Mileage         0
EngineV         0
Engine Type     0
Registration    0
Year            0
Model           0
dtype: int64

Dropping 'Model'

- Model is a categorical variable with 306 unique values
- 306 unique values -> 306 dummy variables, making it hard to implement regression
- Let's remove it
- axis=0 means rows while axis=1 stands for columns
In [6]:
df = df.drop('Model', axis=1)

Exploring Descriptive Statistics without the Missing Values

    - Price:

        mean      19552.308065
        min         600.000000
        25%        6999.000000
        50%       11500.000000
        75%       21900.000000
        max      300000.000000

    - Obviously we have a few outliers in the price variable

    - Outliers: observations that lie at an abnormal distance from the other observations 

    - Outliers will affect the regression dramatically and may inflate the coefficients, as the regression will try to place the line closer to those values (a toy demonstration follows the table below)

    - A similar issue can be seen in mileage, engine volume and year
In [7]:
df.describe()
Out[7]:
       Price          Mileage      EngineV      Year
count  4025.000000    4025.000000  4025.000000  4025.000000
mean   19552.308065   163.572174   2.764586     2006.379627
std    25815.734988   103.394703   4.935941     6.695595
min    600.000000     0.000000     0.600000     1969.000000
25%    6999.000000    90.000000    1.800000     2003.000000
50%    11500.000000   158.000000   2.200000     2007.000000
75%    21900.000000   230.000000   3.000000     2012.000000
max    300000.000000  980.000000   99.990000    2016.000000
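
To make the inflated-coefficients point concrete, here is a minimal synthetic sketch (toy data, not the car dataset) comparing the fitted slope with and without a single extreme observation:

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 1, 50)        # true slope is 2

slope_clean = np.polyfit(x, y, 1)[0]    # close to 2

# add one extreme point, analogous to a 300,000 price in this data
x_out, y_out = np.append(x, 10), np.append(y, 200)
slope_outlier = np.polyfit(x_out, y_out, 1)[0]

print(slope_clean, slope_outlier)       # the single point drags the slope upwards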

Exploring Probability Distribution Function (PDF)

- Displaying PDF of a variable

- The PDF will show us how that variable is distributed 

- Easy way to spot anomalies, such as outliers

- PDF is often the basis on which we decide whether we want to transform a feature

- seaborn.distplot(): this function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions; it can also fit scipy.stats distributions and plot the estimated PDF over the data
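
Note: sns.distplot() was deprecated in seaborn 0.11 and removed in later releases; if the calls below fail on a recent install, the closest modern equivalent (assuming a recent seaborn) is:

sns.histplot(df['Price'], kde=True)   # histogram with a KDE overlay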

Seaborn distplot ('Price')

- Obviously there are some outliers 

- Outliers are situated around the higher prices (right side of the graph)

- Dealing with the problem -> easily done by removing the top 1% of observations
In [8]:
sns.distplot(df['Price'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c340c110>

Removing the Outliers ('Price')

   - Eliminate Outliers by keeping the data below the 99th percentile
In [9]:
df = df[df['Price'] < df['Price'].quantile(0.99)]
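
Since the same percentile filter is applied again to 'Mileage' below, the pattern could optionally be wrapped in a small helper (a hypothetical convenience, not what this notebook uses):

def keep_below_quantile(data, column, q=0.99):
    """Keep only the rows where `column` is below its q-th quantile."""
    return data[data[column] < data[column].quantile(q)]

# e.g. df = keep_below_quantile(df, 'Price')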

Seaborn distplot ('Price') after removing outliers

- While the maximum value is still far away from the mean, it is acceptably closer
In [10]:
sns.distplot(df['Price'])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2b09410>

Seaborn distplot ('Mileage')

- Obviously there are some outliers 

- Outliers are situated around the higher mileages (right side of the graph)

- Dealing with the problem -> easily done by removing the top 1% of samples
In [11]:
sns.distplot(df['Mileage'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2a19550>

Removing the Outliers ('Mileage')

    - Eliminate Outliers by keeping the data below the 99th percentile
In [12]:
df = df[df['Mileage'] < df['Mileage'].quantile(0.99)]

Seaborn distplot ('Mileage') after removing outliers

In [13]:
sns.distplot(df['Mileage'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c299eb90>

Seaborn distplot ('EngineV')

- Obviously there are some outliers 

- Outliers are situated around the higher engine volumes (right side of the graph)

- Car engine volumes are usually below 6.5

- Dealing with the problem -> easily done by removing the samples with values above 6.5
In [14]:
sns.distplot(df['EngineV'])
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c28e1cd0>

Removing the Outliers ('EngineV')

- The issue comes from the fact that most missing values are indicated with 99.99 or 99 (see the quick check below)

- There are also some incorrect entries like 75

- Car engine volumes are usually below 6.5

- Eliminate Outliers by keeping the data below 6.5
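
That claim is easy to verify before filtering (a quick sketch; run it before the cell below):

df[df['EngineV'] > 6.5]['EngineV'].value_counts()
# per the notes above, 99.99 and 99 should dominate, along with a few
# clearly wrong entries such as 75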
In [15]:
df = df[df['EngineV'] < 6.5]

Seaborn distplot ('EngineV') after removing outliers

In [16]:
sns.distplot(df['EngineV'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c2786710>

Seaborn distplot ('Year')

- Obviously there are some outliers 

- Outliers are situated on the left side of the graph (the oldest cars)

- Dealing with the problem -> easily done by removing the bottom 1% of samples
In [17]:
sns.distplot(df['Year'])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c26ea850>

Removing the Outliers ('Year')

- Eliminate Outliers by keeping the data above the 1st percentile
In [18]:
df = df[df['Year'] > df['Year'].quantile(0.01)]

Seaborn distplot 'Year' after removing outliers

In [19]:
sns.distplot(df['Year'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58c264e850>

Reset the Index (after removing observations)

- The original indexes are preserved after removing observations

- If we remove the observations with indexes 2 and 3, the remaining indexes will be 0, 1, 4, 5, 6 (see the toy illustration below)

- Once we reset the index, a new column will be created containing the old index 

- To drop that column, add -> 'drop=True'
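
A toy illustration of this behaviour (a hypothetical 5-row frame, not the car data):

pd.DataFrame({'a': [10, 20, 30, 40, 50]}).drop([2, 3]).reset_index(drop=True)
# without reset_index the remaining labels would be 0, 1, 4;
# with drop=True they become 0, 1, 2 and the old index is discarded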
In [20]:
df = df.reset_index(drop=True)

Descriptive Statistics after removing all outliers

In [21]:
df.describe()
Out[21]:
       Price          Mileage      EngineV      Year
count  3867.000000    3867.000000  3867.000000  3867.000000
mean   18194.455679   160.542539   2.450440     2006.709853
std    19085.855165   95.633291    0.949366     6.103870
min    800.000000     0.000000     0.600000     1988.000000
25%    7200.000000    91.000000    1.800000     2003.000000
50%    11700.000000   157.000000   2.200000     2008.000000
75%    21700.000000   225.000000   3.000000     2012.000000
max    129222.000000  435.000000   6.300000     2016.000000

Saving Cleaned Data

In [22]:
df.to_csv('carprice_editted.csv', index=False)
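
As an optional sanity check, the saved file can be read back to confirm the cleaned shape:

pd.read_csv('carprice_editted.csv').shape   # expect (3867, 8)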