- Importing the relevant libraries
- Loading data
- Exploring the Descriptive Statistics
- Checking Missing Values
- Removing Missing Values
- Dropping 'Model'
- Exploring Descriptive Statistics without the Missing Values
- Exploring Probability Distribution Function (PDF)
- Seaborn distplot ('Price')
- Removing the Outliers ('Price')
- Seaborn distplot ('Price') after removing outliers
- Seaborn distplot 'Mileage'
- Removing the Outliers ('Mileage')
- Seaborn distplot 'Mileage' after removing outliers
- Seaborn distplot 'EngineV'
- Removing the Outliers ('EngineV')
- Seaborn distplot 'EngineV' after removing outliers
- Seaborn distplot 'Year'
- Removing the Outliers ('Year')
- Seaborn distplot 'Year' after removing outliers
- Reset the Index (after removing observations)
- Descriptive Statistics after removing all outliers
- Saving Cleaned Data
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
- The first potential regressor is brand,
as it is well known that a BMW is generally more expensive than a Toyota
- The second relevant variable is mileage,
since the more a car is driven, the cheaper it should be
- The third is engine volume:
sports cars have larger engines and economy cars have smaller engines
- The final variable is year of production:
the older the car, the cheaper it is, with the exception of vintage vehicles
- The rest are categorical variables, which we'll deal with on a case-by-case basis
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/cars_price.csv"
df = pd.read_csv(url)
df.head()
- Descriptive Statistics: an easy way to check the data and spot problems
- Descriptive statistics: by default, only descriptives for the numerical variables
- To include the categorical ones, you should specify include='all'
- Categorical variables don't have some types of numerical descriptives
- Numerical variables don't have some types of categorical descriptives
- Count: each variable has a different number of observations
-> implying there are some missing values
- Unique: unique entries for categorical variables
-> 312 unique models = 312 dummies -> really hard to implement a regression
df.describe(include='all')
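- A quick sanity check (a minimal sketch, assuming the column is named 'Model' and df is the frame loaded above) of the unique-model count behind the dummy-variable concern:
df['Model'].nunique()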
df.isnull().sum()
- axis=0 means rows, while axis=1 stands for columns; dropna(axis=0) drops every row that contains a missing value
df = df.dropna(axis=0)
df.isnull().sum()
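- A minimal sketch of a before/after comparison (it re-reads the raw CSV purely for illustration) showing how many rows dropna() removed:
raw = pd.read_csv(url)
print(len(raw) - len(df), 'rows removed by dropna')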
- Model is a categorical variable with 306 unique values
- 306 unique values -> 306 dummy variables, making it hard to implement regression
- Let's remove it
- axis=0 means rows while axis=1 stands for columns
df = df.drop('Model', axis=1)
- Price:
mean 19552.308065
min 600.000000
25% 6999.000000
50% 11500.000000
75% 21900.000000
max 300000.000000
- Obviously we have a few outliers in the price variable
- Outliers: Observations that lie on abnormal distance from other observations
- Outliers will affect the regression dramatically, causing the coefficients to be inflated, as the regression will try to place the line closer to those values
- A similar issue can be seen in mileage, engine volume and year
df.describe()
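- A minimal sketch for inspecting the Price quantiles directly, including the 99th percentile that will serve as the cut-off below:
df['Price'].quantile([0.25, 0.5, 0.75, 0.99])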
- Displaying PDF of a variable
- The PDF will show us how that variable is distributed
- Easy way to spot anomalies, such as outliers
- PDF is often the basis on which we decide whether we want to transform a feature
- seaborn.distplot() ->
This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.
- Obviously there are some outliers
- Outliers are situated around the higher prices (right side of the graph)
- Dealing with the problem -> easily done by removing the top 1% of the problematic observations
sns.distplot(df['Price'])
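- Note: distplot() is deprecated in recent seaborn releases; a roughly equivalent plot (a sketch, not a drop-in replacement) can be drawn with histplot():
sns.histplot(df['Price'], kde=True)
plt.show()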
- Eliminate Outliers by keeping the data below the 99th percentile
df = df[df['Price'] < df['Price'].quantile(0.99)]
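- A quick follow-up check (a sketch): the remaining sample size and the new maximum price after the trim:
print(df.shape[0], 'observations kept; new max price:', df['Price'].max())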
- While the maximum value is still far away from the mean, it is acceptably closer
sns.distplot(df['Price'])
- Obviously there are some outliers
- Outliers are situated around the higher mileages (right side of the graph)
- Dealing with the problem -> easily done by removing the top 1% of the problematic samples
sns.distplot(df['Mileage'])
- Eliminate Outliers by keeping the data below the 99th percentile
df = df[df['Mileage']< df['Mileage'].quantile(0.99)]
sns.distplot(df['Mileage'])
- Obviously there are some outliers
- Outliers are situated around the higher engine volumes (right side of the graph)
- Car engine volumes are usually below 6.5
- Dealing with the problem -> easily done by removing the problematic samples with volumes above 6.5
sns.distplot(df['EngineV'])
- The issue comes from the fact that most missing values are indicated with 99.99 or 99
- There are also some incorrect entries like 75
- Car engine volumes are usually below 6.5
- Eliminate Outliers by keeping the data below 6.5
df = df[df['EngineV']<6.5]
sns.distplot(df['EngineV'])
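- A quick verification (a sketch): after the filter, the placeholder codes (99.99, 99) and impossible entries are gone, so the maximum should be below 6.5:
print(df['EngineV'].max())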
- Obviously there are some outliers
- Outliers are situated on the left side of the graph
- Dealing with the problem -> easily done by removing the bottom 1% of the problematic samples
sns.distplot(df['Year'])
- Eliminate Outliers by keeping the data above the 1st percentile
df = df[df['Year']> df['Year'].quantile(0.01)]
sns.distplot(df['Year'])
- The original indexes are preserved after removing observations
- If we remove the observations with indexes 2 and 3, the remaining indexes will be 0, 1, 4, 5, 6 (illustrated after the code below)
- Once we reset the index, a column will be created containing the old index
- To drop that column, add -> 'drop=True'
df = df.reset_index(drop=True)
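- A tiny standalone illustration (a sketch on a toy frame, not the car data) of the reset_index behaviour described above:
toy = pd.DataFrame({'a': [10, 20, 30, 40, 50]}).drop([2, 3])
print(toy.index.tolist())                         # [0, 1, 4] -> old labels preserved after dropping rows
print(toy.reset_index().columns.tolist())         # ['index', 'a'] -> old index kept as a column
print(toy.reset_index(drop=True).index.tolist())  # [0, 1, 2] -> clean consecutive index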
df.describe()
df.to_csv('carprice_editted.csv', index=False)
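- Optionally (a sketch), the saved file can be read back to confirm the export:
pd.read_csv('carprice_editted.csv').head()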