import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
url = "https://datascienceschools.github.io/cars_price.csv"
df = pd.read_csv(url)
df.head()
- Descriptive statistics: by default, describe() reports only the numerical variables
- To include the categorical ones as well, pass include='all'
- Categorical variables have no numerical descriptives (mean, std, quartiles), so those cells show NaN
- Numerical variables have no categorical descriptives (unique, top, freq), so those cells show NaN
df.describe(include='all')
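- A minimal sketch of the difference, on a made-up two-column frame (the values below are hypothetical, not taken from cars_price.csv):

```python
import pandas as pd

# Toy data (hypothetical values, not from the cars dataset)
toy = pd.DataFrame({
    "Brand": ["BMW", "Audi", "BMW", "Toyota"],
    "Price": [4200.0, 7900.0, 13300.0, 23000.0],
})

numeric_only = toy.describe()              # numerical columns only
everything = toy.describe(include="all")   # numerical and categorical columns

# Categorical descriptives (unique, top, freq) are NaN for 'Price',
# numerical descriptives (mean, std, ...) are NaN for 'Brand'.
print(everything)
```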
- 'Model' is a categorical variable with 306 unique values
- 306 unique values -> roughly 306 dummy variables (305 if one category is dropped as the baseline), which makes the regression hard to fit and interpret
- Let's drop the column
df = df.drop('Model', axis=1)
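- To see why a high-cardinality column is a problem, here is a small sketch with a hypothetical 'Model' column: the number of dummy columns follows the number of unique values

```python
import pandas as pd

# Hypothetical model names, for illustration only
toy = pd.DataFrame({"Model": ["320", "A6", "Corolla", "320", "X5"]})

n_unique = toy["Model"].nunique()                             # distinct categories
dummies = pd.get_dummies(toy["Model"])                        # one column per category
dummies_ref = pd.get_dummies(toy["Model"], drop_first=True)   # baseline category dropped

print(n_unique, dummies.shape[1], dummies_ref.shape[1])
```

- With 306 categories the same call would add hundreds of columns to the regression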
df.isnull().sum()
df = df.dropna(axis=0)
df.isnull().sum()
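- The three steps above (count missing values, drop incomplete rows, count again) can be sketched on a toy frame; the values are hypothetical. It is worth checking how many rows are lost before dropping them

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "Price": [4200.0, np.nan, 13300.0, 23000.0],
    "EngineV": [2.0, 2.9, np.nan, 4.2],
})

missing_before = toy.isnull().sum()    # missing values per column
cleaned = toy.dropna(axis=0)           # drop every row with at least one NaN
missing_after = cleaned.isnull().sum()

# Share of rows lost by dropping incomplete observations
share_dropped = 1 - len(cleaned) / len(toy)
print(missing_before, share_dropped, sep="\n")
```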
df.describe()
- Displaying the probability density function (PDF) of a variable
- The PDF shows how the variable is distributed
- It makes anomalies, such as outliers, easy to spot
- The PDF is often the basis for deciding whether to transform a feature
- seaborn.distplot() -> combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data
- Note: distplot() is deprecated since seaborn 0.11; histplot() and displot() are its replacements
- The Price distribution has a long right tail: a few cars are far more expensive than the rest
- These outliers sit at the high-price end (right side of the graph)
- One simple fix -> keep only the observations below the 99th percentile, i.e. remove the top 1% of prices
sns.distplot(df['Price'])
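- What the density histogram encodes can be checked numerically. A minimal sketch using NumPy only, on a synthetic right-skewed sample (the lognormal parameters are an assumption chosen for illustration, not fitted to the cars data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for a price column
sample = rng.lognormal(mean=9.0, sigma=0.8, size=1000)

# density=True rescales bar heights so the total area equals 1
density, edges = np.histogram(sample, bins=30, density=True)
area = np.sum(density * np.diff(edges))

# A long right tail pulls the mean above the median
mean, median = sample.mean(), np.median(sample)
print(area, mean > median)
```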
df = df[df['Price'] < df['Price'].quantile(0.99)]
sns.distplot(df['Price'])
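- The quantile filter used above, sketched on a toy column with a single extreme value (hypothetical numbers):

```python
import pandas as pd

# 99 ordinary prices plus one extreme value: 100 rows in total
toy = pd.DataFrame({"Price": list(range(1, 100)) + [10_000]})

cutoff = toy["Price"].quantile(0.99)     # 99th percentile (linear interpolation)
trimmed = toy[toy["Price"] < cutoff]     # keep everything strictly below it

print(len(toy), len(trimmed), trimmed["Price"].max())
```

- The cutoff is computed from the data itself, so the filter always removes about 1% of the rows, whether or not they are genuine outliers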
- Mileage shows the same pattern: a long right tail
- The outliers sit at the high-mileage end (right side of the graph)
- Same fix -> keep only the observations below the 99th percentile
sns.distplot(df['Mileage'])
df = df[df['Mileage'] < df['Mileage'].quantile(0.99)]
sns.distplot(df['Mileage'])
- EngineV also has extreme values on the right side of the graph
- Here we can use domain knowledge instead of a quantile: car engine volumes are usually below 6.5
- Fix -> keep only the observations with EngineV below 6.5
sns.distplot(df['EngineV'])
- The issue comes from the fact that most missing values are encoded as 99.99 or 99
- There are also some incorrect entries, such as 75
- Since car engine volumes are usually below 6.5, keeping only values below 6.5 removes both kinds of bad entries
df = df[df['EngineV'] < 6.5]
sns.distplot(df['EngineV'])
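- Before filtering on a fixed threshold, it can help to look at what falls above it. A minimal sketch on made-up engine volumes that include the placeholder values mentioned above:

```python
import pandas as pd

# Hypothetical engine volumes, including placeholder and mistyped entries
toy = pd.DataFrame({"EngineV": [1.6, 2.0, 3.0, 4.4, 75.0, 99.0, 99.99]})

suspicious = toy[toy["EngineV"] >= 6.5]   # rows the filter will remove
kept = toy[toy["EngineV"] < 6.5]          # rows that survive the filter

print(suspicious["EngineV"].tolist())
print(kept["EngineV"].max())
```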
- Year has outliers too, this time on the left side of the graph (very old cars)
- Fix -> remove the bottom 1% by keeping only the observations above the 1st percentile
sns.distplot(df['Year'])
df = df[df['Year'] > df['Year'].quantile(0.01)]
sns.distplot(df['Year'])
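- One caveat with a discrete variable such as Year: many rows can share the cutoff value, and the strict '>' removes all of them. A small sketch with hypothetical years, where the filter drops 5% of the rows rather than 1%:

```python
import pandas as pd

# 5 rows from 1980, 5 from 1995, 90 from 2010 (hypothetical data)
toy = pd.DataFrame({"Year": [1980] * 5 + [1995] * 5 + [2010] * 90})

cutoff = toy["Year"].quantile(0.01)   # falls on 1980, a value shared by 5 rows
kept = toy[toy["Year"] > cutoff]      # strict '>' drops every row equal to the cutoff
removed = len(toy) - len(kept)

print(cutoff, removed)
```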
- Removing observations keeps the original index labels
- For example, after removing the observations with indexes 2 and 3, the index reads 0, 1, 4, 5, 6
- reset_index() renumbers the rows from 0, but by default it also adds a column holding the old index
- To discard the old index instead, pass drop=True
df = df.reset_index(drop=True)
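- The index behaviour described above, sketched on a seven-row toy frame (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({"Price": [10, 20, 30, 40, 50, 60, 70]})

filtered = toy.drop(index=[2, 3])          # index is now 0, 1, 4, 5, 6
with_old = filtered.reset_index()          # old index kept as a column named 'index'
clean = filtered.reset_index(drop=True)    # old index discarded

print(list(filtered.index))
print(list(with_old.columns))
print(list(clean.index))
```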
df.describe()