Case Study (Car Price):

Linear Regression - Dummy Variables

Overview

- Importing the relevant libraries

- Loading data

- Dummy Variables

- Rearranging Columns

    - Column Values

    - Reordering Columns

- Save Changes

Importing the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

Loading data

In [3]:
url = "https://datascienceschools.github.io/Machine_Learning/CaseStudy/LinearRegression/carprice_editted2.csv"

df = pd.read_csv(url)

df.head()
Out[3]:
           Brand       Body  Mileage  EngineV Engine Type Registration  log_price
0            BMW      sedan      277      2.0      Petrol          yes   8.342840
1  Mercedes-Benz        van      427      2.9      Diesel          yes   8.974618
2  Mercedes-Benz      sedan      358      5.0         Gas          yes   9.495519
3           Audi  crossover      240      4.2      Petrol          yes  10.043249
4         Toyota  crossover      120      2.0      Petrol          yes   9.814656
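
Before encoding, it can help to check which columns are categorical, since those are the ones that will be turned into dummies while numeric columns are left alone. The sketch below is an illustration only: `sample` is a hand-typed copy of the first three rows shown above (so it runs without downloading the file), and the names `categorical` and `numeric` are assumptions introduced here.

```python
import pandas as pd

# Hand-typed copy of the first three rows of the table above (illustration only)
sample = pd.DataFrame({
    "Brand": ["BMW", "Mercedes-Benz", "Mercedes-Benz"],
    "Body": ["sedan", "van", "sedan"],
    "Mileage": [277, 427, 358],
    "EngineV": [2.0, 2.9, 5.0],
    "Engine Type": ["Petrol", "Diesel", "Gas"],
    "Registration": ["yes", "yes", "yes"],
    "log_price": [8.342840, 8.974618, 9.495519],
})

# Non-numeric columns are the candidates for dummy encoding
categorical = sample.select_dtypes(exclude="number").columns.tolist()
# Numeric columns pass through the encoding step unchanged
numeric = sample.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical)
print("numeric:", numeric)
```

On the full dataset the same two calls could be applied to `df` directly.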

Dummy Variables

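- One dummy per categorical variable must be dropped (drop_first=True). If every category is kept, the dummies of a variable sum to 1 in each row, which makes them perfectly collinear with the regression intercept (the dummy variable trap). The dropped category becomes the baseline against which the others are compared.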
In [5]:
# dtype=int keeps the dummies as 0/1 (pandas 2.0+ returns booleans by default)
df = pd.get_dummies(df, drop_first=True, dtype=int)

df.head()
Out[5]:
Mileage EngineV log_price Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen Body_hatch Body_other Body_sedan Body_vagon Body_van Engine Type_Gas Engine Type_Other Engine Type_Petrol Registration_yes
0 277 2.0 8.342840 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 427 2.9 8.974618 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1
2 358 5.0 9.495519 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1
3 240 4.2 10.043249 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
4 120 2.0 9.814656 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1
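
To see why one dummy has to go, the following minimal sketch compares the rank of a design matrix built with and without `drop_first`. It uses a made-up `toy` frame with a single `Body` column rather than the car price data, and the helper `design_matrix` is a hypothetical name introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not taken from the car price dataset
toy = pd.DataFrame({"Body": ["sedan", "van", "hatch", "sedan", "van", "hatch"]})

# All categories encoded: the dummy columns sum to 1 in every row
full = pd.get_dummies(toy, drop_first=False, dtype=int)
# First category dropped: it becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

X_full = design_matrix(full)
X_reduced = design_matrix(reduced)

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# Fewer independent columns than columns means perfect multicollinearity
print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

With all three dummies plus the intercept, one column is redundant, so the coefficients are not uniquely determined. Dropping one dummy restores full column rank.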

Rearranging Columns

- Conventionally, the most intuitive column order is:

 - Dependent variable
 - Independent numerical variables
 - Dummy variables

Column Values

In [6]:
df.columns.values
Out[6]:
array(['Mileage', 'EngineV', 'log_price', 'Brand_BMW',
       'Brand_Mercedes-Benz', 'Brand_Mitsubishi', 'Brand_Renault',
       'Brand_Toyota', 'Brand_Volkswagen', 'Body_hatch', 'Body_other',
       'Body_sedan', 'Body_vagon', 'Body_van', 'Engine Type_Gas',
       'Engine Type_Other', 'Engine Type_Petrol', 'Registration_yes'],
      dtype=object)

Reordering Columns

In [7]:
cols = ['log_price', 'Mileage', 'EngineV', 'Brand_BMW',
       'Brand_Mercedes-Benz', 'Brand_Mitsubishi', 'Brand_Renault',
       'Brand_Toyota', 'Brand_Volkswagen', 'Body_hatch', 'Body_other',
       'Body_sedan', 'Body_vagon', 'Body_van', 'Engine Type_Gas',
       'Engine Type_Other', 'Engine Type_Petrol', 'Registration_yes']

df = df[cols]

df.head()
Out[7]:
log_price Mileage EngineV Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen Body_hatch Body_other Body_sedan Body_vagon Body_van Engine Type_Gas Engine Type_Other Engine Type_Petrol Registration_yes
0 8.342840 277 2.0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 8.974618 427 2.9 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1
2 9.495519 358 5.0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1
3 10.043249 240 4.2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
4 9.814656 120 2.0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1
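
Typing the full column list by hand works, but it is easy to miss a column when there are many dummies. As an alternative sketch, the target can be moved to the front programmatically while every other column keeps its existing order. The `demo` frame below is a cut-down stand-in that reuses values from the first two rows above, and `target` and `ordered_cols` are names assumed for this example.

```python
import pandas as pd

# Cut-down stand-in for the encoded frame (values from the first two rows above)
demo = pd.DataFrame({
    "Mileage": [277, 427],
    "EngineV": [2.0, 2.9],
    "log_price": [8.342840, 8.974618],
    "Brand_BMW": [1, 0],
    "Registration_yes": [1, 1],
})

target = "log_price"
# Target first, then all remaining columns in their current order
ordered_cols = [target] + [c for c in demo.columns if c != target]
demo = demo[ordered_cols]

print(list(demo.columns))
```

This relies on the numerical columns already preceding the dummies, which is the layout shown in the column values above.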

Save Changes

In [8]:
df.to_csv('carprice_editted3.csv', index=False)
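
A quick way to confirm that a save went as intended is to read the file back and compare it with the frame in memory. The sketch below does this with a small stand-in frame and a temporary directory, so it does not touch the real output file. The names `demo`, `path` and `reloaded` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame (values from the first two rows of the data)
demo = pd.DataFrame({
    "log_price": [8.342840, 8.974618],
    "Mileage": [277, 427],
    "Brand_BMW": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # index=False avoids writing the row index as an extra unnamed column
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(demo.columns)
same_shape = reloaded.shape == demo.shape
print(same_columns, same_shape)
```

Without `index=False`, the reloaded frame would pick up an additional column holding the old row index.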