Sklearn (Data Preprocessing Tools)

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- Handling Missing Values

- Encoding Categorical Data

    - One Hot Encoding the Independent Variable (Country)

    - Label Encoding the Dependent Variable (Yes/No)

- Applying Feature Scaling before or after Splitting?

- Splitting the dataset into the Training set and Test set

- Feature Scaling on Numerical Data

- Feature Scaling

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd

Loading the Data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Data_Preprocessing/Data_Preprocessing.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
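
Before handling them later, it helps to see where the missing values actually are; a minimal pandas check on the df loaded above:

df.isnull().sum()   # Age and Salary each contain one NaN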

Declaring the Dependent and the Independent Variables

- X : (independent variables) -> inputs or features -> Country, Age, Salary
- y : (dependent variable) -> output or target -> Purchased
In [3]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values
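
A quick sanity check on the shapes, assuming the 10-row dataset shown above:

print(X.shape)   # (10, 3) -> 10 rows, 3 features (Country, Age, Salary)
print(y.shape)   # (10,)   -> one target value per row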

Handling Missing Values

- Replacing missing numerical values with their column average using SimpleImputer (a median variant is sketched after the output below)

- SimpleImputer class from the impute module of the sklearn library

- imputer -> an object of the SimpleImputer class

- fit method -> looks at the missing values & computes each column's average

    - Numerical Columns (age & salary): X[:, 1:3]

- transform method -> replaces the missing ages & salaries with those averages

- Update & Save Changes -> X[:, 1:3] = 
In [4]:
X[:, 1:3]
Out[4]:
array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, nan],
       [35.0, 58000.0],
       [nan, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)
In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values= np.nan, strategy='mean')

imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])
In [6]:
X[:, 1:3]
Out[6]:
array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, 63777.77777777778],
       [35.0, 58000.0],
       [38.77777777777778, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)
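
The mean is not the only strategy SimpleImputer offers; 'median' and 'most_frequent' also exist. A minimal drop-in variant of the cell above using the median, which is more robust to outliers (the filled-in values would then differ from the output shown):

import numpy as np
from sklearn.impute import SimpleImputer

# 'median' replaces each NaN with the column median instead of the mean
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # fit & transform in one call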

Encoding Categorical Data

One Hot Encoding the Independent Variable (Country)

- One Hot Encoding for a feature with several categories (a drop='first' variant is sketched after the output below)

- ColumnTransformer class from the compose module of the sklearn library

- OneHotEncoder class from the preprocessing module of the sklearn library

- ct -> an object of the ColumnTransformer class

- 'encoder' -> the name given to this transformation step

- OneHotEncoder() -> the transformer that performs the encoding

- [0] -> indexes of the columns to one-hot encode (here only Country)

- remainder='passthrough' -> pass through all columns not specified in transformers

- ct.fit_transform(X) -> fitting & transforming X at the same time

- Update & Save Changes -> X = 

- np.array -> the machine learning model expects a NumPy array
In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

X
Out[7]:
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
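
For linear models it is common to drop one dummy column per feature to avoid the dummy variable trap (perfect multicollinearity); OneHotEncoder supports this via drop='first'. A minimal variant of the cell above, in which France (alphabetically first) becomes the implicit baseline:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Variant of the cell above: drop the first dummy column (France)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))   # only two dummy columns remain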

Label Encoding the Dependent Variable (Yes/No)

- Label encoding for a feature with two classes

- LabelEncoder class from the preprocessing module of the sklearn library

- le -> an object of the LabelEncoder class

- le.fit_transform(y) -> fitting & transforming y at the same time

- Update & Save Changes -> y = 
In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)

y
Out[8]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
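
The fitted encoder keeps the class mapping, so the integers can always be traced back to the original labels; a quick check on the le object above:

print(le.classes_)                    # ['No' 'Yes'] -> No = 0, Yes = 1

print(le.inverse_transform([0, 1]))   # ['No' 'Yes']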

Applying Feature Scaling before or after Splitting?

- Apply feature scaling after splitting the dataset into the training and test set

- The training set is for training the model on existing observations

- The test set is for evaluating the model's performance on new, unseen observations

- The scaler must therefore be fit on the training set only; fitting it on the whole dataset would leak information about the test set into training

- Scaling makes sure all training values are on the same scale

- It prevents some features from being dominated by others & neglected by the model

- Normalization: recommended when most features follow a normal distribution (a MinMaxScaler sketch follows this list)

- Standardization: works well all the time
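
Normalization is mentioned above but never used later in this notebook; as a self-contained sketch, MinMaxScaler rescales each feature to [0, 1], following the same rule of fitting on training data only (the toy arrays here are illustrative, not from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[27.0], [50.0], [38.0]])   # toy training column (e.g. Age)
test = np.array([[44.0]])                    # toy test column

mm = MinMaxScaler()
train_scaled = mm.fit_transform(train)   # fit on the training data only
test_scaled = mm.transform(test)         # reuse the training min & max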

Splitting the dataset into the Training set and Test set

- train_test_split function from the model_selection module of sklearn

- X -> the matrix of features, in a form the machine learning model accepts

- y -> the dependent variable vector

- test_size = 0.2 -> 80% of observations in the training set & 20% in the test set

- train_test_split splits arrays or matrices into random train and test subsets

- with random_state=1, the split will always be the same (reproducible)

- without specifying random_state, you get a different split every time
In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
In [10]:
print(X_train)
[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
In [11]:
print(y_train)
[0 1 0 0 1 1 0 1]
In [12]:
print(X_test)
[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
In [13]:
print(y_test)
[0 1]
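
With imbalanced classes, train_test_split can also keep the class proportions identical in both subsets through its stratify parameter; a minimal variant of the call above:

# Variant: keep the Yes/No proportions equal in the training & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)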

Feature Scaling on Numerical Data

- Only apply feature scaling to the numerical columns, not to the dummy variables

- The goal of feature scaling is to have all the values in the same range

- Standardization puts most values between -3 & +3

- Dummy variables are either 1 or 0 => already between -3 & +3, and scaling them would destroy their simple one-hot interpretation
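
Standardization applies z = (x - mean) / standard deviation per column, which is what produces the roughly -3 to +3 range; a minimal NumPy sketch using the (rounded) training ages from X_train above:

import numpy as np

ages = np.array([38.78, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

z = (ages - ages.mean()) / ages.std()   # same formula StandardScaler applies

print(z.round(2))   # all values fall roughly between -3 & +3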

Feature Scaling

- StandardScaler class from the preprocessing module of sklearn

- sc -> an object of the StandardScaler class

- sc.fit_transform(X_train[:, 3:]) -> only the numerical columns

        - fit computes the mean & standard deviation of each feature

        - transform applies the standardization formula to the values

- sc.transform(X_test[:, 3:]) -> only the numerical columns

        - the same fitted scaler transforms X_test, reusing the training means & standard deviations

- Update & Save Changes -> X_train[:, 3:] = , X_test[:, 3:] =
In [14]:
X_train[:, 3:]
Out[14]:
array([[38.77777777777778, 52000.0],
       [40.0, 63777.77777777778],
       [44.0, 72000.0],
       [38.0, 61000.0],
       [27.0, 48000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [35.0, 58000.0]], dtype=object)
In [15]:
X_test[:, 3:]
Out[15]:
array([[30.0, 54000.0],
       [37.0, 67000.0]], dtype=object)
In [16]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

X_test[:, 3:] = sc.transform(X_test[:, 3:])
In [17]:
print(X_train)
[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
In [18]:
print(X_test)
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
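
If the original units are ever needed again, e.g. for reporting, the fitted scaler can reverse the transformation; a quick check with the sc object above:

# Map the scaled test columns back to the original age & salary units
print(sc.inverse_transform(X_test[:, 3:]))   # [[30.0, 54000.0], [37.0, 67000.0]]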