Sklearn (Data Preprocessing Tools)

Overview

- Importing the Relevant Libraries

- Loading the Data

- Declaring the Dependent and the Independent variables

- Handling Missing Values

- Encoding Categorical Data

    - One Hot Encoding the Independent Variable (Country)

    - Label Encoding the Dependent Variable (Yes/No)

- Applying Feature Scaling before or after Splitting?

- Splitting the dataset into the Training set and Test set

- Feature Scaling on Numerical Data

- Feature Scaling

Importing the Relevant Libraries

In [1]:
import numpy as np
import pandas as pd

Loading the Data

In [2]:
url = "https://datascienceschools.github.io/Machine_Learning/Data_Preprocessing/Data_Preprocessing.csv"

df = pd.read_csv(url)

df.head()
Out[2]:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
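
Before handling them later, it helps to see where the missing values actually are; a minimal pandas check on the df loaded above:

df.isnull().sum()   # Age and Salary each contain one NaN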

Declaring the Dependent and the Independent Variables

- X : (independent variables) -> inputs or features -> Country, Age, Salary
- y : (dependent variable) -> output or target -> Purchased
In [3]:
X = df.iloc[:, :-1].values

y = df.iloc[:, -1].values
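
A quick sanity check on the shapes, assuming the 10-row dataset shown above:

print(X.shape)   # (10, 3) -> 10 rows, 3 features (Country, Age, Salary)
print(y.shape)   # (10,)   -> one target value per row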

Handling Missing Values

- Replacing missing numerical values with their column average using SimpleImputer (a median variant is sketched after the output below)

- SimpleImputer class from the impute module of the sklearn library

- imputer -> an object of the SimpleImputer class

- fit method -> looks at the missing values & computes each column's average

    - Numerical Columns (age & salary): X[:, 1:3]

- transform method -> replaces the missing ages & salaries with those averages

- Update & Save Changes -> X[:, 1:3] = 
In [4]:
X[:, 1:3]
Out[4]:
array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, nan],
       [35.0, 58000.0],
       [nan, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)
In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values= np.nan, strategy='mean')

imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])
In [6]:
X[:, 1:3]
Out[6]:
array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, 63777.77777777778],
       [35.0, 58000.0],
       [38.77777777777778, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)
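
The mean is not the only strategy SimpleImputer offers; 'median' and 'most_frequent' also exist. A minimal drop-in variant of the cell above using the median, which is more robust to outliers (the filled-in values would then differ from the output shown):

import numpy as np
from sklearn.impute import SimpleImputer

# 'median' replaces each NaN with the column median instead of the mean
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # fit & transform in one call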

Encoding Categorical Data

One Hot Encoding the Independent Variable (Country)

- One Hot Encoding for a feature with several categories (a drop='first' variant is sketched after the output below)

- ColumnTransformer class from the compose module of the sklearn library

- OneHotEncoder class from the preprocessing module of the sklearn library

- ct -> an object of the ColumnTransformer class

- 'encoder' -> the name given to this transformation step

- OneHotEncoder() -> the transformer that performs the encoding

- [0] -> indexes of the columns to one-hot encode (here only Country)

- remainder='passthrough' -> pass through all columns not specified in transformers

- ct.fit_transform(X) -> fitting & transforming X at the same time

- Update & Save Changes -> X = 

- np.array -> the machine learning model expects a NumPy array
In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

X
Out[7]:
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
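
For linear models it is common to drop one dummy column per feature to avoid the dummy variable trap (perfect multicollinearity); OneHotEncoder supports this via drop='first'. A minimal variant of the cell above, in which France (alphabetically first) becomes the implicit baseline:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Variant of the cell above: drop the first dummy column (France)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))   # only two dummy columns remain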

Label Encoding the Dependent Variable (Yes/No)

- Label encoding for a feature with two classes

- LabelEncoder class from the preprocessing module of the sklearn library

- le -> an object of the LabelEncoder class

- le.fit_transform(y) -> fitting & transforming y at the same time

- Update & Save Changes -> y = 
In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)

y
Out[8]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
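
The fitted encoder keeps the class mapping, so the integers can always be traced back to the original labels; a quick check on the le object above:

print(le.classes_)                    # ['No' 'Yes'] -> No = 0, Yes = 1

print(le.inverse_transform([0, 1]))   # ['No' 'Yes']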

Applying Feature Scaling before or after Splitting?

- Apply feature scaling after splitting the dataset into the training and test set

- The training set is for training the model on existing observations

- The test set is for evaluating the model's performance on new, unseen observations

- The scaler must therefore be fit on the training set only; fitting it on the whole dataset would leak information about the test set into training

- Scaling makes sure all training values are on the same scale

- It prevents some features from being dominated by others & neglected by the model

- Normalization: recommended when most features follow a normal distribution (a MinMaxScaler sketch follows this list)

- Standardization: works well all the time
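
Normalization is mentioned above but never used later in this notebook; as a self-contained sketch, MinMaxScaler rescales each feature to [0, 1], following the same rule of fitting on training data only (the toy arrays here are illustrative, not from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[27.0], [50.0], [38.0]])   # toy training column (e.g. Age)
test = np.array([[44.0]])                    # toy test column

mm = MinMaxScaler()
train_scaled = mm.fit_transform(train)   # fit on the training data only
test_scaled = mm.transform(test)         # reuse the training min & max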

Splitting the dataset into the Training set and Test set

- train_test_split function from the model_selection module of sklearn

- X -> the matrix of features, in a form the machine learning model accepts

- y -> the dependent variable vector

- test_size = 0.2 -> 80% of observations in the training set & 20% in the test set

- train_test_split splits arrays or matrices into random train and test subsets

- with random_state=1, the split will always be the same (reproducible)

- without specifying random_state, you get a different split every time
In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
In [10]:
print(X_train)
[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
In [11]:
print(y_train)
[0 1 0 0 1 1 0 1]
In [12]:
print(X_test)
[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
In [13]:
print(y_test)
[0 1]
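
With imbalanced classes, train_test_split can also keep the class proportions identical in both subsets through its stratify parameter; a minimal variant of the call above:

# Variant: keep the Yes/No proportions equal in the training & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)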

Feature Scaling on Numerical Data

- Only apply feature scaling to the numerical columns, not to the dummy variables

- The goal of feature scaling is to have all the values in the same range

- Standardization puts most values between -3 & +3

- Dummy variables are either 1 or 0 => already between -3 & +3, and scaling them would destroy their simple one-hot interpretation
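
Standardization applies z = (x - mean) / standard deviation per column, which is what produces the roughly -3 to +3 range; a minimal NumPy sketch using the (rounded) training ages from X_train above:

import numpy as np

ages = np.array([38.78, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

z = (ages - ages.mean()) / ages.std()   # same formula StandardScaler applies

print(z.round(2))   # all values fall roughly between -3 & +3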

Feature Scaling

- StandardScaler class from the preprocessing module of sklearn

- sc -> an object of the StandardScaler class

- sc.fit_transform(X_train[:, 3:]) -> only the numerical columns

        - fit computes the mean & standard deviation of each feature

        - transform applies the standardization formula to the values

- sc.transform(X_test[:, 3:]) -> only the numerical columns

        - the same fitted scaler transforms X_test, reusing the training means & standard deviations

- Update & Save Changes -> X_train[:, 3:] = , X_test[:, 3:] =
In [14]:
X_train[:, 3:]
Out[14]:
array([[38.77777777777778, 52000.0],
       [40.0, 63777.77777777778],
       [44.0, 72000.0],
       [38.0, 61000.0],
       [27.0, 48000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [35.0, 58000.0]], dtype=object)
In [15]:
X_test[:, 3:]
Out[15]:
array([[30.0, 54000.0],
       [37.0, 67000.0]], dtype=object)
In [16]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

X_test[:, 3:] = sc.transform(X_test[:, 3:])
In [17]:
print(X_train)
[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
In [18]:
print(X_test)
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
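
If the original units are ever needed again, e.g. for reporting, the fitted scaler can reverse the transformation; a quick check with the sc object above:

# Map the scaled test columns back to the original age & salary units
print(sc.inverse_transform(X_test[:, 3:]))   # [[30.0, 54000.0], [37.0, 67000.0]]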