- Importing the Relevant Libraries
- Loading the Data
- Declaring the Dependent and the Independent variables
- Handling Missing Values
- Encoding Categorical Data
- One Hot Encoding the Independent Variable (Country)
- Label Encoding the Dependent Variable (Yes/No)
- Applying Feature Scaling before or after splitting?
- Splitting the dataset into the Training set and Test set
- Feature Scaling on Numerical data
- Feature Scaling
import numpy as np
import pandas as pd
url = "https://datascienceschools.github.io/Machine_Learning/Data_Preprocessing/Data_Preprocessing.csv"
df = pd.read_csv(url)
df.head()
- X: independent variables -> inputs or features -> Country, Age, Salary
- y: dependent variable -> output or target -> Purchased
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
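As a quick self-contained illustration of the `iloc` slicing above (using a hypothetical two-row frame with the same column layout, not the course dataset), `[:, :-1]` keeps every column except the last, while `[:, -1]` keeps only the last:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the tutorial's column layout
demo = pd.DataFrame({'Country': ['France', 'Spain'],
                     'Age': [44, 27],
                     'Salary': [72000, 48000],
                     'Purchased': ['No', 'Yes']})

X_demo = demo.iloc[:, :-1].values  # all columns except the last -> features
y_demo = demo.iloc[:, -1].values   # last column -> target

print(X_demo.shape)  # (2, 3)
print(y_demo)        # ['No' 'Yes']
```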
- Replacing missing numerical data with the column average using SimpleImputer
- SimpleImputer class from the impute module of the sklearn library
- imputer -> object of the SimpleImputer class
- fit method -> looks at the missing values & computes each column's average
- Numerical columns (age & salary): X[:, 1:3]
- transform method -> replaces missing age & salary values with their averages
- Update & Save Changes -> X[:, 1:3] =
X[:, 1:3]
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values= np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
X[:, 1:3]
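The mean-imputation step can be checked on a small made-up matrix (not the dataset above): each NaN is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (made-up numbers)
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)

# Column means of the observed values: (1+3)/2 = 2.0 and (10+20)/2 = 15.0
print(filled)
```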
- One Hot Encoding for several categories in one feature
- ColumnTransformer class from the compose module of the sklearn library
- OneHotEncoder class from the preprocessing module of the sklearn library
- ct -> object of the ColumnTransformer class
- 'encoder' -> kind of transformation
- OneHotEncoder() -> kind of encoding
- [0] -> indices of the columns to one-hot encode
- remainder='passthrough' -> pass through all columns not listed in transformers
- ct.fit_transform(X) -> fits & transforms X in one step
- Update & Save Changes -> X =
- np.array -> the machine learning model expects a NumPy array
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X
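On a hypothetical three-row sample (not the course data), the same ColumnTransformer turns one country column into three dummy columns, one per distinct category, and passes the numeric column through untouched:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: country in column 0, one numeric column after it
toy = np.array([['France', 44.0],
                ['Spain', 27.0],
                ['Germany', 30.0]], dtype=object)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
encoded = np.array(ct.fit_transform(toy))

# 3 distinct countries -> 3 dummy columns + the passed-through numeric column
print(encoded.shape)  # (3, 4)
```

Categories are sorted alphabetically, so the first dummy column corresponds to France.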
- Label Encoding for two classes in one feature
- LabelEncoder class from the preprocessing module of sklearn
- le -> object of the LabelEncoder class
- le.fit_transform(y) -> fits & transforms y in one step
- Update & Save Changes -> y =
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y
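A self-contained sketch of the same encoding on a made-up Yes/No list (standing in for the Purchased column): LabelEncoder sorts the classes alphabetically, so 'No' maps to 0 and 'Yes' to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary target values
labels = ['No', 'Yes', 'No', 'Yes', 'Yes']

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))  # ['No', 'Yes']
print(list(encoded))      # [0, 1, 0, 1, 1]
```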
- apply feature scaling after splitting the dataset into the training and test set
- fitting the scaler on the full dataset would leak test-set statistics (mean & std) into training
- training set -> for training the model on existing observations
- test set -> for evaluating the model's performance on new observations
- feature scaling puts all training values on the same scale
- it prevents some features from dominating others & being neglected by the model
- Normalization: recommended when most features follow a normal distribution
- Standardization: works well in almost all cases
- train_test_split function from the model_selection module of sklearn
- X -> matrix of features, in a form the machine learning model accepts
- y -> the dependent variable vector
- test_size = 0.2 -> 80% of observations in the training set & 20% in the test set
- train_test_split splits arrays or matrices into random train and test subsets
- with random_state=1, the split is always the same (reproducible)
- without specifying random_state, you get a different split every time
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(X_train)
print(y_train)
print(X_test)
print(y_test)
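The 80/20 split and the effect of `random_state` can be demonstrated on made-up arrays (hypothetical `X_demo`/`y_demo`, not the course data): two calls with the same seed produce identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 made-up observations, 2 features
y_demo = np.arange(10)

# test_size=0.2 -> 8 training rows, 2 test rows
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Same seed -> exactly the same shuffle and split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

print(Xa_train.shape, Xa_test.shape)       # (8, 2) (2, 2)
print(np.array_equal(Xa_train, Xb_train))  # True
```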
- Only apply feature scaling to numerical values, not to the dummy variables
- the goal of feature scaling is to put all values in the same range
- standardization maps most values to roughly -3 & +3
- dummy variables are already 0 or 1 => already in that range, and scaling them would destroy their interpretation
- StandardScaler class from the preprocessing module of sklearn
- sc -> object of the StandardScaler class
- sc.fit_transform(X_train[:, 3:]) -> only the numerical columns
- fit computes the mean & standard deviation of each feature
- transform applies the standardization formula to the values
- sc.transform(X_test[:, 3:]) -> only the numerical columns
- transform (not fit_transform) from the same scaler is applied to X_test, so the test set is scaled with the training set's mean & std
- Update & Save Changes -> X_train[:, 3:] = , X_test[:, 3:] =
X_train[:, 3:]
X_test[:, 3:]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print(X_test)
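The fit-on-train / transform-on-test pattern can be sketched with a made-up single column (standing in for age or salary): the scaler learns the mean and standard deviation from the training values only and reuses them on the test value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column split into train and test
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[25.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean=30 & std from train only
test_scaled = sc.transform(test)        # reuses the training mean & std

# After standardization the training column has mean ~0 and std ~1
print(train_scaled.mean(), train_scaled.std())
print(test_scaled)  # 25 is below the training mean, so the value is negative
```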