Data Normalization

- Normalization -> scaling a variable so its values fall in a common range, e.g. between 0 and 1 (min-max scaling) or zero mean and unit variance (standardization, which is what the cells below use)

1. Converting categorical data to numerical data
2. Scaling the variables to a common range (a min-max sketch follows below)
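
For instance, min-max scaling is the simplest way to get the 0-to-1 range; here is a minimal sketch with scikit-learn's MinMaxScaler (the cells below use StandardScaler instead, which rescales to zero mean and unit variance):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample monthly-hours values taken from the first rows of the dataset loaded below.
hours = np.array([[157.0], [262.0], [272.0], [223.0], [159.0]])

scaler = MinMaxScaler()                      # maps each column onto the [0, 1] range
print(scaler.fit_transform(hours).ravel())   # 157 (the min) becomes 0.0, 272 (the max) becomes 1.0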
In [91]:
import pandas as pd

df = pd.read_csv('hr_satisfaction.csv')

df.head()
Out[91]:
employee_id number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary satisfaction_level last_evaluation
0 1003 2 157 3 0 1 0 sales low 0.38 0.53
1 1005 5 262 6 0 1 0 sales medium 0.80 0.86
2 1486 7 272 4 0 1 0 sales medium 0.11 0.88
3 1038 5 223 5 0 1 0 sales low 0.72 0.87
4 1057 2 159 3 0 1 0 sales low 0.37 0.52

Converting Categorical Data to Numerical Data

Finding Categorical Columns

In [76]:
df.select_dtypes(exclude=['int', 'float']).columns
Out[76]:
Index(['department', 'salary'], dtype='object')

Converting Categorical Columns to Numeric

In [77]:
categorical = ['department', 'salary']
df = pd.get_dummies(df, columns=categorical, drop_first=True)
df.head()
Out[77]:
employee_id number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years satisfaction_level last_evaluation department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium
0 1003 2 157 3 0 1 0 0.38 0.53 0 0 0 0 0 0 1 0 0 1 0
1 1005 5 262 6 0 1 0 0.80 0.86 0 0 0 0 0 0 1 0 0 0 1
2 1486 7 272 4 0 1 0 0.11 0.88 0 0 0 0 0 0 1 0 0 0 1
3 1038 5 223 5 0 1 0 0.72 0.87 0 0 0 0 0 0 1 0 0 1 0
4 1057 2 159 3 0 1 0 0.37 0.52 0 0 0 0 0 0 1 0 0 1 0
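
Note how drop_first=True drops one level per column to avoid perfectly collinear dummy columns: salary has three levels (high/low/medium), and only salary_low and salary_medium appear above, so 'high' is the all-zeros baseline. A small sketch on a hypothetical toy frame (not part of the original notebook) makes this visible:

import pandas as pd

# Toy example: three salary levels, first level dropped as the baseline.
demo = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})
print(pd.get_dummies(demo, columns=['salary'], drop_first=True))
# Only salary_low and salary_medium remain; a 'high' row is all zeros/False.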

Preparing Data for Machine Learning

- Removing the label values from the training data

   X = df.drop(['left'],axis=1).values

- Assigning the label values to Y

   Y = df['left'].values

- Splitting the data into train and test sets in a 70:30 ratio

   X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
In [78]:
from sklearn.model_selection import train_test_split


X = df.drop(['left'],axis=1).values
Y = df['left'].values


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
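
By default train_test_split shuffles with a fresh random seed on every run, so the exact rows landing in X_train and X_test change each time. A variation of the call above (an illustration, not the cell that produced the outputs below) makes the split reproducible and keeps the class balance of left:

from sklearn.model_selection import train_test_split

# random_state pins the shuffle; stratify keeps the proportion of left = 0/1
# roughly the same in the train and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)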

Data Normalization

- Standardizing each feature to zero mean and unit variance with StandardScaler; the scaler is fitted on the training set only and then applied to the test set

   sc = StandardScaler()
   X_train = sc.fit_transform(X_train)
   X_test = sc.transform(X_test)
In [87]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
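
A quick way to see what the scaler learned: StandardScaler stores the per-feature statistics computed from X_train in its mean_ and scale_ attributes, and those same numbers are reused when transforming X_test (fitting on the test set as well would leak test information into training). A small check, added here as a sketch:

print(sc.mean_[:3])    # per-feature means learned from the training data
print(sc.scale_[:3])   # per-feature standard deviations learned from the training data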
In [80]:
X_train
Out[80]:
array([[-1.55717155,  1.78277092, -1.39151514, ...,  2.11045634,
        -0.97451271,  1.14833956],
       [-0.04630719,  0.96780545, -0.54707853, ..., -0.47383117,
        -0.97451271,  1.14833956],
       [ 1.85219558, -0.66212548, -1.65288838, ..., -0.47383117,
         1.02615388, -0.87082256],
       ...,
       [-1.37897637, -1.47709095, -0.96929684, ...,  2.11045634,
         1.02615388, -0.87082256],
       [ 0.36777783,  0.15283999, -0.84866303, ...,  2.11045634,
        -0.97451271,  1.14833956],
       [ 1.62176543, -1.47709095, -1.17035317, ..., -0.47383117,
        -0.97451271,  1.14833956]])
In [81]:
X_test
Out[81]:
array([[ 1.28574245, -0.66212548,  1.28253414, ..., -0.47383117,
        -0.97451271,  1.14833956],
       [-0.08928777, -0.66212548, -0.60739543, ..., -0.47383117,
        -0.97451271,  1.14833956],
       [-0.15182259, -0.66212548, -0.32591656, ..., -0.47383117,
        -0.97451271,  1.14833956],
       ...,
       [ 1.41948564,  0.96780545,  0.49841443, ..., -0.47383117,
        -0.97451271, -0.87082256],
       [-1.53905   , -1.47709095, -1.27088134, ...,  2.11045634,
         1.02615388, -0.87082256],
       [ 0.17250658, -0.66212548,  0.7195764 , ...,  2.11045634,
         1.02615388, -0.87082256]])
In [82]:
df_X_train = pd.DataFrame(X_train)

df_X_train.head()
Out[82]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 -1.557172 1.782771 -1.391515 0.344839 2.427680 -0.151642 -1.943668 0.130402 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 -0.418537 2.110456 -0.974513 1.148340
1 -0.046307 0.967805 -0.547079 -0.342940 2.427680 -0.151642 -0.371133 -0.983530 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 1.601570 -0.418537 -0.473831 -0.974513 1.148340
2 1.852196 -0.662125 -1.652888 -1.030719 2.427680 -0.151642 1.523973 0.013146 -0.23149 -0.230575 -0.227581 -0.205648 4.083514 -0.24858 -0.624387 -0.418537 -0.473831 1.026154 -0.870823
3 -1.232571 -1.477091 -0.366128 0.344839 -0.411916 -0.151642 0.394974 -0.866274 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 1.601570 -0.418537 -0.473831 1.026154 -0.870823
4 -1.311292 1.782771 2.066654 0.344839 -0.411916 -0.151642 -2.024310 0.482170 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 -0.418537 2.110456 1.026154 -0.870823
In [83]:
df_X_test = pd.DataFrame(X_test)

df_X_test.head()
Out[83]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 1.285742 -0.662125 1.282534 -0.342940 -0.411916 -0.151642 -0.451776 0.306286 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 -0.418537 -0.473831 -0.974513 1.148340
1 -0.089288 -0.662125 -0.607395 0.344839 2.427680 -0.151642 -1.258204 1.478845 -0.23149 -0.230575 -0.227581 4.862668 -0.244887 -0.24858 -0.624387 -0.418537 -0.473831 -0.974513 1.148340
2 -0.151823 -0.662125 -0.325917 -0.342940 -0.411916 -0.151642 -0.290490 0.306286 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 2.389273 -0.473831 -0.974513 1.148340
3 -0.084486 -1.477091 -0.647607 1.032619 -0.411916 -0.151642 -1.217882 -1.745694 -0.23149 -0.230575 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 -0.418537 -0.473831 1.026154 -0.870823
4 -0.855504 0.152840 -0.044438 0.344839 -0.411916 -0.151642 0.112724 1.361589 -0.23149 4.336984 -0.227581 -0.205648 -0.244887 -0.24858 -0.624387 -0.418537 -0.473831 -0.974513 1.148340
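
As a sanity check (a sketch added here, not an original cell), each standardized training column should now have a mean near 0 and a standard deviation near 1:

print(df_X_train.mean().round(2).head())
print(df_X_train.std().round(2).head())   # close to 1 (pandas uses ddof=1, so not exactly 1)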

Result: the data is now ready to create and fit your model
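
Purely as an illustration of that next step (the classifier and its settings here are arbitrary choices, not part of this notebook), the prepared arrays can be passed directly to any scikit-learn estimator:

from sklearn.linear_model import LogisticRegression

# Fit an example classifier on the standardized training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))   # accuracy on the held-out 30% test split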