Case Study (Human Resources Retention):

Classification Algorithms Performance

In [18]:
import pandas as pd

df = pd.read_csv('hr_satisfaction.csv')

df.head()
Out[18]:
employee_id number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary satisfaction_level last_evaluation
0 1003 2 157 3 0 1 0 sales low 0.38 0.53
1 1005 5 262 6 0 1 0 sales medium 0.80 0.86
2 1486 7 272 4 0 1 0 sales medium 0.11 0.88
3 1038 5 223 5 0 1 0 sales low 0.72 0.87
4 1057 2 159 3 0 1 0 sales low 0.37 0.52
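
Before any modelling, it is worth checking how balanced the left label is, since overall accuracy can look good on a skewed class distribution even when the minority class is handled poorly. A minimal check:

    # Fraction of employees who left (1) vs. stayed (0); a skewed split
    # means accuracy should be read alongside per-class precision/recall.
    print(df['left'].value_counts(normalize=True))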

Preparing Data for Machine Learning

- Converting categorical data to numerical data

    categorical = ['department','salary']
    df = pd.get_dummies(df, columns=categorical, drop_first=True)

- Removing the label column from the training data

    X = df.drop(['left'],axis=1).values

- Assigning the label values to the Y array

    Y = df['left'].values

- Splitting the data into a 70:30 train:test ratio

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

- Data standardization (zero mean, unit variance)

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
In [19]:
categorical = ['department','salary']
df = pd.get_dummies(df, columns=categorical, drop_first=True)
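
To make the encoding concrete, here is a self-contained toy example of what get_dummies with drop_first=True produces (the values are illustrative, not taken from the dataset):

    import pandas as pd

    toy = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})
    # Categories are sorted alphabetically, and the first one ('high') is
    # dropped as the implicit baseline, leaving two indicator columns:
    # salary_low and salary_medium.
    print(pd.get_dummies(toy, columns=['salary'], drop_first=True))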
In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(['left'],axis=1).values
Y = df['left'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
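
Note that without a fixed seed, the split (and therefore every score reported below) changes from run to run. A hedged variant, if reproducibility and equal class proportions in both halves are wanted:

    # random_state pins the shuffle for reproducibility; stratify=Y keeps
    # the left/stayed ratio the same in the train and test splits.
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.3, random_state=42, stratify=Y)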
In [6]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
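
The scaler is fit on the training split only and then reused on the test split, which avoids leaking test-set statistics into training. A quick sanity check of what that means (assuming no constant columns in the training data):

    import numpy as np

    # Training columns are standardized exactly (mean 0, std 1); test
    # columns are only approximately standard, since the training
    # statistics are reused.
    print(np.allclose(X_train.mean(axis=0), 0),
          np.allclose(X_train.std(axis=0), 1))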

Import Models & Performance Assessment Classes

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

Logistic Regression Performance

In [9]:
logreg_clf = LogisticRegression()
logreg_model = logreg_clf.fit(X_train, Y_train)
logreg_prediction = logreg_clf.predict(X_test)

# Note: sklearn metrics follow the (y_true, y_pred) convention; passing the
# predictions first, as below, transposes the confusion matrix and swaps
# per-class precision/recall in the report. Accuracy itself is unaffected.
Accuracy = 100*accuracy_score(logreg_prediction, Y_test)
Confusion_Matrix = confusion_matrix(logreg_prediction, Y_test)
Classification_Report = classification_report(logreg_prediction, Y_test)

print("Accuracy is {0:.2f}%\n".format(Accuracy))
print("Confusion Matrix:\n", Confusion_Matrix )
print("\nClassification Report:\n", Classification_Report )
Accuracy is 79.27%

Confusion Matrix:
 [[3151  652]
 [ 281  416]]

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.83      0.87      3803
           1       0.39      0.60      0.47       697

    accuracy                           0.79      4500
   macro avg       0.65      0.71      0.67      4500
weighted avg       0.84      0.79      0.81      4500
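
The 79.27% figure comes from one random split, so it carries some variance. A more stable estimate is k-fold cross-validation; a minimal sketch, using a pipeline so scaling is re-fit inside each fold (max_iter raised as a precaution against convergence warnings):

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # 5-fold CV on the full data; scaling happens inside each fold, so no
    # test-fold statistics leak into training.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, Y, cv=5)
    print("CV accuracy: {0:.2f}% +/- {1:.2f}%".format(
        100*scores.mean(), 100*scores.std()))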

Random Forest Performance

In [10]:
ranfor_clf = RandomForestClassifier()
ranfor_model = ranfor_clf.fit(X_train, Y_train)
ranfor_prediction = ranfor_clf.predict(X_test)

Accuracy = 100*accuracy_score(ranfor_prediction, Y_test)
Confusion_Matrix = confusion_matrix(ranfor_prediction, Y_test)
Classification_Report = classification_report(ranfor_prediction, Y_test)

print("Accuracy is {0:.2f}%\n".format(Accuracy))
print("Confusion Matrix:\n", Confusion_Matrix )
print("\nClassification Report:\n", Classification_Report )
Accuracy is 98.29%

Confusion Matrix:
 [[3416   61]
 [  16 1007]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99      3477
           1       0.94      0.98      0.96      1023

    accuracy                           0.98      4500
   macro avg       0.97      0.98      0.98      4500
weighted avg       0.98      0.98      0.98      4500
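
A side benefit of the random forest is its impurity-based feature importances, which hint at what drives attrition. A sketch using the fitted model above (the column order matches the X built from df.drop(['left'], axis=1)):

    feature_names = df.drop(['left'], axis=1).columns
    ranked = sorted(zip(feature_names, ranfor_clf.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    # Top five features by impurity-based importance.
    for name, score in ranked[:5]:
        print("{0:<25s} {1:.3f}".format(name, score))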

Support Vector Machines Performance

In [12]:
svm_clf = SVC()
svm_model = svm_clf.fit(X_train, Y_train)
svm_prediction = svm_clf.predict(X_test)

Accuracy = 100*accuracy_score(svm_prediction, Y_test)
Confusion_Matrix = confusion_matrix(svm_prediction, Y_test)
Classification_Report = classification_report(svm_prediction, Y_test)

print("Accuracy is {0:.2f}%\n".format(Accuracy))
print("Confusion Matrix:\n", Confusion_Matrix )
print("\nClassification Report:\n", Classification_Report)
Accuracy is 95.49%

Confusion Matrix:
 [[3340  111]
 [  92  957]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97      3451
           1       0.90      0.91      0.90      1049

    accuracy                           0.95      4500
   macro avg       0.93      0.94      0.94      4500
weighted avg       0.96      0.95      0.96      4500
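
The SVC above runs with default hyperparameters (RBF kernel, C=1.0, gamma='scale'). If more accuracy is needed, a small grid search is the usual next step; a hedged sketch (the grid values are illustrative, and this is slow on ~15,000 rows):

    from sklearn.model_selection import GridSearchCV

    # Coarse grid over regularization strength and RBF kernel width.
    grid = GridSearchCV(SVC(),
                        param_grid={'C': [0.1, 1, 10],
                                    'gamma': ['scale', 0.1, 0.01]},
                        cv=3)
    grid.fit(X_train, Y_train)
    print(grid.best_params_, grid.best_score_)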

KNN Classifier Performance

In [13]:
knn_clf = KNeighborsClassifier()
knn_model = knn_clf.fit(X_train, Y_train)
knn_prediction = knn_clf.predict(X_test)

Accuracy = 100*accuracy_score(knn_prediction, Y_test)
Confusion_Matrix = confusion_matrix(knn_prediction, Y_test)
Classification_Report = classification_report(knn_prediction, Y_test)

print("Accuracy is {0:.2f}%\n".format(Accuracy))
print("Confusion Matrix:\n", Confusion_Matrix )
print("\nClassification Report:\n", Classification_Report)
Accuracy is 94.07%

Confusion Matrix:
 [[3299  134]
 [ 133  934]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96      3433
           1       0.87      0.88      0.87      1067

    accuracy                           0.94      4500
   macro avg       0.92      0.92      0.92      4500
weighted avg       0.94      0.94      0.94      4500
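
KNeighborsClassifier defaults to n_neighbors=5, but the best k is data-dependent. A quick sketch scanning a few odd values of k on the same held-out split:

    # Odd k avoids ties in binary majority voting; pick the value with the
    # best held-out accuracy rather than trusting the default.
    for k in [3, 5, 7, 9, 11]:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
        print(k, round(100*accuracy_score(Y_test, knn.predict(X_test)), 2))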

Result: Random Forest is the best-performing classifier on this split

       Random Forest Accuracy is 98.29%
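
For a side-by-side recap, the four models can also be fit and scored in a single loop over the same split, using only the classes already imported above:

    models = {'Logistic Regression': LogisticRegression(max_iter=1000),
              'Random Forest': RandomForestClassifier(),
              'SVM': SVC(),
              'KNN': KNeighborsClassifier()}
    # Re-fit each model on the shared split and tabulate test accuracy.
    for name, model in models.items():
        model.fit(X_train, Y_train)
        acc = 100*accuracy_score(Y_test, model.predict(X_test))
        print("{0:<20s} {1:.2f}%".format(name, acc))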