Case Study (Human Resources Retention):

Random Forest Feature Importance

- techniques that assign scores to input features based on how useful they are at predicting the target variable
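Before diving into the case study, a minimal sketch of the idea on synthetic data (the dataset and parameters below are illustrative, not the HR data): a random forest's impurity-based importances sum to one, and the features that actually carry signal score highest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, only the first 2 carry signal
# (shuffle=False keeps the informative features in the leading columns).
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```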
In [16]:
import pandas as pd

df = pd.read_csv('hr_satisfaction.csv')


df.head()
Out[16]:
number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years satisfaction_level last_evaluation department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium
0 2 157 3 0 1 0 0.38 0.53 0 0 0 0 0 0 1 0 0 1 0
1 5 262 6 0 1 0 0.80 0.86 0 0 0 0 0 0 1 0 0 0 1
2 7 272 4 0 1 0 0.11 0.88 0 0 0 0 0 0 1 0 0 0 1
3 5 223 5 0 1 0 0.72 0.87 0 0 0 0 0 0 1 0 0 1 0
4 2 159 3 0 1 0 0.37 0.52 0 0 0 0 0 0 1 0 0 1 0
In [17]:
from sklearn.model_selection import train_test_split

X = df.drop(['left'],axis=1).values
Y = df['left'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
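One caveat on the split above: without a fixed `random_state` the train/test partition (and therefore every number below) changes on each run, and since leavers are the minority class, stratifying on the label keeps the class ratio identical in both halves. A sketch with toy labels (the 80/20 class mix below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80% stayed (0), 20% left (1).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify preserves the 80/20 ratio in both splits; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(y_tr.mean(), y_te.mean())  # both 0.2: the class ratio is preserved
```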
In [18]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
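Two notes on this step. Random forests split on feature thresholds, so standardization does not change their predictions; the scaling is harmless here but not required. What does matter is the pattern used above: fit the scaler on the training set only and reuse those statistics on the test set, otherwise test-set information leaks into preprocessing. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

sc = StandardScaler()
train_s = sc.fit_transform(train)  # learn mean/std from the training data only
test_s = sc.transform(test)        # reuse the training statistics; never refit on test

# Training columns come out exactly standardized; test columns only roughly,
# because they were scaled with the training set's mean and std.
print(train_s.mean(axis=0).round(3), train_s.std(axis=0).round(3))
```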
In [19]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
In [20]:
ranfor_clf = RandomForestClassifier()
ranfor_clf.fit(X_train, Y_train)   # fit returns the estimator itself, so no separate variable is needed
ranfor_prediction = ranfor_clf.predict(X_test)

Accuracy = 100*accuracy_score(Y_test, ranfor_prediction)        # sklearn metrics take (y_true, y_pred)
Confusion_Matrix = confusion_matrix(Y_test, ranfor_prediction)
Classification_Report = classification_report(Y_test, ranfor_prediction)

print("Accuracy is {0:.2f}%\n".format(Accuracy))
print("Confusion Matrix:\n", Confusion_Matrix)
print("\nClassification Report:\n", Classification_Report)
Accuracy is 98.98%

Confusion Matrix:
 [[3386    7]
 [  39 1068]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99      3393
           1       0.99      0.96      0.98      1107

    accuracy                           0.99      4500
   macro avg       0.99      0.98      0.99      4500
weighted avg       0.99      0.99      0.99      4500
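The precision and recall in the report are just ratios read off the confusion matrix. A small sketch with toy labels (not the HR data) showing the arithmetic, and the (y_true, y_pred) argument order that sklearn's metrics expect:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, purely to show how the report's numbers fall out of the matrix.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])

# Note the argument order: sklearn's convention is (y_true, y_pred).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of predicted leavers, how many actually left
recall = tp / (tp + fn)      # of actual leavers, how many the model caught

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.80 each here
```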

Feature Importance

In [55]:
import numpy as np

feature_names = df.drop(['left'], axis=1).columns   # X was built with .values, which discards column names

feature_importances = pd.DataFrame(ranfor_clf.feature_importances_,
                                   index=feature_names,
                                   columns=['importance']).sort_values('importance', ascending=False)

feature_importances
Out[55]:
importance
satisfaction_level 0.313074
number_project 0.188460
time_spend_company 0.181018
average_montly_hours 0.145542
last_evaluation 0.127543
Work_accident 0.010874
salary_low 0.008287
salary_medium 0.004291
department_sales 0.003453
department_technical 0.003449
department_support 0.003088
department_management 0.001817
promotion_last_5years 0.001700
department_hr 0.001691
department_RandD 0.001616
department_accounting 0.001609
department_marketing 0.001371
department_product_mng 0.001117
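Impurity-based importances like the ones above are computed on the training data and can overstate continuous or high-cardinality features. A common cross-check is permutation importance on held-out data: shuffle one feature at a time and measure how much the test score drops. A sketch on synthetic data (a stand-in, since the HR csv may not be at hand):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, the first 3 informative (shuffle=False).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the *test* set n_repeats times; the mean drop in
# accuracy is that feature's permutation importance.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```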