A casual exploration of Human Resources Analytics, Part 2: Machine Learning

Goal

In the previous post we explored a dataset of 14999 employees at a hypothetical company, seeking to understand the patterns behind employees leaving said company. We analyzed the data, studied it and visualized many charts in the hope of understanding some of the underlying patterns. In this post, our goal will be to train a machine learning algorithm to solve the company’s problem and predict whether an employee is likely to quit in the foreseeable future. We will use scikit-learn, a Python machine learning library.

We will proceed in a different direction than the previous post, sometimes ignoring the previous analysis so that this post stands on its own, although it’s important to keep in mind that we don’t actually throw away the work already done. Exploring a dataset as we did is really important to gain a better understanding of the context and get a feel for the data we’re going to process. This is a fundamental step when one wishes to implement a successful machine learning algorithm. Such algorithms are powerful on their own, which is the very trademark of machine learning, that is, a machine being able to learn by itself. But their development can be remarkably sped up when the human behind them knows something more about the context and is able to direct the research rather than just proceeding at random. We could say that even if you’re leading a strong and numerous army to war, getting to know the battlefield where you’ll face your enemies will still improve your chances of victory.

Let’s begin by importing the dataset and preprocessing it.

Data pre-processing

We will use pandas to import the dataset.

import pandas as pd
hra = pd.read_csv('HR_comma_sep.csv')
hra.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

We can use the info() method to print a quick description of the data.

hra.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

We have many int features, two floats and two objects. The objects, strings in our case, need to be dealt with before feeding the data to a machine learning algorithm. Let’s print out the different values of the categorical features and their respective counts.

print(hra['salary'].value_counts())
low       7316
medium    6446
high      1237
Name: salary, dtype: int64
hra['sales'].value_counts()
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: sales, dtype: int64

We will deal with these categorical data later.

Let’s take a look at the numerical data with the method describe() which provides us with standard measurements such as count, mean, standard deviation, percentiles, minimum and maximum.

hra.describe()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

We can observe that the numerical features have very different scales, with some, such as last_evaluation, varying between 0.0 and 1.0 and others, such as average_montly_hours, ranging from 96 to 310. We will need to scale those values, since some machine learning algorithms have difficulty dealing with features on very different scales.

From the previous table we can already see that there appears to be no missing data. Let’s check just to be sure.

hra.isnull().any()
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool

Everything is good here, no need to deal with incomplete fields or null values.

In summary, to properly clean our data we need to deal with strings and scales.

Pandas also allows us to easily compute the correlation matrix, which contains the standard correlation coefficient between every pair of features, namely the covariance of the two variables divided by the product of their standard deviations, which assumes a value between +1 and −1. We’re interested in values close to +1 or −1, as they imply strong positive or negative linear correlation respectively, whereas values close to 0 mean the linear correlation is weak. It’s not a definitive answer to our final goal, but it may provide us with some useful insights.
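To make the definition concrete, here is the coefficient for a single pair of columns computed by hand with pandas; it should match the value the correlation matrix reports for satisfaction_level below:

x = hra['satisfaction_level']
y = hra['left']
# Pearson's r: covariance of the pair over the product of the standard deviations
print(x.cov(y) / (x.std() * y.std()))  # ~ -0.388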

corr_matrix = hra.corr()
corr_matrix['left'].sort_values(ascending=False)
left                     1.000000
time_spend_company       0.144822
average_montly_hours     0.071287
number_project           0.023787
last_evaluation          0.006567
promotion_last_5years   -0.061788
Work_accident           -0.154622
satisfaction_level      -0.388375
Name: left, dtype: float64

We can already see confirmation of some assumptions one could easily have made just by reading the dataset description. satisfaction_level has the strongest negative correlation, suggesting unsatisfied employees are very likely to leave (which should not come as a shock), whereas the strongest positive correlation is held by time_spend_company: within this dataset, the more years an employee has spent at the company, the more likely they are to have left.

Feature engineering

Sometimes it helps to analyse the context and smartly play with the features in order to improve the quality of the dataset.

One idea could be to create another variable relating the time spent at the company to the number of projects, telling us on average how much time each employee has spent per project. We do it like this:

hra['projects_per_year'] = hra['time_spend_company']/hra['number_project']
hra[['time_spend_company','number_project','projects_per_year']].head()
time_spend_company number_project projects_per_year
0 3 2 1.500000
1 6 5 1.200000
2 4 7 0.571429
3 5 5 1.000000
4 3 2 1.500000

Or, we could try combining the average monthly hours with the satisfaction level, creating a new variable that should quantify how happy the employee is to be working that much (or that little).

hra['satisfaction_per_hours'] = hra['average_montly_hours']/hra['satisfaction_level']
hra[['average_montly_hours','satisfaction_level','satisfaction_per_hours']].head()
average_montly_hours satisfaction_level satisfaction_per_hours
0 157 0.38 413.157895
1 262 0.80 327.500000
2 272 0.11 2472.727273
3 223 0.72 309.722222
4 159 0.37 429.729730
corr_matrix = hra.corr()
corr_matrix['left'].sort_values(ascending=False)
left                      1.000000
satisfaction_per_hours    0.407226
projects_per_year         0.181695
time_spend_company        0.144822
average_montly_hours      0.071287
number_project            0.023787
last_evaluation           0.006567
promotion_last_5years    -0.061788
Work_accident            -0.154622
satisfaction_level       -0.388375
Name: left, dtype: float64

Our new variables, projects_per_year and satisfaction_per_hours, now have the strongest positive correlations.

The interesting bit is that projects_per_year scores 0.181695 while the features it was built from score only 0.144822 and 0.023787. satisfaction_per_hours, on the other hand, starts from -0.388375, which was already a strong value, and 0.071287, and reaches 0.407226, which looks less impressive considering where it started from.

Transformation

Dealing with strings

We will transform categorical features into one-hot vectors. For those not familiar with the technique, it means creating as many features as there are distinct categories, and setting each of them to 1 rather than 0 only when it is the active category. Let’s consider the salary feature: we saw earlier that it can assume three possible values: low, medium, high. The one-hot encoding will replace the salary column with three new columns named after the possible values. We would go from this:

salary
medium
low
high

to this:

low medium high
0 1 0
1 0 0
0 0 1

To do this, we will use the LabelBinarizer class from the sklearn.preprocessing module.
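Before wiring it into the full pipeline, here is a minimal sketch of what LabelBinarizer does on the salary column (the mapper below will take care of this for us):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
salary_onehot = lb.fit_transform(hra['salary'])  # one column per category
print(lb.classes_)        # column order is alphabetical: ['high' 'low' 'medium']
print(salary_onehot[:3])  # the first three rows as one-hot vectors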

Dealing with Scales

We will use standardization to bring all numerical attributes onto the same scale: each feature has its mean subtracted and is divided by its standard deviation, ending up with zero mean and unit variance. This is performed by scikit’s sklearn.preprocessing.StandardScaler.
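A quick illustration of the effect on a single column (the mapper below will apply the same transformation to every numerical feature):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
hours_scaled = scaler.fit_transform(hra[['average_montly_hours']])
print(hours_scaled.mean(), hours_scaled.std())  # ~0.0 and ~1.0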

Mapping

One problem we face is the data format: pandas produces DataFrames whereas scikit-learn deals with numpy arrays. Luckily, we stumbled upon an interesting library on GitHub which deals with this exact problem by providing a bridge between the two libraries: sklearn-pandas. It allows us to define a mapper to which we feed pandas output and which produces an output suitable for scikit-learn.

We will not produce numpy arrays right away; for now, we will force the mapper to output a DataFrame by setting df_out=True, in order to make the result easier to visualize.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler
import numpy as np

mapper = DataFrameMapper([
    ('left', None),
    (['satisfaction_level'], StandardScaler()),
    (['last_evaluation'], StandardScaler()),
    (['number_project'], StandardScaler()),
    (['average_montly_hours'], StandardScaler()),
    (['time_spend_company'], StandardScaler()),
    (['Work_accident'], StandardScaler()),
    ('sales', LabelBinarizer()),
    ('salary',LabelBinarizer())],
    df_out=True)

So this

hra.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary projects_per_year satisfaction_per_hours
0 0.38 0.53 2 157 3 0 1 0 sales low 1.500000 413.157895
1 0.80 0.86 5 262 6 0 1 0 sales medium 1.200000 327.500000
2 0.11 0.88 7 272 4 0 1 0 sales medium 0.571429 2472.727273
3 0.72 0.87 5 223 5 0 1 0 sales low 1.000000 309.722222
4 0.37 0.52 2 159 3 0 1 0 sales low 1.500000 429.729730

Will become this

hra_prepared_df = mapper.fit_transform(hra)
hra_prepared_df.head()
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
left satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident sales_IT sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_high salary_low salary_medium
0 1.0 -0.936495 -1.087275 -1.462863 -0.882040 -0.341235 -0.411165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1 1.0 0.752814 0.840707 0.971113 1.220423 1.713436 -0.411165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 1.0 -2.022479 0.957554 2.593763 1.420657 0.343655 -0.411165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
3 1.0 0.431041 0.899131 0.971113 0.439508 1.028546 -0.411165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
4 1.0 -0.976716 -1.145699 -1.462863 -0.841993 -0.341235 -0.411165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0

We can also print out and/or save the names of the columns, which will come in handy later when we transform the data into a numpy array and therefore lose this information.

new_columns = mapper.transformed_names_
print(new_columns)
['left', 'satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'sales_IT', 'sales_RandD', 'sales_accounting', 'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng', 'sales_sales', 'sales_support', 'sales_technical', 'salary_high', 'salary_low', 'salary_medium']

The dataset looks ready, so let’s actually turn it into numpy arrays and dive into the fun part.

actual_mapper = DataFrameMapper([
    ('left', None),
    (['satisfaction_level'], StandardScaler()),
    (['last_evaluation'], StandardScaler()),
    (['number_project'], StandardScaler()),
    (['average_montly_hours'], StandardScaler()),
    (['time_spend_company'], StandardScaler()),
    (['Work_accident'], StandardScaler()),
    ('sales', LabelBinarizer()),
    ('salary',LabelBinarizer())])

hra_prepared = actual_mapper.fit_transform(hra)
new_columns = actual_mapper.transformed_names_
hra_data = hra_prepared[:,1:]
hra_target = hra_prepared[:,0]

print(hra_prepared)
[[ 1.         -0.93649469 -1.08727529 ...,  0.          1.          0.        ]
 [ 1.          0.75281433  0.84070693 ...,  0.          0.          1.        ]
 [ 1.         -2.02247906  0.95755433 ...,  0.          0.          1.        ]
 ...,
 [ 1.         -0.97671633 -1.08727529 ...,  0.          1.          0.        ]
 [ 1.         -2.02247906  1.42494396 ...,  0.          1.          0.        ]
 [ 1.         -0.97671633 -1.14569899 ...,  0.          1.          0.        ]]


/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

Classification

This is clearly a binary classification problem, since our final goal is to accurately predict whether an employee is going to leave or not. Machine learning offers many classification algorithms, so let’s try out a few of them and see how they perform.

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Logistic regression

First we will try Logistic Regression. It’s a simple algorithm which computes a weighted sum of the input features, adds a bias term, and outputs the logistic (sigmoid) of the result, which can be read as the probability of the positive class.
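In code, the idea behind a single prediction looks roughly like this; it’s just a sketch, where w and b stand for the weights and bias the model learns during training:

import numpy as np

def logistic(t):
    # the sigmoid squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-t))

# probability of leaving: p = logistic(np.dot(w, x) + b)
# the model predicts class 1 ("left") whenever p >= 0.5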

For each model, we will first run its out-of-the-box version and then, where necessary, tweak some parameters to increase performance.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.79840053  0.792       0.80533333  0.79259753  0.65655218]
Accuracy: 0.77 (+/- 0.11)

This is not that bad, but it most definitely needs some work.

Let’s try to actually add the features we introduced earlier and see if that helps:

actual_mapper = DataFrameMapper([
    ('left', None),
    (['satisfaction_level'], StandardScaler()),
    (['last_evaluation'], StandardScaler()),
    (['satisfaction_per_hours'], StandardScaler()),
    (['projects_per_year'], StandardScaler()),
    (['number_project'], StandardScaler()),
    (['average_montly_hours'], StandardScaler()),
    (['time_spend_company'], StandardScaler()),
    (['Work_accident'], StandardScaler()),
    ('sales', LabelBinarizer()),
    ('salary',LabelBinarizer())])

hra_prepared = actual_mapper.fit_transform(hra)
new_columns = actual_mapper.transformed_names_
hra_data = hra_prepared[:,1:]
hra_target = hra_prepared[:,0]
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

Let’s re-run the default configuration on the new dataset.

log_reg = LogisticRegression()

scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.82505831  0.82366667  0.84066667  0.82027342  0.69856619]
Accuracy: 0.80 (+/- 0.10)

We actually increased accuracy by 0.03! Let’s take it even further and try to find the best parameters for the logistic regressor. We start from the default configuration:

log_reg
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Let’s do a Grid Search to test different parameters and see if we are able to get a better score out of it.

log_reg = LogisticRegression()
parameters = {'penalty':('l1', 'l2'), 'C':[1, 10], 'max_iter':[100,500,1000]}

clf = GridSearchCV(log_reg, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'C': 10, 'max_iter': 100, 'penalty': 'l2'}
log_reg = LogisticRegression(C= 10, max_iter= 100, penalty='l2')

scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.82539154  0.82366667  0.84133333  0.82027342  0.69923308]
Accuracy: 0.80 (+/- 0.10)

That yielded only some minor gains, hardly worth considering. Let’s move on to the next model and hope for something better.

SVM

An SVM essentially finds an optimal separating hyperplane (a maximum-margin boundary, possibly in a kernel-induced space) to determine the class of new data. Let’s run it and see how it performs.
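As a quick illustration (not part of the tuning below), after fitting we can ask the model for its decision function, the signed score telling us on which side of the hyperplane each sample falls; predict simply takes its sign:

from sklearn import svm

clf = svm.SVC().fit(hra_data, hra_target)
print(clf.decision_function(hra_data[:3]))  # positive score => predicted to leave
print(clf.predict(hra_data[:3]))            # the corresponding class labels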

from sklearn import svm

svm_clf = svm.SVC()

scores = cross_val_score(svm_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.96467844  0.96066667  0.95966667  0.95931977  0.95198399]
Accuracy: 0.96 (+/- 0.01)

Not bad at all! It definitely looks more worthwhile to invest energy in tweaking this model than the previous one. The default configuration was:

svm_clf
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Let’s run a grid search.

svm_clf = svm.SVC()
parameters = {'kernel':('linear', 'poly','rbf','sigmoid'), 'C':[1, 10]}

clf = GridSearchCV(svm_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'C': 10, 'kernel': 'rbf'}

It looks like the best estimator has C set to 10, while keeping the rbf kernel. Let’s see its accuracy score:

svm_clf = svm.SVC(C=10.0)

scores = cross_val_score(svm_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.97001     0.96633333  0.962       0.96565522  0.96165388]
Accuracy: 0.97 (+/- 0.01)

The scores do look slightly better, even though it’s no outstanding improvement.

K-nearest neighbors

KNN classifies a new sample by looking up the k closest samples in the training set and taking a majority vote among their labels.
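The idea is simple enough to sketch in a few lines of numpy; a toy illustration with k=5, using the first sample itself as the query point:

# distances from the query point to every sample in the (standardized) dataset
dists = np.linalg.norm(hra_data - hra_data[0], axis=1)
nearest = np.argsort(dists)[:5]           # indices of the 5 closest samples
print(hra_target[nearest].mean() >= 0.5)  # majority vote on the binary target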

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()

scores = cross_val_score(knn_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.96567811  0.94333333  0.94533333  0.95365122  0.95831944]
Accuracy: 0.95 (+/- 0.02)

Definitely promising. Let’s see whether we can get it to perform even better than the standard configuration:

knn_clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
knn_clf = KNeighborsClassifier()
parameters = {'n_neighbors':(3, 5,7,9), 'weights':('uniform', 'distance')}

clf = GridSearchCV(knn_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'n_neighbors': 7, 'weights': 'distance'}
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')

scores = cross_val_score(knn_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98267244  0.95433333  0.959       0.97999333  0.97865955]
Accuracy: 0.97 (+/- 0.02)

We achieved an improvement of 0.02.

Decision Tree

Decision trees make for a very interesting model. As the name suggests, a decision tree builds a tree of decisions based on the values of the features. It’s really interesting because it’s easy to visualize and understand the model once it’s fit to the data, which is exactly what we will do.

from sklearn import tree

tree_clf = tree.DecisionTreeClassifier()

scores = cross_val_score(tree_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98067311  0.96533333  0.97033333  0.98632878  0.97665889]
Accuracy: 0.98 (+/- 0.01)
tree_clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
tree_clf = tree.DecisionTreeClassifier()
parameters = {'criterion':('gini','entropy'),
              'max_depth':(2,4,6,8,10,None),
              'min_samples_split':(2,5,10),
              'min_samples_leaf': (1,3,5)}

clf = GridSearchCV(tree_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5}
tree_clf = tree.DecisionTreeClassifier(min_samples_leaf=1, min_samples_split=5,
                                       criterion='gini', max_depth=10)

scores = cross_val_score(tree_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98400533  0.97666667  0.976       0.98399466  0.97432477]
Accuracy: 0.98 (+/- 0.01)

Now we may wish to visualize the decision tree that’s been produced. For the sake of visualization, we will limit depth to 3.

tree_clf = tree.DecisionTreeClassifier(max_depth=3)
tree_clf.fit(hra_data, hra_target);

import graphviz
dot_data = tree.export_graphviz(tree_clf, out_file=None,
                         feature_names=new_columns[1:],  
                         class_names=['stay','left'],  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph.render("hra")
graph

(Figure: the rendered decision tree, limited to depth 3.)

The output of a decision tree gives us many powerful insights. We can see that the most important parameter for the first split is whether satisfaction_level is above or below -0.595 (remember the features are standardized). If the value is below that mark, we look at number_project to make a further split; otherwise, time_spend_company becomes more telling.

For these reasons, decision trees are more than just a powerful algorithm for this task; they also make for a powerful data analysis support tool.
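In the same spirit, a fitted tree exposes a feature_importances_ attribute which gives a quick quantitative ranking of the features, complementing the chart above (here using the depth-3 tree we just fitted):

importances = sorted(zip(new_columns[1:], tree_clf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
print(importances[:5])  # the top features should echo the splits in the chart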

Random Forest

Random Forests are a typical ensemble machine learning algorithm, made up of many Decision Trees tweaked to make more random choices when it comes to splitting. The result is a forest of diverse, decorrelated trees whose aggregated vote should make for much more powerful predictions. Let’s see if that holds.
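Conceptually, the forest simply aggregates the votes of its trees. As a minimal sketch, we can fit a forest and peek at its individual trees through the estimators_ attribute:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier().fit(hra_data, hra_target)
votes = np.mean([t.predict(hra_data[:3]) for t in rf.estimators_], axis=0)
print(votes)  # fraction of trees voting 'left' for the first three samples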

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()

scores = cross_val_score(forest_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.99733422  0.97666667  0.98166667  0.99666556  0.99533178]
Accuracy: 0.99 (+/- 0.02)

Outstanding. The Random Forest Classifier performed close to the maximum without even having to tweak it. Let’s see if we can take it even further.

forest_clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
forest_clf = RandomForestClassifier()
parameters = {'criterion':('gini','entropy'),
              'max_depth':(2,4,6,8,10,None),
              'min_samples_split':(2,5,10),
              'min_samples_leaf': (1,3,5)}

clf = GridSearchCV(forest_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
forest_clf = RandomForestClassifier(criterion='entropy', max_depth=None, min_samples_leaf=1, min_samples_split=2)

scores = cross_val_score(forest_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.99733422  0.97966667  0.98266667  0.99666556  0.99566522]
Accuracy: 0.99 (+/- 0.02)

Now that we have hopefully obtained the best version of each model, we want to compare them on a wider range of performance metrics, using scikit’s metrics package.

Performance Evaluation

Rather than a single train/test split, we will evaluate each model with cross-validated predictions over the whole dataset, using cross_val_predict.

from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

And set up all the models with the parameters found earlier.

models = {}
log_reg = LogisticRegression(C= 10, max_iter= 100, penalty='l2')
models["Logistic Regression"] = log_reg

svm_clf = svm.SVC(C=10.0)
models["SVM"] = svm_clf

knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
models["KNN"] = knn_clf

tree_clf = tree.DecisionTreeClassifier(min_samples_leaf=1, min_samples_split=5,
                                       criterion='gini', max_depth=10)
models["Decision Tree"] = tree_clf

forest_clf = RandomForestClassifier(criterion='entropy', max_depth=None,
                                    min_samples_leaf=1, min_samples_split=2)
models["Random Forest"] = forest_clf

Scores

We will use the following scores:

Precision. The ratio true positives / (true positives + false positives).

Recall. The ratio true positives / (true positives + false negatives).

F1. The harmonic mean of precision and recall, computed as 2 * (precision * recall) / (precision + recall).

MCC. The Matthews correlation coefficient, in essence a correlation coefficient between -1 and +1: +1 represents a perfect prediction, 0 an average random prediction, and -1 a totally inverse prediction (see the sketch after this list).
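Since MCC is probably the least familiar of the four, here is its definition in terms of the confusion matrix, checked against scikit’s own implementation on a toy example:

from sklearn.metrics import confusion_matrix, matthews_corrcoef
import math

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc, matthews_corrcoef(y_true, y_pred))  # the two values should agree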

Let’s use scikit’s classification_report to get a quick report on precision, recall, f1-score for each class.

for name, model in models.items():
    y_pred = cross_val_predict(model, hra_data,hra_target,cv=5)
    print(name)
    print(classification_report(hra_target, y_pred))
Logistic Regression
             precision    recall  f1-score   support

        0.0       0.84      0.92      0.88     11428
        1.0       0.62      0.44      0.51      3571

avg / total       0.79      0.80      0.79     14999

SVM
             precision    recall  f1-score   support

        0.0       0.97      0.98      0.98     11428
        1.0       0.94      0.92      0.93      3571

avg / total       0.96      0.97      0.97     14999

KNN
             precision    recall  f1-score   support

        0.0       0.99      0.97      0.98     11428
        1.0       0.91      0.97      0.94      3571

avg / total       0.97      0.97      0.97     14999

Decision Tree
             precision    recall  f1-score   support

        0.0       0.98      0.99      0.99     11428
        1.0       0.97      0.95      0.96      3571

avg / total       0.98      0.98      0.98     14999

Random Forest
             precision    recall  f1-score   support

        0.0       0.99      1.00      0.99     11428
        1.0       0.99      0.97      0.98      3571

avg / total       0.99      0.99      0.99     14999

It’s important to have a sense of what each model is doing on each class, since averaging might hide some worrisome imbalance. E.g. we can see that the Logistic Regression’s recall score of 0.80 is actually the weighted average of a very good 0.92 on the 0 class and a quite bad 0.44 on the 1 class.
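The class imbalance behind those averages is easy to quantify directly from the target column (the support figures in the reports above tell the same story: roughly 76% of employees stayed and 24% left):

print(hra['left'].value_counts(normalize=True))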

Now we wish to visualize the scores on a chart.

Let’s compute all four scores for every classifier.

from sklearn.metrics import f1_score, recall_score, matthews_corrcoef, precision_score

scores = {}
scores['f1'] = []
scores['recall'] = []
scores['mcc'] = []
scores['precision'] = []
for name, model in models.items():
    y_pred = cross_val_predict(model, hra_data,hra_target,cv=5)
    scores['f1'].append(f1_score(hra_target, y_pred))
    scores['recall'].append(recall_score(hra_target, y_pred))
    scores['mcc'].append(matthews_corrcoef(hra_target, y_pred))
    scores['precision'].append(precision_score(hra_target, y_pred))

Let’s tweak some parameters and write a few helper functions to make the charts more aesthetically pleasing.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={"figure.figsize": (20, 12)})
sns.set_style("whitegrid")
colors = sns.color_palette()

classifier_arr = range(5)

def autolabel(rects, ax):
    # annotate each bar with its (integer) height on the given axes
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')


def setupChart(ax, scores, title, ymin, ytextoffset):
    # draw one bar per classifier and label each visible bar with its rounded score
    ax.set_title(title, fontsize=30, fontweight='bold')
    ax.set_xticks(classifier_arr)
    ax.set_xticklabels(models.keys(), {'fontsize': 12})
    rects = ax.bar(classifier_arr, scores, align='center', color=colors)
    for i, v in enumerate(scores):
        if v > ymin:  # only label bars that fall inside the current y-range
            ax.text(i-.1, v-ytextoffset, str(np.round(v, 2)), color='white', fontweight='bold')
    ax.set(ylim=(ymin, 1.0))

Finally, let’s visualize all the scores

f, ((ax1, ax2),(ax3,ax4)) = plt.subplots(2,2)
setupChart(ax1,scores['precision'], 'PRECISION',.0,.1)
setupChart(ax2,scores['recall'],'RECALL',.0,.1)
setupChart(ax3,scores['f1'],'F1',.0,.1)
setupChart(ax4,scores['mcc'],'MCC',.0,.1)

(Figure: bar charts of precision, recall, F1 and MCC for the five models.)

The logistic regressor is the only one trailing well below the others. Let’s zoom in on where the real competition is happening, between 0.9 and 1.0:

f, ((ax1, ax2),(ax3,ax4)) = plt.subplots(nrows=2, ncols=2)
setupChart(ax1,scores['precision'], 'PRECISION',0.9,.01)
setupChart(ax2,scores['recall'],'RECALL',0.9,.01)
setupChart(ax3,scores['f1'],'F1',0.9,.01)
setupChart(ax4,scores['mcc'],'MCC',0.9,.01)

(Figure: the same four charts, zoomed in on the 0.9–1.0 range.)

We can clearly see that the Random Forest is the best according to almost every metric, except recall, where it is slightly outperformed by KNN. Its overall performance, though, is superior enough to justify the final choice of this model.

Conclusions

This second post was really interesting to write, and ultimately proved a few points. Firstly, most of the deductions we were able to make in the previous post with the mere use of statistical data visualization were confirmed by the stronger tools used here.

Additionally, while optimization is obviously important, this post also demonstrated how most of the algorithms already performed remarkably well in their default configurations. It’s worth keeping in mind that this is the result of neither randomness nor luck, but rather of well-thought-out default configurations chosen by the programmers behind the library.

Last but not least, even a simple feature engineering idea was able to increase the performance of a model, which is further evidence of how fundamental it is to understand the context and build enough confidence to play around with it and eventually add value to the data itself.

See you in the next post!