Goal
In the previous post we explored a dataset of 14999 employers data in an hypothetical company seeking to understand a pattern behind employees leaving said company. We analyzed the data, studied it and visualized many charts in hope to understand some of the underlying patterns. In this post, our goal will be to train a Machine Learning algorithm to solve the company’s problem and be able to predict whether an employee is likely to quit his job in the foreseeable future. We will use scikit-learn, a python machine learning library.
We will proceed in a different direction than the previous post, sometimes ignoring previous analysis for the sake of this post being significant on its own, although it’s important to keep in mind that the we don’t actually throw away the previously done work. Exploring a dataset as we did is really important to gain a better understanding of the context and get a feel of the data we’re going to process. This is a fundamental operation when one wish to implement a successful machine learning algorithm. Such algorithms are powerful on their own, which is the very trademark of machine learning, that is, a machine being able to learn by itself. But its development can be remarkably sped up when the human behind it knows something more about the context and is able to direct the research rather than just proceding at random. We could say that even if you’re leading to war a strong and numerous army, getting to know the battlefield where you’ll face your enemies will still improve you chances of victory.
Let’s begin by importing the dataset and preprocessing it.
Data pre-processing
We will use pandas to import the dataset.
import pandas as pd
hra = pd.read_csv('HR_comma_sep.csv')
hra.head()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
We can use the info()
method to print a quick description of the data.
hra.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level 14999 non-null float64
last_evaluation 14999 non-null float64
number_project 14999 non-null int64
average_montly_hours 14999 non-null int64
time_spend_company 14999 non-null int64
Work_accident 14999 non-null int64
left 14999 non-null int64
promotion_last_5years 14999 non-null int64
sales 14999 non-null object
salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
We have many int
features, two floats
and two objects
. The objects, strings in our case, needs to be dealt with before feeding the data to a Machine Learning algorithm. Let’s print out the different values of the category features and their respective counts.
print(hra['salary'].value_counts())
low 7316
medium 6446
high 1237
Name: salary, dtype: int64
hra['sales'].value_counts()
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: sales, dtype: int64
We will deal with these categorical data later.
Let’s take a look at the numerical data with the method describe()
which provides us with standard measurements such as count, mean, standard deviation, percentiles, minimum and maximum.
hra.describe()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | |
---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
We can observe that the numerical values have very different scales, with parameters such as last_evaluation
varying from 0.0 to 1.0 and others such as average_montly_hours
ranging from 96 to 310. We will need to scale those values since some machine learning algorithms have difficulties in dealing with different scaled numbers.
From the previous table we can already see that there appears to be no missing data. Let’s check just be sure
hra.isnull().any()
satisfaction_level False
last_evaluation False
number_project False
average_montly_hours False
time_spend_company False
Work_accident False
left False
promotion_last_5years False
sales False
salary False
dtype: bool
Everything is good here, no need to deal with incomplete fields or null values.
In synthesis, to properly clean our data we need to deal with strings and scales.
Pandas also allows us to easily compute the correlation matrix, which contains the standard correlation coefficient between every pair of features, namely the the covariance of the two variables divided by the product of their standard deviations, which will assume a value between +1 and −1. We’re interested in values close to 1 or -1 as they imply respectively strong positive or negative linear correlation, whereas values close to 0 means correlation is really low. It’s not a definitive answer to our final goal but it may provide us some useful insights.
corr_matrix = hra.corr()
corr_matrix['left'].sort_values(ascending=False)
left 1.000000
time_spend_company 0.144822
average_montly_hours 0.071287
number_project 0.023787
last_evaluation 0.006567
promotion_last_5years -0.061788
Work_accident -0.154622
satisfaction_level -0.388375
Name: left, dtype: float64
We can already see confirmation of some assumptions one could have easily imagined just by reading the dataset description. satisfaction_level
has the strongest negative correlation, suggesting unsatisfied employees are very likely to leave (which should not come as a shock), whereas the strongest positive correlation is held by time_spend_company
, which means employees are more likely to leave in the first years at this company rather then further down in their career.
Feature engineering
Sometimes it helps to analyse the context and smartly play with the features in order to improve the quality of the dataset.
One idea could be to create another variable which tells us how many projects per year has each employee worked on average. We do it like this:
hra['projects_per_year'] = hra['time_spend_company']/hra['number_project']
hra[['time_spend_company','number_project','projects_per_year']].head()
time_spend_company | number_project | projects_per_year | |
---|---|---|---|
0 | 3 | 2 | 1.500000 |
1 | 6 | 5 | 1.200000 |
2 | 4 | 7 | 0.571429 |
3 | 5 | 5 | 1.000000 |
4 | 3 | 2 | 1.500000 |
Or, we could try combining the average monthly hours with the satisfaction level, which creates a new variable that should quantify how happy is the employee to be working that much/that little.
hra['satisfaction_per_hours'] = hra['average_montly_hours']/hra['satisfaction_level']
hra[['average_montly_hours','satisfaction_level','satisfaction_per_hours']].head()
average_montly_hours | satisfaction_level | satisfaction_per_hours | |
---|---|---|---|
0 | 157 | 0.38 | 413.157895 |
1 | 262 | 0.80 | 327.500000 |
2 | 272 | 0.11 | 2472.727273 |
3 | 223 | 0.72 | 309.722222 |
4 | 159 | 0.37 | 429.729730 |
corr_matrix = hra.corr()
corr_matrix['left'].sort_values(ascending=False)
left 1.000000
satisfaction_per_hours 0.407226
projects_per_year 0.181695
time_spend_company 0.144822
average_montly_hours 0.071287
number_project 0.023787
last_evaluation 0.006567
promotion_last_5years -0.061788
Work_accident -0.154622
satisfaction_level -0.388375
Name: left, dtype: float64
Our new variables, projects_per_year
and satisfaction_hours
, have the strongest positive correlations.
The interesting bit is that projects_per_year
scores 0.181695
with its generative features scoring only 0.144822
and 0.023787
. satisfaction_hours
on the other hand starts from -0.388375
, which was already a strong value, and 0.071287
to score a value of 0.407226
which looks less impressive considering what it started from.
Transformation
Dealing with strings
We will transform categorical features into one-hot vectors. For those not familiar with it, it means creating as many features as there exists different categories and have each one of these set to 1 rather than 0 only when it’s the active category.
Let’s consider the salary
feature, we earlier saw that it can assume three possible value: low
, medium
, high
. The one-hot encoding will replace the salary
column with three new columns named after the possible values. We would go from this :
… | salary | … |
---|---|---|
… | medium | … |
… | low | … |
… | high | … |
to this:
… | low | medium | high | … |
---|---|---|---|---|
… | 0 | 1 | 0 | … |
… | 1 | 0 | 0 | … |
… | 0 | 0 | 1 | … |
To do this, we will use the LabelBinarizer
component from sklearn.preprocessing
library.
Dealing with Scales
We will use standardization to scale all numberical attributes to the same range, as performed by scikit’s tool sklearn.preprocessing.StandardScaler
Mapping
One problem we face is the data format we are using: Pandas
produces DataFrame
whereas scikit-learn
deals with numpy
arrays.
Luckily, we stumbled upon this interesting library on github which deals with this exact problem by providing a bridge between the two libraries: sklearn-pandas. It allows us to define a map which we will feed Pandas
output to and will produce an output suitable for scikit-learn
.
We will not do so as of right now, we will force the Mapper to output a DataFrame by setting df_out=True
in order to ease its visualization.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler
import numpy as np
mapper = DataFrameMapper([
('left', None),
(['satisfaction_level'], StandardScaler()),
(['last_evaluation'], StandardScaler()),
(['number_project'], StandardScaler()),
(['average_montly_hours'], StandardScaler()),
(['time_spend_company'], StandardScaler()),
(['Work_accident'], StandardScaler()),
('sales', LabelBinarizer()),
('salary',LabelBinarizer())],
df_out=True)
So this
hra.head()
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary | projects_per_year | satisfaction_per_hours | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low | 1.500000 | 413.157895 |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium | 1.200000 | 327.500000 |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium | 0.571429 | 2472.727273 |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low | 1.000000 | 309.722222 |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low | 1.500000 | 429.729730 |
Will become this
hra_prepared_df = mapper.fit_transform(hra)
hra_prepared_df.head()
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
left | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | sales_IT | sales_RandD | sales_accounting | sales_hr | sales_management | sales_marketing | sales_product_mng | sales_sales | sales_support | sales_technical | salary_high | salary_low | salary_medium | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | -0.936495 | -1.087275 | -1.462863 | -0.882040 | -0.341235 | -0.411165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 1.0 | 0.752814 | 0.840707 | 0.971113 | 1.220423 | 1.713436 | -0.411165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | -2.022479 | 0.957554 | 2.593763 | 1.420657 | 0.343655 | -0.411165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 1.0 | 0.431041 | 0.899131 | 0.971113 | 0.439508 | 1.028546 | -0.411165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 1.0 | -0.976716 | -1.145699 | -1.462863 | -0.841993 | -0.341235 | -0.411165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
We can also print out and/or save the names of the columns, which will come in handy when we actually transform it into a numpy array, therefore losing this information.
new_columns = mapper.transformed_names_
print(new_columns)
['left', 'satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'sales_IT', 'sales_RandD', 'sales_accounting', 'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng', 'sales_sales', 'sales_support', 'sales_technical', 'salary_high', 'salary_low', 'salary_medium']
The dataset looks ready, so let’s actually turn it into numpy arrays and dive into the fun part.
actual_mapper = DataFrameMapper([
('left', None),
(['satisfaction_level'], StandardScaler()),
(['last_evaluation'], StandardScaler()),
(['number_project'], StandardScaler()),
(['average_montly_hours'], StandardScaler()),
(['time_spend_company'], StandardScaler()),
(['Work_accident'], StandardScaler()),
('sales', LabelBinarizer()),
('salary',LabelBinarizer())])
hra_prepared = actual_mapper.fit_transform(hra)
new_columns = actual_mapper.transformed_names_
hra_data = hra_prepared[:,1:]
hra_target = hra_prepared[:,0]
print(hra_prepared)
[[ 1. -0.93649469 -1.08727529 ..., 0. 1. 0. ]
[ 1. 0.75281433 0.84070693 ..., 0. 0. 1. ]
[ 1. -2.02247906 0.95755433 ..., 0. 0. 1. ]
...,
[ 1. -0.97671633 -1.08727529 ..., 0. 1. 0. ]
[ 1. -2.02247906 1.42494396 ..., 0. 1. 0. ]
[ 1. -0.97671633 -1.14569899 ..., 0. 1. 0. ]]
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
Classification
This is clearly a binary classification problem, since our final goal would be to be accurately predict whether an employee is going to leave or not. Machine learning offers many classification algorithm, so let’s try out a few ones and see how they perform.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
Logistic regression
First we will try Logistic Regression. It’s a simple algorithm which computes a weighted sum of the input features, sum them with a bias term and output the logistic of its output.
For each model we will run, we will first use its out of the box version and then eventually tweak some parameters to increase performance where necessary.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.79840053 0.792 0.80533333 0.79259753 0.65655218]
Accuracy: 0.77 (+/- 0.11)
This is not that bad, but it most definitely needs some work.
Let’s try to actually add the features we introduced earlier and see if that helps:
actual_mapper = DataFrameMapper([
('left', None),
(['satisfaction_level'], StandardScaler()),
(['last_evaluation'], StandardScaler()),
(['satisfaction_per_hours'], StandardScaler()),
(['projects_per_year'], StandardScaler()),
(['number_project'], StandardScaler()),
(['average_montly_hours'], StandardScaler()),
(['time_spend_company'], StandardScaler()),
(['Work_accident'], StandardScaler()),
('sales', LabelBinarizer()),
('salary',LabelBinarizer())])
hra_prepared = actual_mapper.fit_transform(hra)
new_columns = actual_mapper.transformed_names_
hra_data = hra_prepared[:,1:]
hra_target = hra_prepared[:,0]
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
Let’s re-run the default configuration on the new dataset
log_reg = LogisticRegression()
scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.82505831 0.82366667 0.84066667 0.82027342 0.69856619]
Accuracy: 0.80 (+/- 0.10)
We actually increased accuracy by 0.03! Let’s take it even further and try to find the best parameters for the logistic regressor. We start from the default configuration:
log_reg
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Let’s do a Grid Search to test different parameters and see if we are able to get a better score out of it.
log_reg = LogisticRegression()
parameters = {'penalty':('l1', 'l2'), 'C':[1, 10], 'max_iter':[100,500,1000]}
clf = GridSearchCV(log_reg, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'C': 10, 'max_iter': 100, 'penalty': 'l2'}
log_reg = LogisticRegression(C= 10, max_iter= 100, penalty='l2')
scores = cross_val_score(log_reg, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.82539154 0.82366667 0.84133333 0.82027342 0.69923308]
Accuracy: 0.80 (+/- 0.10)
That only obtained some minor results that are not even worth considering. Let’s move to the next model and hope for something better.
SVM
SVM basically optimizes a separating hyperplane to determine the class of new data. Let’s run it and see how it performs
from sklearn import svm
svm_clf = svm.SVC()
scores = cross_val_score(svm_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.96467844 0.96066667 0.95966667 0.95931977 0.95198399]
Accuracy: 0.96 (+/- 0.01)
Not bad at all! It definitely looks more worth to invest energy into tweaking this one than the precedent one. The default configuration was:
svm_clf
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Let’s run a grid search.
svm_clf = svm.SVC()
parameters = {'kernel':('linear', 'poly','rbf','sigmoid'), 'C':[1, 10]}
clf = GridSearchCV(svm_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'C': 10, 'kernel': 'rbf'}
It looks like the best estimator has C
set to 10, while keeping the rbf
kernel. Let’s see its accuracy score:
svm_clf = svm.SVC(C=10.0)
scores = cross_val_score(svm_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.97001 0.96633333 0.962 0.96565522 0.96165388]
Accuracy: 0.97 (+/- 0.01)
It does look slightly better on the scores, even though we got no outstanding result.
K-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
scores = cross_val_score(knn_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.96567811 0.94333333 0.94533333 0.95365122 0.95831944]
Accuracy: 0.95 (+/- 0.02)
Definitely promising, let’s see how we can get it to perform even better than the standard configuration:
knn_clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
knn_clf = KNeighborsClassifier()
parameters = {'n_neighbors':(3, 5,7,9), 'weights':('uniform', 'distance')}
clf = GridSearchCV(knn_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'n_neighbors': 7, 'weights': 'distance'}
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
scores = cross_val_score(knn_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98267244 0.95433333 0.959 0.97999333 0.97865955]
Accuracy: 0.97 (+/- 0.02)
We achieved an improvement of 0.02
.
Decision Tree
Decision trees makes for very interesting model. Being pretty self-explanatory, it builds a tree of decisions based on the values of the features. It’s really interesting because it’s easy to visualize and understand the model once it’s fit to the data, which we will do.
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()
scores = cross_val_score(tree_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98067311 0.96533333 0.97033333 0.98632878 0.97665889]
Accuracy: 0.98 (+/- 0.01)
tree_clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
tree_clf = tree.DecisionTreeClassifier()
parameters = {'criterion':('gini','entropy'),
'max_depth':(2,4,6,8,10,None),
'min_samples_split':(2,5,10),
'min_samples_leaf': (1,3,5)}
clf = GridSearchCV(tree_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5}
tree_clf = tree.DecisionTreeClassifier(min_samples_leaf=1,min_samples_split=2,
criterion='gini', max_depth=10)
scores = cross_val_score(tree_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.98400533 0.97666667 0.976 0.98399466 0.97432477]
Accuracy: 0.98 (+/- 0.01)
Now we may wish to visualize the decision tree that’s been produced. For the sake of visualization, we will limit depth to 3.
tree_clf = tree.DecisionTreeClassifier(max_depth=3)
tree_clf.fit(hra_data, hra_target);
import graphviz
dot_data = tree.export_graphviz(tree_clf, out_file=None,
feature_names=new_columns[1:],
class_names=['stay','left'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("hra")
graph
The output of a decision tree gives us many powerful insights. We can see that the most important parameter to determine the first split is wheter satisfaction_level
is above or below -0.595. If said value was below the proposed mark, we look at number_project
to make an ulterior split, otherwise time_spend_company
becomes more telling.
For these reason, the decision trees are more than just a powerful algorithm to perform this task, they also make for a powerful data analysis support tool.
Random Forest
Random Forest are a typical ensemble machine learning algorithm, constitued of many Decision Trees tweaked to make more random decisions when it comes to splitting. The result is a homogenous forest which should make for much powerful decisions. Let’s prove that.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier()
scores = cross_val_score(forest_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.99733422 0.97666667 0.98166667 0.99666556 0.99533178]
Accuracy: 0.99 (+/- 0.02)
Outstanding. The Random Forest Classifier performed close to the maximum without even having to tweak it. Let’s see if we can take it even further.
forest_clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
forest_clf = RandomForestClassifier()
parameters = {'criterion':('gini','entropy'),
'max_depth':(2,4,6,8,10,None),
'min_samples_split':(2,5,10),
'min_samples_leaf': (1,3,5)}
clf = GridSearchCV(forest_clf, parameters)
clf.fit(hra_data, hra_target);
print(clf.best_params_)
{'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
forest_clf = RandomForestClassifier(criterion='gini', max_depth=None, min_samples_leaf=1, min_samples_split=2)
scores = cross_val_score(forest_clf, hra_data, hra_target, cv=5)
print("Scores:",scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Scores: [ 0.99733422 0.97966667 0.98266667 0.99666556 0.99566522]
Accuracy: 0.99 (+/- 0.02)
Now that we hopefully managed to get the best version of each model, we wish to compare them on a wider range of performance metrics, using scikit’s metrics
package.
Performance Evaluation
We will split the dataset into a train set and a test set.
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
And fit all the models to the training data.
models = {}
log_reg = LogisticRegression(C= 10, max_iter= 100, penalty='l2')
models["Logistic Regression"] = log_reg
svm_clf = svm.SVC(C=10.0)
models["SVM"] = svm_clf
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
models["KNN"] = knn_clf
tree_clf = tree.DecisionTreeClassifier(min_samples_leaf=1,min_samples_split=2,
criterion='gini', max_depth=10)
models["Decision Tree"] = tree_clf
forest_clf = RandomForestClassifier(criterion='gini', max_depth=None,
min_samples_leaf=1, min_samples_split=2)
models["Random Forest"] = forest_clf
Scores
We will use the following scores:
Precision. The ratio true positives / (true positives + false positives).
Recall. The ratio true positives / (true positives + false negatives).
F1. This is computed as 2 * (precision * recall) / (precision + recall)
MCC. The Matthews correlation coefficient, that is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
Let’s use scikit’s classification_report
to get a quick report on precision, recall, f1-score for each class.
for name, model in models.items():
y_pred = cross_val_predict(model, hra_data,hra_target,cv=5)
print(name)
print(classification_report(hra_target, y_pred))
Logistic Regression
precision recall f1-score support
0.0 0.84 0.92 0.88 11428
1.0 0.62 0.44 0.51 3571
avg / total 0.79 0.80 0.79 14999
SVM
precision recall f1-score support
0.0 0.97 0.98 0.98 11428
1.0 0.94 0.92 0.93 3571
avg / total 0.96 0.97 0.97 14999
KNN
precision recall f1-score support
0.0 0.99 0.97 0.98 11428
1.0 0.91 0.97 0.94 3571
avg / total 0.97 0.97 0.97 14999
Decision Tree
precision recall f1-score support
0.0 0.98 0.99 0.99 11428
1.0 0.97 0.95 0.96 3571
avg / total 0.98 0.98 0.98 14999
Random Forest
precision recall f1-score support
0.0 0.99 1.00 0.99 11428
1.0 0.99 0.97 0.98 3571
avg / total 0.99 0.99 0.99 14999
It’s important to have a sense of what each model is doing on each class, averaging it out might hide some worrisome imbalance. E.g. we can see that the Logistic Regression’s recall score of 0.8 it’s actually the result of the average of a very good score of 0.92 on the 0
class and a quite bad one on the 1
class (0.44).
Now we wish to visualize the scores on a chart.
Let’s write a reusable function to compute the scores for all classifiers.
from sklearn.metrics import f1_score, recall_score, matthews_corrcoef, precision_score
scores = {}
scores['f1'] = []
scores['recall'] = []
scores['mcc'] = []
scores['precision'] = []
for name, model in models.items():
y_pred = cross_val_predict(model, hra_data,hra_target,cv=5)
scores['f1'].append(f1_score(hra_target, y_pred))
scores['recall'].append(recall_score(hra_target, y_pred))
scores['mcc'].append(matthews_corrcoef(hra_target, y_pred))
scores['precision'].append(precision_score(hra_target, y_pred))
Let’s tweak some parameters and write a few helper functions to make it more estetically pleasant.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={"figure.figsize": (20, 12)})
sns.set_style("whitegrid")
colors = sns.color_palette()
classifier_arr = range(5)
def autolabel(rects):
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
'%d' % int(height),
ha='center', va='bottom')
def setupChart(ax, scores, title, ymin, ytextoffset):
ax.set_title(title, fontsize=30, fontweight='bold')
ax.set_xticks(classifier_arr)
ax.set_xticklabels(models.keys(),{'fontsize':12})
rects = ax.bar(classifier_arr, scores, align='center', color=colors);
for i, v in enumerate(scores):
if(v > ymin):
ax.text(i-.1,v-ytextoffset, str(np.round(v,2)), color='white', fontweight='bold')
ax.set(ylim=(ymin, 1.0))
Finally, let’s visualize all the scores
f, ((ax1, ax2),(ax3,ax4)) = plt.subplots(2,2)
setupChart(ax1,scores['precision'], 'PRECISION',.0,.1)
setupChart(ax2,scores['recall'],'RECALL',.0,.1)
setupChart(ax3,scores['f1'],'F1',.0,.1)
setupChart(ax4,scores['mcc'],'MCC',.0,.1)
The logistic regressor is the only one way below the others. Let’s zoom in onto where the real competition is happening, between 0.9 and 1.0:
f, ((ax1, ax2),(ax3,ax4)) = plt.subplots(nrows=2, ncols=2)
setupChart(ax1,scores['precision'], 'PRECISION',0.9,.01)
setupChart(ax2,scores['recall'],'RECALL',0.9,.01)
setupChart(ax3,scores['f1'],'F1',0.9,.01)
setupChart(ax4,scores['mcc'],'MCC',0.9,.01)
We can clearly see that the Random Forest is the best according to almost every metrics except recall, where it is slightly overperformed by KNNs. But, the overall performance is largely superior, enought to justify the final choice of this model.
Conclusions
This second post was really interesting to write, and ultimately proved a few points. Firstly, that most of the deductions we were able to make in the previous post with the mere use of statistical data visualization were confirmed by stronger tools used here.
Additionally, while optimization is obviously important, this post also demonstrated how most of the algorithms were already performing remarkably well even in their default configurations. It’s necessary to keep in mind that this is not the result of neither randomness nor luck but rather of a well-thought default configuration carried out by the programmers behind the library.
Last but not least, that even a simple feature engineering idea was able to increase the performance of a model which is only further evidence of how fundamental it is to understand the context and build enough confidence to be able to play around with it and eventually add value to the data itself.
See you in the next post!