A casual exploration of Human Resources Analytics, Part 1: Data Visualization


Data is fascinating. To me, it's no surprise that we are moving toward an ever-increasing production of data, to the point that each day we (and even our everyday tools) produce and consume huge amounts of it. This data describes our world, from our everyday life to our bank account, from the railway system of a region to the economic movements of entire nations. It's huge, and it grows larger with every second that goes by. With this increase comes a significant problem: too much information equals zero information. All this information only means something if we're able to process it, read it and extract the right information from it.

This is where Data Visualization comes into play: we're given the task of making sense of millions of numbers, strings and other types of data, processing them as appropriate and putting that data into a clear, efficient representation which makes it easy for everyone to extract important information in the blink of an eye. It's a fascinating process in my opinion, and many others feel likewise.

This is why I decided to write a post presenting a typical process of exploring a dataset.

Let’s grab our tools:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

and begin our exploration.

The Dataset

The dataset was made available on Kaggle at Human Resources Analytics. It's a simulated dataset of 14,999 employees at a hypothetical company seeking to understand the patterns behind employees leaving said company. Fields in the dataset are described as follows:

Field                  Meaning
satisfaction_level     Level of satisfaction (0-1)
last_evaluation        Time since last performance evaluation (in years)
number_project         Number of projects completed while at work
average_montly_hours   Average monthly hours at workplace
time_spend_company     Number of years spent in the company
Work_accident          Whether the employee had a workplace accident
left                   Whether the employee left the workplace or not (1 or 0)
promotion_last_5years  Whether the employee was promoted in the last five years
sales                  Department they work in
salary                 Relative level of salary (low, medium, high)

Let’s begin with reading the dataset and printing its first few lines to get a sense of what we’re dealing with.

hra = pd.read_csv('HR_comma_sep.csv')
hra.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

The immediate next step is to check the dataset for completeness and see whether we have to deal with null values and the like.

hra.isnull().any()
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
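As a side note, if any column had contained missing values, isnull().sum() would report per-column counts instead of just a boolean. A quick sketch on a tiny made-up frame (the values are invented, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Tiny invented frame standing in for the real dataset
df = pd.DataFrame({'satisfaction_level': [0.38, np.nan, 0.11],
                   'left': [1, 1, 0]})

print(df.isnull().sum())   # per-column count of missing values
print(df.isnull().any())   # the boolean summary used above
```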

Everything looks good here. Let’s rename some columns for ease of use.

hra=hra.rename(columns = {'sales':'department','time_spend_company':'company_years'})
hra.head()
satisfaction_level last_evaluation number_project average_montly_hours company_years Work_accident left promotion_last_5years department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Let’s buckle up and dive in!

Target

Obviously, our target feature is the left column, which expresses whether or not an employee left the company. The company's desire is to understand how this feature might be related to the others, and how the others could help predict whether an employee is likely to leave soon, so that appropriate countermeasures can be taken (proposing a raise, assigning them to another project, maybe firing them before they can leave on their own, or whatever smart strategy they want to put in place).

As usual, before we start exploring random features hoping to be lucky enough to stumble upon some valuable insight, we should think about the context we're dealing with. Our ultimate purpose is to understand why employees tend to leave a certain position, and we said above that one of the first countermeasures could be to increase their salary. Salary is a good reason for people to seek a better position at another company, so it could be an interesting parameter to explore. First, let's see how many different values this feature takes.

print(hra['salary'].unique())
['low' 'medium' 'high']
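Besides listing the distinct labels, value_counts would also tell us how the employees distribute across those ranges. A sketch on an invented stand-in for the salary column:

```python
import pandas as pd

# Invented stand-in for the real salary column
salary = pd.Series(['low', 'medium', 'high', 'low', 'medium', 'low'])

print(salary.unique())        # distinct labels, as above
print(salary.value_counts())  # size of each salary range
```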

Just three salary ranges, which could mean somebody already processed the numeric data and decided this was the most appropriate form in which to present it. This quantisation might have cut out significant data: a wider range of salaries could have contained useful information. A certain employee might have left not because his salary was in the high range, but because it was at the lower end of the high range and he considered himself capable of striving for more. These kinds of considerations are usually made before preprocessing such a dataset, although in some cases we might be limited in our freedom for privacy-preserving purposes and the like.

Anyway, this is what we have to make the most of, so let's. We need a clear, visual way to present this information, and that's where Seaborn comes to the rescue.

Seaborn comes to the rescue.

I recently happened to play around with Seaborn, a Python visualization library. It’s based on Matplotlib and it caught my attention for its ease of use when dealing with statistical data.

Let me show you what I'm talking about by trying to visualize the count of leaving/staying employees divided by salary range. First, I will do so using Matplotlib; only then, after having earned our result the hard way, will we do the same thing with Seaborn.

First, we need to create three different datasets:

salary_datasets = []
salary_datasets.append(hra[hra['salary']=='low']['left'])
salary_datasets.append(hra[hra['salary']=='medium']['left'])
salary_datasets.append(hra[hra['salary']=='high']['left'])

While this solution works, it’s obviously not reusable. Let’s write a more generalized version:

salary_datasets = []
for salary in hra["salary"].unique():
    salary_datasets.append(hra[hra["salary"]==salary]["left"])
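As a side note, the same split can be written with groupby, which scans the frame once instead of once per category. A sketch on a toy frame (note that groupby sorts the labels alphabetically, unlike unique()):

```python
import pandas as pd

# Toy frame in place of the real dataset
hra = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low'],
                    'left':   [1, 0, 0, 1]})

# One pass over the data; yields one 'left' Series per salary level,
# in alphabetical order of the labels: high, low, medium
salary_datasets = [group['left'] for _, group in hra.groupby('salary')]
print([len(g) for g in salary_datasets])
```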

Cool. The way we visualize one count plot is quite straightforward, and we could think of visualizing the three of them side by side.

plt.figure(1,figsize=(24, 6))
plt.subplot(131)
plt.xlabel("left")
plt.ylabel('count')
plt.title("low")
plt.hist(salary_datasets[0])
plt.subplot(132)
plt.xlabel("left")
plt.ylabel('count')
plt.title("medium")
plt.hist(salary_datasets[1])
plt.subplot(133)
plt.xlabel("left")
plt.ylabel('count')
plt.title("high")
plt.hist(salary_datasets[2])

plt.show()

png

I also added labels to make it more readable. This is better, but still not optimal. We wish to make an immediate comparison, and all this space certainly doesn't help. Let's put those bars side by side and fix a few more parameters.

plt.figure(figsize=(10,6))
plt.hist(salary_datasets, bins=[0.0,0.5,1.0])
plt.xlabel("left")
plt.ylabel('count')
plt.gca().yaxis.grid(True)
plt.gca().set_axisbelow(True)
plt.xticks([0,1])
plt.legend(hra["salary"].unique())
plt.show()

png

Now we're talking. The visualization still isn't optimal, with the x-axis labels all out of place. Let's step up our game with Seaborn.

First, we need to import the library and set the figure size, along with a style which I prefer for its cleanliness.

import seaborn as sns
sns.set(rc={"figure.figsize": (10, 6)})
sns.set_style("whitegrid")

Now, to the real magic. We will use seaborn.countplot, which shows the counts of observations in each categorical bin using bars. We need absolutely no preprocessing of the dataset, as it is an input parameter of the function itself, along with the feature whose values we wish to count and an extra parameter, namely hue, which adds a nested categorical variable.

sns.countplot(x="left", hue="salary", data=hra);

png

That's it. In just one, simple line of code we get the desired visualization. It was easy to obtain, clear to read, and better looking even with the default parameters. And keep in mind that, if one wished, Seaborn comes with many more options to make the chart look even better.

This simple example is one of many that show how well suited Seaborn is for fast prototyping when it comes to data exploration and visualization, allowing us to easily visualize one or many features in comparison, with the support of statistical functions that help us make the most of each chart.

Obviously it comes at a price that any seasoned programmer should have learned by now: every high-level library provides efficient and beautiful ways to do what it was intended for, and not much flexibility to do anything else. Luckily, this library was intended for exactly what we intend to do, so it should suit our needs.

Data Exploration

We spent so much time and so many words on how to produce that chart; let's make use of that effort now.

Salary

sns.countplot(x="left", hue="salary", data=hra, palette=sns.color_palette("BuGn_r"));

png

This chart already gives us a powerful yet predictable insight. Even taking the proportions into account, the lower the salary, the higher the number of employees who left (duh). Simple observations about our domain lead immediately to a better understanding of the problem.
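Since the three salary groups differ in size, the raw counts are easier to compare once normalized. A row-normalized pandas.crosstab gives the share of leavers per salary range directly; here sketched with made-up numbers:

```python
import pandas as pd

# Invented counts, not the real dataset
hra = pd.DataFrame({'salary': ['low'] * 4 + ['medium'] * 3 + ['high'] * 3,
                    'left':   [1, 1, 1, 0,  1, 0, 0,  0, 0, 1]})

# normalize='index' makes each row sum to 1: column 1 is the leave rate
rates = pd.crosstab(hra['salary'], hra['left'], normalize='index')
print(rates)
```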

Satisfaction Level

Another obvious parameter at least worth looking at is satisfaction_level, so let's.

left_palette = sns.xkcd_palette(["faded green","pale red"])
sns.stripplot(x="left",y="satisfaction_level",data=hra,palette=left_palette);

png

This is not very useful; let's tweak a few parameters to increase readability.

sns.stripplot(x="left",y="satisfaction_level", data=hra, jitter=True, alpha=.2,palette=left_palette);

png

This is more like it. We can easily see that the majority of those who didn't leave reported a high satisfaction, which was to be expected. The satisfaction levels of the employees who left are far more interesting.

They are divided into three major ranges with very little in between. Not surprisingly, there is a high concentration around 0 and another around 0.4, which could be a good warning sign of people wanting to leave. But we can also spot a remarkable concentration at the higher end of the spectrum, which shouldn't necessarily surprise us; two out of many possible explanations for this phenomenon come to mind:

- they were happy at this company but received a better offer;
- people, sometimes, lie.

We don't have information about how this satisfaction level was obtained, but in case it was self-reported, it wouldn't be wrong to consider such a scenario.

Let’s visualize the same data via a boxplot.

sns.boxplot(x="left", y="satisfaction_level", data=hra,palette=left_palette);

png

This looks pretty straightforward. More than 50% of the employees who left had a satisfaction level lower than 0.4, whereas only 25% of the employees who haven't left yet have it below 0.6. Using more than one visualization allowed for a deeper understanding of the distributions.
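The quartiles the boxplot draws can also be read off numerically with a grouped quantile, which is handy when eyeballing the chart isn't precise enough. A sketch with invented satisfaction values:

```python
import pandas as pd

# Invented values standing in for the real columns
hra = pd.DataFrame({'left': [1, 1, 1, 1, 0, 0, 0, 0],
                    'satisfaction_level': [0.1, 0.2, 0.4, 0.8,
                                           0.5, 0.7, 0.8, 0.9]})

# Lower quartile, median and upper quartile, split by the target
q = hra.groupby('left')['satisfaction_level'].quantile([0.25, 0.5, 0.75])
print(q)
```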

Salary Through the Years

The next thing I'm interested in visualising is the salary range with respect to the years spent at the company.

sns.factorplot(x="salary", y="left", col="company_years", data=hra, col_wrap=4,);

png

Now this is a really interesting one. It tells us that it is very unlikely for somebody to leave during their second year, or after having spent more than 7 years at the company. The years in the middle are much more telling. After having spent 3 years at the company, there is a tendency for people to leave, especially in the lower salary range. This phenomenon increases steadily over the next few years, culminating at 5 years, when the employees with a lower salary are the most likely to leave. By the 6th year, the percentage of leavers already starts to decrease toward the flatline observed from 7 years on.

Company’s Strength

Considering the context of our analysis, department could be an important piece of information to look into. Some of a company's departments may be very efficient, well paid and satisfying to work in, whereas others might not hold the same standard. We may wish to visualize the departures for each department.

sns.factorplot("left", col="department", col_wrap=5, data=hra,
               kind="count", size=3, aspect=.8);

png

While there doesn't appear to be any extraordinary case of more people leaving than staying, we can see, by carefully looking at the proportions of the two bars, that some departments are more prone to being left than others. E.g., the number of people leaving the hr or accounting department is half the number of people staying, whereas this ratio is only a third in the sales department.
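The proportions the eye has to estimate from the bars can be computed exactly: since left is 0/1, its mean per department is the departure rate. A sketch on a toy frame:

```python
import pandas as pd

# Invented frame: 1 of 3 leaves in hr, 1 of 4 leaves in sales
hra = pd.DataFrame({'department': ['hr', 'hr', 'hr',
                                   'sales', 'sales', 'sales', 'sales'],
                    'left':       [1, 0, 0, 1, 0, 0, 0]})

# Mean of a 0/1 column == fraction of leavers per department
rates = hra.groupby('department')['left'].mean().sort_values(ascending=False)
print(rates)
```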

Hard work pays off

A feature that surely deserves our attention in this context is average_montly_hours. An employee working too hard could be dissatisfied. Let's visualize the distribution of the feature to understand what is to be regarded as normal.

sns.distplot(hra["average_montly_hours"]);

png

We can easily see that most people work from 150 to 250 hours a month. Let's see how the distributions vary with respect to the left target feature.

sns.violinplot(x="left", y="average_montly_hours", data=hra,palette=left_palette);

png

We can notice that, not surprisingly, people who work the most have a tendency to leave. A much more interesting pattern is that of people working very few hours and quitting the company.

This feature is interesting to observe but not very useful on its own. Its combination with salary, or even with the satisfaction_level parameter, could tell us more about whether the long hours are worth it.

sns.violinplot(x="salary", y="average_montly_hours", hue="left", data=hra, split=True,palette=left_palette);

png

Excluding some slight variations, the distributions look very similar for every compensation range, which is actually interesting, as one could have expected otherwise.
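The visual similarity can also be checked numerically, for instance with a pivot table of median hours per salary range and outcome. A sketch with invented hours:

```python
import pandas as pd

# Invented hours, not the real dataset
hra = pd.DataFrame({'salary': ['low', 'low', 'low', 'high', 'high', 'high'],
                    'left':   [0, 1, 1, 0, 1, 0],
                    'average_montly_hours': [200, 150, 280, 210, 160, 190]})

# Median monthly hours per (salary range, outcome) cell
medians = hra.pivot_table(values='average_montly_hours',
                          index='salary', columns='left', aggfunc='median')
print(medians)
```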

Work accomplishments

Too little or too much work could have a great impact on one's job satisfaction. Staying on the same project for too long or too short a time could be a problem for some and a perk for others. Either way, it's a feature worth considering. Let's visualize the count of leaving people per number of projects.

sns.countplot('number_project', hue='left', data=hra,palette=left_palette);

png

This graph, while it gives us the powerful insight that relatively more people leave having done fewer than 3 projects or more than 5, doesn't really give us much perspective. We wish to give such numbers more context, for example by taking into account the number of years an employee has worked at the company. 6 projects in 2 years is a whole different matter than 2 projects in 6 years!
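To attach numbers to that "relatively more" reading, a row-normalized crosstab of project count against the target yields the leave rate per number of projects. Again on invented data:

```python
import pandas as pd

# Invented counts: low and high project counts leave more often
hra = pd.DataFrame({'number_project': [2, 2, 2, 4, 4, 4, 7, 7],
                    'left':           [1, 1, 0, 0, 0, 0, 1, 1]})

# Each row sums to 1; column 1 is the leave rate per project count
rates = pd.crosstab(hra['number_project'], hra['left'], normalize='index')
print(rates)
```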

sns.factorplot("left", col="company_years",hue="number_project", col_wrap=4, data=hra, kind="count",size=4);

png

This one takes a moment to read. In chart i, we have the count of employees who spent i years at the company, divided into two sets of bars depending on whether they left. Each set of bars is colored by number of projects; their order makes a single set easier to read, while their coloring makes it fast to compare different charts.

A few interesting patterns emerge here. The second row of charts is not very useful since, as noted earlier, very few people leave after having spent 6 or more years at the company (and very few people seem to stay as well, but still enough to make some assumptions). The first row is much more interesting.

We can easily notice how the left set of bars in each chart has one or two differently colored bars rising above all others. After the 3rd year, employees who only did 2 projects are more likely to leave. After the 4th, employees who did 6 to 7 projects are likely to leave (overworking?). After the 5th year, employees who did 4 to 5 projects are the most likely. These are interesting observations even though, at first glance, there doesn't seem to be an obvious pattern, since we don't get any sense of linearity from them. We observe the risky number of projects as 2, then 6 to 7, then 4 to 5, which doesn't seem to make much sense.

Let's keep in mind that even observations that don't quite make sense on their own could prove useful once combined with other features or insights.

Conclusions

It was an interesting dataset. We were able to use our tools to discover some interesting patterns which could be a good starting point for further analysing the relationships and correlations inside the dataset. We also explored some of the other features contained in the dataset, but they didn't seem to contain much information target-wise. This kind of discovery is also an accomplishment, as it might help us cut out some of the noise and better focus on where the real information lies.

We were lucky enough to find a dataset which was moderately accessible, as its context was rather easy to understand; this allowed us to make some useful observations about which features were worth exploring first, which others could provide useful insight if combined, and so on. There are many other cases where the context, and therefore the data, isn't as clear or as approachable, and a lot more time must be spent properly understanding what we're handling.

Obviously it helps to know what we're dealing with and what we're looking for, even if nowadays tools are moving toward an approach that automates some of these processes. Some Machine Learning algorithms can be used to provide useful insights about data correlations, but we plan to discuss that in a future post.

Our goal was to showcase the power of data visualization, and we hope we satisfied that goal by providing a few valuable examples of how the way we shape and look at data makes all the difference in the world, turning a huge amount of numbers, strings and bytes into a better understanding of our world. If that's not fascinating, I don't know what is.