Tag: supervised

Classification Models – Employee attrition

Modeling for prediction

In order to find a model which could help with the prediction process we ran several data mining models



From the previous results its clear that decision tree stole the show!

However lets think practically

  • It is often required to explain the business why we think a person could leave, in this case we need a model whose output we can explain. In our case a decision tree or logistic regression
  • Sometimes HR would just like to run our model on random data sets , so its not always possible to Balance our datasets using techniques like smote
  • Our model should just be able to predict better than random but imagine the cost of entertaining an employee who was not going to leave but our system tagged him – This is a future improvement for our model
  • XGBoost model created a nice ensemble of trees for us, whose accuracy could increase more than the decision tree if we get more data


We successfully created an early warning system  which immediately tells the Human Resources department if an employee is prune to leave or not.

We achieved this early warning system based on several data mining techniques in order to be  very accurate on supervised classification modelling


EDA and Data Cleaning

Well the data is here

So we first start with EDA

  • Data is imbalance by class we have 83% who have not left the company and 17% who have left the company
  • The age group of IBM employees in this data set is concentrated between 25-45 years
  • Attrition is more common in the younger age groups and it is more likely with females As Expected it is more common amongst single Employees
  • People who leave the company get lower opportunities to travel the company
  • People having very high education tend to have lower attrition
  • The correlation plot was as expected
  • Link to eda workbook in python is here
  • From the Tableau plots we can conclude that below mentioned category are having higher attrition rate:
    • Sales department among all the departments
    • Human Resources and Technical Degree in Education
    • Single’s in Marital status (Will not use this due to GDPR)
    • Male in comparison to females in Gender (Will not use this due to GDPR)
    • Employee with job satisfaction value 1
    • Job level 1 in job level
    • Life balance having value 1
    • Employee staying at distant place
    • Environment Satisfaction value 1


First of all we have categorical data and if we want to run machine learning algorithms in python we need to be able to convert categorical variables(nominal) to dummy variables and ordinal ones to integer values.

Once we are done with that we need to embrace the fact that our data is biased so in order to equalize the class balance we make use of the Synthetic minority oversampling technique (SMOTE). You can google about it.

The code file is located here for your reference ->   https://github.com/mmd52/3XDataMining/blob/master/DataCleaning_And_Smote.ipynb

Linear Regression using Tensor Flow

The best thing to do when starting something new is to start doing something simple.

In our case lets do linear regression in which we will try to predict the price of a house with its size. Yes we will use some falsified data but that’s fine.

Well first things first, every thing in tensor flow is in the form of an array, so we begin initialising our data as arrays


Okay so we have area and prices that is our x and y both in the form of a numpy array.

Now the next step is a very crucial step, in this we will determine

  1. Number of iterations
  2. Learning rate
  3. Cost Function

Why the above 3 steps? well we do it to find the smallest error. We make use of Gradient Descent

learning_rate = 0.01
training_epochs = 1000
cost_history = np.empty(shape=[1],dtype=float)

X = tf.placeholder(tf.float32,[None,n_dim])
Y = tf.placeholder(tf.float32,[None,1])
W = tf.Variable(tf.ones([n_dim,1]))

init = tf.initialize_all_variables()

y_ = tf.matmul(X, W)
cost = tf.reduce_mean(tf.square(y_ - Y))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Now an important thing to note here is that nothing here was actually executed. Tensor flow objects are only executed when they are explicitly called. So we need to explicitly call it. Till then we need to define place holder for variables that will be a part of it.

So for example we need x,y and w for


Finally let us execute tensor flow

sess = tf.Session()

for epoch in range(training_epochs):
 cost_history = np.append(cost_history,
         sess.run(cost,feed_dict={X: train_x,Y: train_y}))

This will actually train the model and find the cost function.

You can find the code for this on git hub here.

If you are looking for something with a bigger data set , you can find the code for regression on the Boston data set using tensor flow here


Running Various Models on Pima Indian Diabetesdata set

EDA was done various inferences found , now we will run various models and verify whether predictions match with the inferences.

As I have mentioned in the previous post , my focus is on the code and inference , which you can find in the python notebooks or R files.

Model Accuracy Precision Recall Kappa AUC
Decion Tree 73.48 75.33 82.48 0.4368 0.727
Naïve Bayes 75.22 82 80.39 0.4489 0.723
KNN 73.91 86.67 76.47 0.3894 0.683
Logistic Regression 76.09 82.67 81.05 0.4683 0.732
SVM Simple 73.91 86.67 76.47 0.3894 0.683
SVM 10 Folds 73.04 82.67 77.5 0.388 0.6883
SVM Linear 10 Folds 78.26 88.67 80.12 0.4974 0.7371
Random Forest 76.52 84 80.77 0.4733 0.733
XGBOOST 77.83 91.61 77.06 0.4981 0.843
Model Accuracy Precision Recall Kappa AUC
Decion Tree 72.73 73 73 0.388 0.7
Naïve Bayes 80.51 80 81 0.5689 0.78
KNN 70.99 70 71 0.337 0.66
Logistic Regression 74.45 74 74 0.3956 0.68
SVM Simple 73.16 73 73 0.4007 0.69
Random Forest 76.62 77 77 0.48 0.73
XGBOOST 79.22 79 79 0.526 0.76

As we can see from the above tables XGBOOST was the clear winner for both the languages.

The Code for Python you can find at -> https://github.com/mmd52/Pima_Python

The code for R you can find at -> https://github.com/mmd52/Pima_R

Decision Tree and Interpretation on Telecom Data

We saw that logistic Regression was a bad model for our telecom churn analysis, that leaves us with Decision tree.

Again we have two data sets the original data and the over sampled data. We run decision tree model on both of them and compare our results.

So running decision tree on the normal data set yielded better results as compared to running on the over sampled data set

Accuracy Kappa Precision Recall Auc
Data 0.9482 0.7513 0.68421 0.90698 0.837
Over Sampled Data 0.8894 0.5274 0.5965 0.5862 0.7656

Unfortunately the decision tree plot was too big for me to put it in this post.

As decision tree is giving the highest level of accuracy , we will select it as the clear winner for our telecom churn analysis problem.

Another major advantage of decision tree is that it could be explained graphically very easily to the end business user on why a particular choice is being made.

You can find the code for decision tree here->


This was a dummy database and may not have yielded the best results , but is a perfect exercise for practice.

Feature Engineering On Telecom Data

Although the Telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values , there is a landslide of class imbalance.

That is why the only thing we will concentrate in our feature engineering is eliminating class imbalance.

> summary(train$Customer_Left)
False True 
 2850 483

Its Visible that retained customers in our training set is 2850 and customer who left are 483. Because of this I will do oversampling on the customers who left to balance the data set.

Let us assume that I do not over sample , then by even not making any model I can simply say customer retained and still be right 85.8% of the time. In order to break this bias I use a package known as SMOTE(Synthetic minority oversampling technique ) you can read about the research paper published in the Journal of Artificial Intelligence Research 16 (2002) here -> https://www.jair.org/media/953/live-953-2037-jair.pdf

> train$Customer_Left<-as.numeric(train$Customer_Left)
> summary(as.factor(train$Customer_Left))
 1 2 
 483 2850 
> train$Customer_Left[train$Customer_Left==2]<-0
> summary(as.factor(train$Customer_Left))
 0 1 
2850 483  
#here false ->1
# true ->0
> train$Customer_Left<-as.factor(train$Customer_Left)
> ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
> ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
> summary(ntrain$Customer_Left)
 0 1 
1932 1449

Now we have simply under sampled retained customers from 2850 to 1932 and over sampled customers who left the operator from 483 to 1449.

Now train has been manipulated hence I also had to manipulate test once.

You can see the complete code here ->  https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R

NOTE-> The libararies.R file consists code that loads packages needed and if they are not installed on your machine it will download and then install them.

Churn Analysis On Telecom Data

One of the major problems that telecom operators face is customer retention. Because of which majority of the Telecom operators want to know which customer is most likely to leave them, so that they could immediately take certain actions like providing a discount or providing a customised plan, so that they could retain the customer.

However accuracy required while building a churn analysis model needs to be very high, imagine if our model has a accuracy of just 75% and the total number of customers who want to leave are just 5% , this leaves a margin of 20% of customers who were wrongly classified as customers who will leave the operator. If an operator has 10000 customers,And 2500 customers are predicted to leave , the operator may have to release lets assume a 1$ credit to all that’s a cost of 2500$, where as credits that required to be released was only for 5% of the customer’s that is a cost of 500$, hence the operator spent 2000$ for no reason. If the operator has high number of customers it would lead to a huge loss.

Coming to the data quotient, there is no freely available telecom data as far as I know available, however the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis, this data is not real but represents real world scenarios and is good from the perspective of understanding and learning.

The data on the website is classified into train and test has no NA’s means no feature engineering as such to be done before running models on it.

Now comes the question of which models to run on it. Some would say since we need very high accuracy hence we will run xgboost or random forest, however the downside we have here is that we cannot explain to the operator on what basis is XGBOOST or random forest determining why will the customer leave him. Even if we manage to explain its very complicated and will not be accepted.

Because of this we will have to take support on models that can be easily explained to the customer. This leaves us with two models for classification .i.e. customer leaves -> 0 or customer is retained -> 1. So the models are Logistic regression and decision tree.

Why Logistic Regression ?  well because we can explain to the operator why customer is leaving him thanks to the logit equation.

Why Decision Tree? well because there is a neat flow of how our tree makes decision by breaking variables and deciding yes and no based on entropy and impurity.

Further in this post category I will show feature engineering to Running models, to interpretation.

The data available from the website is a bit complex to save to a CSV file so if you need you can download the train and test data from below.

Also explanation of variables is not provided as it is fairly simple.


Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model


Revised Approach To UCI ADULT DATA SET

If you have seen the posts in the uci adult data set section, you may have realised I am not going above 86% with accuracy.

An important thing I learnt the hard way was to never eliminate rows in a data set. Its fine to eliminate columns having NA values above 30% but never eliminate rows.

Because of this I had to redo my feature engineering. So how to fix my missing NA values , well what i did was , I opened my data set in excel and converted all ‘?’ mark values to ‘NA’

This would make feature engineering more simple. The next step is to identify columns with missing values, and see if their missing values were greater than 30% in totality.

In our case type_employer had 1836 missing values

occupation had a further 1843 missing values

and country had 583 missing values.

So what I did was , I predicted the missing values with the help of other independent variables(No I didnt add income here for predicting them). Once my model was made i used it to replace the missing values in the columns. Thus i had a clean data set with no missing values.

I admit the predictions were not that great , but they were tolerable.

Because of which when I ran the following models my accuracy skyrocketed

  1. Logistic Regression -> 85.38%
  2. Random Forest(Excluding variable country)  -> 87.11%
  3. SVM -> 85.8%
  4. XGBOOST with 10 folds -> 87.08%

Continue reading “Revised Approach To UCI ADULT DATA SET”

Stacking on Numeric Data Sets

As is human nature we always want to get a better prediction , if possible some would pray for a full 100%.

Anyways ignoring the Hypothetical, We have run a number of common models like

1)Logistic Regression

2)Random Forest


so now the question arises, whether we can give it a tad bit push for a better accuracy?

Continue reading “Stacking on Numeric Data Sets”