Tag: machinelearning

IBM Employee HR Attrition

It’s a new day: a client walks in and says he needs your help.

Our client is ABC, a leading firm that is doing well in its sector. It has recently faced a steep increase in employee attrition, which has gone up from 14% to 25% in the last year. We are asked to prepare a strategy to tackle this issue immediately so that the firm’s business is not hampered, and also to propose an efficient employee satisfaction program for the long run. Currently, no such program is in place. Further salary hikes are not an option.

data is here

Well, this is a nice business problem, so let’s do some more research on it:

The attrition problem is not unique to ABC. Other IT companies face it too, such as XYZ, India’s second largest IT services company, which is also battling high attrition, with a peak of 20.4% in the October-December quarter of FY15.

Now that we know the market situation, what can we do?


From this decision tree it should be clear that we will create an early warning system to help the company identify the employees who are most likely to leave.

In the following posts we will go through:

  1. EDA
  2. Data cleansing
  3. Classification models


But why is a company so affected by employee attrition?

  • Cost of training a new employee
  • Cost of acquiring a new employee
  • Most importantly, an employee is an asset that adds value to a company. When an employee leaves, part of that value leaves too, and the company ends up spending an enormous sum trying to replace the employee and recreate the value it lost.

Cash Forecasting – Overview

How do ATMs work in general? Banks at times prefer not to manage their own ATMs, as it involves a lot of overhead such as transportation of cash, maintenance of the machines, rent and, most importantly, security.

To avoid this overhead, a lot of banks outsource the task. The companies that take over this responsibility earn revenue on every transaction made. Say that for every non-cash transaction at an ATM they manage they get $x, and for every cash transaction they get $y, where y > x.

So why do we need to predict cash? These companies rent a place, install their ATMs there, keep a service engineer to maintain the machines and provide enough security, but where they need to be careful is interest cost. What interest cost? Let’s say that today I decide to keep $100 in my ATM. I would borrow this money from a bank, to which I would pay interest every day on the cash that is not withdrawn by customers.

The obvious solution is to load ATMs with the smallest amount of money possible. However, this leads to two problems: the first is loss of revenue from potential customers who cannot withdraw cash, and the second is damage to the brand, which is very bad.

That means we do not want to load too much money, to avoid paying interest on idle cash, and neither do we want to load too little, to avoid loss of revenue and brand damage. To find this balance we need a forecasting model for how much money to load into the ATMs, so that the business stays profitable.

One underlying constraint is transportation. To keep transportation costs down we cannot transport and load money into ATMs daily, which is why loading will happen only once every two to three days.
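The trade-off above can be sketched as a small cost comparison in R. Everything in it is hypothetical: the demand figures, the daily interest rate, the fee per cash transaction, the average withdrawal size and the helper name `cycle_cost` are illustrative assumptions, not figures from a real ATM operator, and brand damage is not modelled at all.

```r
# Hypothetical forecast of cash withdrawn on each day of a 3-day loading cycle
forecast_demand <- c(4000, 5500, 3000)   # dollars per day (assumed)
daily_interest  <- 0.0003                # interest per dollar per day (assumed)
fee_per_txn     <- 0.50                  # revenue per cash transaction (assumed)
avg_withdrawal  <- 100                   # average withdrawal size (assumed)

# Cost of loading `load` dollars at the start of the cycle:
# interest on the cash held at the start of each day, plus the revenue
# lost whenever demand exceeds the cash still available.
cycle_cost <- function(load, demand) {
  cash <- load
  interest <- 0
  lost_revenue <- 0
  for (d in demand) {
    interest <- interest + cash * daily_interest
    served <- min(cash, d)
    lost_revenue <- lost_revenue + (d - served) / avg_withdrawal * fee_per_txn
    cash <- cash - served
  }
  interest + lost_revenue
}

# Try a grid of candidate loads and keep the cheapest one
candidate_loads <- seq(0, 20000, by = 500)
costs <- sapply(candidate_loads, cycle_cost, demand = forecast_demand)
best_load <- candidate_loads[which.min(costs)]
print(best_load)
```

With these made-up numbers a lost transaction costs more than the interest on the cash that would have served it, so the cheapest load simply covers the forecast demand for the cycle. A real model would replace `forecast_demand` with an actual forecast and account for its uncertainty.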

Feature Engineering On Telecom Data

Although the Telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values, there is a severe class imbalance.

That is why the only thing we will concentrate on in our feature engineering is reducing class imbalance.

> summary(train$Customer_Left)
False True 
 2850 483

It is visible that the training set has 2850 retained customers and 483 customers who left. Because of this I will oversample the customers who left to balance the data set.

Let us assume that I do not oversample. Then, without building any model at all, I can simply predict “customer retained” every time and still be right about 85.5% of the time (2850 out of 3333). To break this bias I use SMOTE (Synthetic Minority Over-sampling Technique). You can read the research paper, published in the Journal of Artificial Intelligence Research 16 (2002), here -> https://www.jair.org/media/953/live-953-2037-jair.pdf
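The majority-class baseline can be checked directly from the class counts printed above. This is just a quick sketch; the variable names are my own.

```r
# Class counts taken from the summary() output above
class_counts <- c(retained = 2850, left = 483)

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(round(100 * baseline_accuracy, 1))
```

Any real model has to beat this number before its accuracy means anything.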

> train$Customer_Left<-as.numeric(train$Customer_Left)
> summary(as.factor(train$Customer_Left))
 1 2 
 483 2850 
> train$Customer_Left[train$Customer_Left==2]<-0
> summary(as.factor(train$Customer_Left))
 0 1 
2850 483  
# here False (retained) -> 0
# True (left) -> 1
> train$Customer_Left<-as.factor(train$Customer_Left)
> ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
> ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
> summary(ntrain$Customer_Left)
 0 1 
1932 1449

We have now undersampled retained customers from 2850 to 1932 and oversampled customers who left the operator from 483 to 1449.

Now that train has been manipulated, I also had to manipulate test once.
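The two class sizes follow from the percentages passed to SMOTE. Here is a sketch of the arithmetic, assuming the under-sampling percentage was left at a default of 200; it is not set in the call above, so treat that value as my assumption.

```r
n_minority <- 483     # customers who left, before resampling
perc_over  <- 200     # value passed as perc.over in the call above
perc_under <- 200     # ASSUMED default for perc.under (not set in the call)

# perc.over = 200 creates two synthetic rows per original minority row
synthetic_cases <- n_minority * perc_over / 100
minority_after  <- n_minority + synthetic_cases

# perc.under is applied relative to the number of synthetic rows
majority_after  <- synthetic_cases * perc_under / 100

print(c(majority = majority_after, minority = minority_after))
```

These figures match the 1932 and 1449 reported by summary(ntrain$Customer_Left).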

You can see the complete code here ->  https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R

NOTE -> The libararies.R file contains code that loads the required packages; if they are not installed on your machine, it downloads and installs them.

Churn Analysis On Telecom Data

One of the major problems that telecom operators face is customer retention. Because of this, the majority of telecom operators want to know which customers are most likely to leave, so that they can immediately take action, such as offering a discount or a customised plan, to retain them.

However, the accuracy required of a churn model needs to be very high. Imagine that our model has an accuracy of just 75% and that only 5% of customers actually want to leave. If the model flags 25% of customers as leaving, 20% of all customers have been wrongly classified as customers who will leave the operator. If an operator has 10,000 customers and 2,500 are predicted to leave, and the operator releases, say, a $1 credit to each of them, that is a cost of $2,500, whereas credits were needed for only the 5% who were really leaving, a cost of $500. The operator has spent $2,000 for no reason. For an operator with a large number of customers this would lead to a huge loss.
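The cost of those false alarms can be worked through in a few lines. The sketch below uses the numbers from the paragraph above and, like the text, assumes that every customer who really wants to leave is among those flagged.

```r
n_customers         <- 10000
share_flagged       <- 0.25   # share of customers predicted to leave
share_leaving       <- 0.05   # share who actually want to leave
credit_per_customer <- 1      # dollars, as assumed in the text

cost_spent  <- n_customers * share_flagged * credit_per_customer
cost_needed <- n_customers * share_leaving * credit_per_customer
wasted      <- cost_spent - cost_needed

print(c(spent = cost_spent, needed = cost_needed, wasted = wasted))
```

The wasted amount scales linearly with the customer base, which is why the same error rate hurts a large operator far more.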

Coming to the data: as far as I know there is no freely available real telecom data. However, the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis. This data is not real, but it represents real world scenarios and is good for understanding and learning.

The data on the website is already split into train and test and has no NAs, which means there is no feature engineering as such to be done before running models on it.

Now comes the question of which models to run on it. Some would say that since we need very high accuracy we should run XGBoost or random forest. The downside is that we cannot easily explain to the operator on what basis XGBoost or random forest decides that a customer will leave. Even if we manage to explain it, the explanation is very complicated and will not be accepted.

Because of this we have to rely on models that can be easily explained to the customer. This leaves us with two models for classification, i.e. customer leaves -> 0 or customer is retained -> 1: logistic regression and decision tree.

Why logistic regression? Because, thanks to the logit equation, we can explain to the operator why a customer is leaving.

Why decision tree? Because there is a neat flow in how the tree makes decisions: it splits on variables and decides yes or no based on measures of impurity such as entropy.

Further posts in this category will go from feature engineering to running models to interpretation.

The data available from the website is a bit awkward to save to a CSV file, so if you need it you can download the train and test data from below.

Also, an explanation of the variables is not provided, as they are fairly self-explanatory.
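To make the point about the logit equation concrete, here is a minimal sketch on simulated data. The column names `service_calls` and `intl_plan` and the coefficients used to simulate churn are hypothetical; this is not the model fitted on the telecom data.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  service_calls = rpois(n, 2),        # hypothetical predictor
  intl_plan     = rbinom(n, 1, 0.2)   # hypothetical predictor
)

# Simulate churn from a known logit equation (coefficients are made up)
log_odds <- -2 + 0.5 * toy$service_calls + 1.0 * toy$intl_plan
toy$left <- rbinom(n, 1, plogis(log_odds))

fit <- glm(left ~ service_calls + intl_plan, data = toy, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of leaving
# for a one-unit increase in that variable, holding the others fixed
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 reads as "each extra unit of this variable raises the odds of leaving", which is the kind of statement an operator can act on.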
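The impurity measures a tree uses to choose a split can be computed by hand. The sketch below scores one hypothetical split with both entropy and Gini impurity; the customer counts are made up for illustration.

```r
# Entropy of a node, given the class counts in it
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Gini impurity of a node, given the class counts in it
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical split of 100 customers; each vector is c(retained, left)
parent      <- c(80, 20)
left_child  <- c(70, 5)
right_child <- c(10, 15)

# Impurity of the children, weighted by how many customers land in each
weighted <- function(f) {
  (sum(left_child) * f(left_child) + sum(right_child) * f(right_child)) / sum(parent)
}

information_gain <- entropy(parent) - weighted(entropy)
gini_decrease    <- gini(parent) - weighted(gini)
print(c(information_gain = information_gain, gini_decrease = gini_decrease))
```

A tree evaluates many candidate splits this way and keeps the one with the largest drop in impurity, which is why each branch can be read out as a plain yes/no rule.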


Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model


Revised Approach To UCI ADULT DATA SET

If you have seen the posts in the UCI adult data set section, you may have realised that I was not getting above 86% accuracy.

An important thing I learnt the hard way was to never eliminate rows in a data set. It’s fine to eliminate columns with more than 30% NA values, but never eliminate rows.

Because of this I had to redo my feature engineering. So how did I fix my missing values? I opened my data set in Excel and converted all ‘?’ values to ‘NA’.

This makes feature engineering simpler. The next step is to identify the columns with missing values and see whether more than 30% of their values are missing.

In our case type_employer had 1836 missing values

occupation had a further 1843 missing values

and country had 583 missing values.
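Counting missing values per column is easy to do in R itself. The sketch below is one alternative to editing the file by hand: it reads ‘?’ as NA while loading and then reports the count and percentage of missing values per column. The four rows are a made-up stand-in, not the real data.

```r
# Tiny stand-in for the Adult data; rows and values are made up
raw <- "age,type_employer,occupation,country
39,State-gov,Adm-clerical,United-States
50,?,Exec-managerial,United-States
38,Private,?,?
53,Private,Handlers-cleaners,United-States"

# na.strings turns '?' into NA; strip.white removes padding around fields
adult <- read.csv(text = raw, na.strings = "?", strip.white = TRUE)

missing_count <- colSums(is.na(adult))
missing_pct   <- 100 * missing_count / nrow(adult)
print(data.frame(missing_count, missing_pct))

# Columns that would exceed the 30% threshold mentioned above
over_threshold <- names(missing_pct)[missing_pct > 30]
print(over_threshold)
```

On the real file you would pass the file path to read.csv instead of `text = raw`.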

So what I did was predict the missing values with the help of the other independent variables (no, I didn’t include income when predicting them). Once a model was built, I used it to replace the missing values in the column. Thus I had a clean data set with no missing values.

I admit the predictions were not that great, but they were tolerable.

As a result, when I ran the following models my accuracy improved:
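Here is a simplified sketch of model-based imputation. It is not the code used for this post: the data are simulated, the column to impute has only two levels so that a plain logistic regression is enough, and the predictor names are hypothetical. The real columns have many levels and would need a multi-class model.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  age   = round(runif(n, 20, 65)),
  hours = round(runif(n, 20, 60))
)

# Hypothetical two-level employer type that depends on age
toy$type_employer <- factor(
  ifelse(runif(n) < plogis((toy$age - 40) / 5), "Gov", "Private")
)
toy$type_employer[sample(n, 20)] <- NA   # knock out 20 values

known <- !is.na(toy$type_employer)

# Fit only on rows where the value is known; income is deliberately
# not used as a predictor
fit <- glm(type_employer ~ age + hours, data = toy[known, ], family = binomial)

# For a factor response, glm models the probability of the second level
p_second <- predict(fit, newdata = toy[!known, ], type = "response")
lv <- levels(toy$type_employer)
toy$type_employer[!known] <- ifelse(p_second > 0.5, lv[2], lv[1])

print(sum(is.na(toy$type_employer)))
```

Leaving the target variable out of the imputation model keeps information about income from leaking into the predictors.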

  1. Logistic Regression -> 85.38%
  2. Random Forest (excluding variable country) -> 87.11%
  3. SVM -> 85.8%
  4. XGBoost with 10 folds -> 87.08%

Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyper-plane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyper-plane which categorises new examples.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
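A small numeric check shows what “implicitly mapping” means. For the degree-2 polynomial kernel on 2-D inputs, the kernel value equals an inner product in a 3-D feature space, and that space never has to be built. The function names `phi` and `poly_kernel` are mine.

```r
# Explicit feature map for 2-D inputs whose inner product equals the
# degree-2 polynomial kernel (x . y)^2
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

# The kernel itself, computed in the original 2-D space
poly_kernel <- function(x, y) sum(x * y)^2

x <- c(1, 2)
y <- c(3, -1)

explicit <- sum(phi(x) * phi(y))   # inner product in the 3-D feature space
implicit <- poly_kernel(x, y)      # same value without building phi
print(c(explicit = explicit, implicit = implicit))
```

For kernels such as the Gaussian one the feature space is infinite-dimensional, so the implicit route is the only practical one.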

When data are not labelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The clustering algorithm which provides an improvement to the support vector machines is called support vector clustering and is often used in industrial applications either when data are not labelled or when only some data are labelled as a preprocessing for a classification pass.

SVM can make some impressive predictions. For example, when you use the tune function you specify a range of cost and epsilon values, and tune automatically picks the SVM model with the lowest error for us.

Support vector machine algorithms can be very computationally intensive, and in our case they are, given the large number of data rows. It took my machine 10 hours to process the model completely.


Extreme Gradient Boosting

The term ‘boosting’ refers to a family of algorithms that convert weak learners into strong learners.

How would you classify an email as spam or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using the following criteria:

  1. Email has only one image file (promotional image): it’s spam
  2. Email body contains a sentence like “You won a prize money of $ xxxxxx”: it’s spam
  3. Email is from a known source: not spam

Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners. To convert weak learners into a strong learner, we combine the prediction of each weak learner to form one definitive strong learner.
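The combination step can be sketched as a weighted vote over the three rules above. The four example emails and the error rates are made up, and the weighting formula is the AdaBoost-style one. A real boosting algorithm would also learn the rules one after another, re-weighting the training examples each round, which this sketch leaves out.

```r
# Four made-up emails described by the three rules from the list above
emails <- data.frame(
  only_image     = c(TRUE,  FALSE, TRUE,  FALSE),
  mentions_prize = c(TRUE,  TRUE,  FALSE, FALSE),
  known_sender   = c(FALSE, FALSE, TRUE,  TRUE)
)

# Each weak rule votes +1 (spam) or -1 (not spam)
votes <- cbind(
  rule_image  = ifelse(emails$only_image, 1, -1),
  rule_prize  = ifelse(emails$mentions_prize, 1, -1),
  rule_sender = ifelse(emails$known_sender, -1, 1)
)

# Assumed error rate of each rule on some training set (made-up numbers)
error_rate <- c(rule_image = 0.40, rule_prize = 0.30, rule_sender = 0.20)

# AdaBoost-style weight: more accurate rules get a larger say
alpha <- 0.5 * log((1 - error_rate) / error_rate)

# The strong learner is the sign of the weighted sum of votes
score <- as.vector(votes %*% alpha)
prediction <- ifelse(score > 0, "spam", "not spam")
print(data.frame(emails, score = round(score, 2), prediction))
```

No single rule decides the outcome; the weighted sum lets a reliable rule outvote a weaker one when they disagree.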

Stacking on Numeric Data Sets

As is human nature, we always want a better prediction; some would even pray for a full 100%.

Anyway, ignoring the hypothetical, we have run a number of common models, like:

1) Logistic Regression

2) Random Forest


So now the question arises: can we give it a little push for better accuracy?

Logistic Regression, Random Forest, SVM on Numerical Data Set

So it’s been a long time. We finally have the data just how we want it.

Great, so the data is ready and we already have a bit of knowledge of logistic regression and random forest.

So, going ahead first with logistic regression:


On executing this magic line I land at an accuracy of 80%. Naaaaah, not what we wanted.

So, going ahead with random forest:

library(randomForest)
# Search for a good mtry; column 14 holds the response (income)
bestmtry <- tuneRF(training_data[,-14], as.factor(training_data[,14]),
                   ntreeTry=100, stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE, doBest=FALSE)
# Fit the forest with the chosen mtry
rf.fit <- randomForest(income ~ ., data=training_data,
                       mtry=4, ntree=1000, keep.forest=TRUE, importance=TRUE, test=x_test)

Yes!!

This finally returned 86%. It looks like we are doing great. We finally did it!

Trying out SVM now.

But wait, what is SVM (support vector machines)?

Think of all the data points plotted in a space that we can’t visualise. But imagine we had 2D data; then, in very rough terms, SVM would draw a line for us that helps clearly classify whether a data point belongs to the group earning above 50K or the group earning 50K and below.

SVM uses hyperplanes, and these are calculated in such a way that they are equidistant from the nearest points of both classes.

In SVM, the plane with the maximum margin is the good plane, and a plane with a small margin is a bad one.
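To see what “margin” means in numbers, here is a small sketch that compares two separating lines on made-up 2-D points. Both lines classify every point correctly, but one leaves far more room than the other. The points, the two candidate lines and the helper `margin` are all illustrative assumptions.

```r
# Two linearly separable groups in 2-D (made-up points)
pts <- rbind(c(1, 1), c(2, 1), c(1, 2),    # first class
             c(4, 4), c(5, 4), c(4, 5))    # second class

# Margin of the line w . x + b = 0: the smallest distance from any point to it
margin <- function(w, b) min(abs(pts %*% w + b) / sqrt(sum(w^2)))

# Two candidate lines that both separate the groups
good <- margin(w = c(1, 1), b = -5.5)   # x1 + x2 = 5.5, midway between the groups
bad  <- margin(w = c(1, 1), b = -4)     # x1 + x2 = 4, hugs the first group
print(c(good = good, bad = bad))
```

An SVM searches over all separating lines for the one with the largest such margin, instead of comparing two hand-picked candidates as done here.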

With that said, you can find the code for random forest and logistic regression here ->


and for SVM here ->

SVM code