Tag: datascience

Data – Cash Forecasting

The data we are talking about is usually highly confidential, which is one of the major reasons we will be working with dummy data.

The data we have is for a single ATM over various time periods.

Our fields are: Holiday (binary), where 1 indicates a holiday and 0 indicates a normal working day.

Up time, which records how long the ATM was operational. At times an ATM may not be functional because of power outages, network connectivity problems or physical issues.

Peak period (binary), where 1 indicates a peak period and 0 a non-peak period.

Dispensed cash, which is the amount of cash dispensed on a particular day.


In general, for a normal ATM the weekly trend is a spike in dispense at the start of the week, mostly Monday, and a drop towards the end of the week, i.e. Saturday and Sunday.

For the monthly trend, the spikes come at the beginning and end of the month, with a dip somewhere in between. This seems logical too: salaried people, for example, get their salaries at the end of the month and probably use them to plan the month ahead, while business owners may want to pay dues or salaries to their employees at the end of the month and therefore withdraw cash.

The accompanying slideshow of charts shows a few basic insights about cash dispense.

You can download the dummy data set that I have created from here
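If you want to sanity-check these trends yourself, here is a minimal R sketch, assuming the dummy data is saved as a CSV with columns named Date and Dispense (the actual column names and date format in the download may differ):

library(dplyr)

# hypothetical file name; adjust the path and date format to match the download
atm <- read.csv("atm_dummy_data.csv")
atm$Date <- as.Date(atm$Date, format = "%d-%m-%Y")

# average dispense by day of week (expect Monday high, weekend low)
atm %>%
  mutate(weekday = weekdays(Date)) %>%
  group_by(weekday) %>%
  summarise(avg_dispense = mean(Dispense, na.rm = TRUE))

# average dispense by day of month (expect spikes at month start and end)
atm %>%
  mutate(day_of_month = as.integer(format(Date, "%d"))) %>%
  group_by(day_of_month) %>%
  summarise(avg_dispense = mean(Dispense, na.rm = TRUE))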


Cash Forecasting – Understanding the Difficulties Involved

Every ATM has a different dispense trend from every other ATM. An ATM in a rural area will have a different, smaller dispense trend compared to one in a busy suburban area. That means a different model for every ATM. Wait, what?? That is far too much overhead; nobody is going to do that.

The solution to this problem is to classify ATMs based on their locality. Imagine creating various bands, where Band1 covers metropolitan areas where users dispense cash frequently and Band5 is the lowest band, where cash dispense is the lowest.

That solves a very minor problem, but dispense trends would still waver even within the same band. Hmm, problem still not solved. So we further categorise ATMs based on their age and on how much they dispense on average.

So for our case study we will only consider ATMs that are older than 6 months and on average dispense between $0 and $100,000 (a rough sketch of this filtering follows below).
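Here is a small R sketch of the banding and filtering idea, assuming a hypothetical data frame atms with one row per ATM and columns avg_dispense (average daily dispense) and age_months (months since installation); these names are made up for illustration:

# Band1 = highest average dispense, Band5 = lowest
atms$band <- cut(atms$avg_dispense,
                 breaks = quantile(atms$avg_dispense, probs = seq(0, 1, 0.2)),
                 labels = paste0("Band", 5:1),
                 include.lowest = TRUE)

# keep only ATMs older than 6 months that dispense between $0 and $100,000 on average
study_set <- subset(atms, age_months > 6 &
                      avg_dispense >= 0 & avg_dispense <= 100000)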

Cash Forecasting – Overview

How do ATMs work in general? That is a good question to ask. Banks often prefer not to manage their ATMs themselves, as it involves a lot of overhead such as transporting cash, maintaining the ATM machines, paying rent and, most importantly, providing security.

To avoid this overhead, many banks outsource the task. The companies that take over this responsibility make their revenue on every transaction: say for every non-cash transaction at an ATM they manage they get $x, and for every cash transaction they get $y, where y > x.

So why do we need to predict cash? Well, these companies rent a place, put their ATMs there, keep a service engineer to maintain each machine and provide enough security, but where they need to be careful is interest cost. What interest cost? Say for today I decide to keep $100 in my ATM. I borrow this money from a bank, to whom I pay interest every day on the cash that is not withdrawn by customers.

The obvious solution is to load ATMs with the smallest amount of money possible. However, this leads to two problems: first, loss of revenue from potential customers, and second, brand damage, and brand damage is very bad.

That means we do not want to load too much money, to avoid paying interest on idle cash, and we do not want to load too little, to avoid lost revenue and brand damage. To find this balance we need to build a forecasting model for how much money to load into each ATM so that the business stays profitable.

One underlying constraint is transportation. We cannot transport and load money into ATMs every day, because of transportation costs, so replenishment happens only once every two to three days.
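To make the idea concrete, here is one possible forecasting sketch in R using the forecast package; it is illustrative only and not necessarily the model used later in this series, and it assumes atm$Dispense holds the daily dispense amounts in date order:

library(forecast)

dispense_ts <- ts(atm$Dispense, frequency = 7)   # weekly seasonality
fit <- auto.arima(dispense_ts)

# forecast the next 3 days, since cash is loaded only once every two to three days
fc <- forecast(fit, h = 3)
plan_amount <- sum(fc$mean)                      # cash to load for the next cycle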

Decision Tree and Interpretation on Telecom Data

We saw that logistic regression was a bad model for our telecom churn analysis, which leaves us with the decision tree.

Again we have two data sets: the original data and the over-sampled data. We run the decision tree model on both of them and compare the results.

Running the decision tree on the normal data set yielded better results than running it on the over-sampled data set:

                    Accuracy   Kappa    Precision   Recall    AUC
Data                0.9482     0.7513   0.68421     0.90698   0.837
Over Sampled Data   0.8894     0.5274   0.5965      0.5862    0.7656
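For reference, the comparison above can be reproduced with something like the following sketch, assuming rpart for the tree, caret for the metrics and a target column named Churn in train/test data frames (the exact column names are in the repository linked below):

library(rpart)
library(caret)

tree_fit <- rpart(Churn ~ ., data = train, method = "class")
pred     <- predict(tree_fit, newdata = test, type = "class")

confusionMatrix(pred, test$Churn)   # accuracy, kappa, sensitivity/specificity

# repeat the same two lines with the over-sampled (SMOTE) training set
# and compare the resulting metrics, as in the table above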

Unfortunately the decision tree plot was too big to include in this post.

As the decision tree gives the highest accuracy, we select it as the clear winner for our telecom churn analysis problem.

Another major advantage of a decision tree is that it can easily be explained graphically to the end business user, showing why a particular choice is being made.

You can find the code for decision tree here->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Decision_Tree.R

This was a dummy data set and may not have yielded the best results, but it is a perfect exercise for practice.

Determining Feature Importance For Telecom Data

We have a complete data set  -> Check

Feature engineering done -> Check

How many variables do we have?    20 variables

How many should we ideally use?   Not more than 10, ideally.

How do we determine which variables to include and which to leave out?   It's simple: run Boruta!!

What's Boruta?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around random forest. You can read about it here ->  https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/   (Analytics Vidhya gives a pretty good explanation of it there).

Keep one important thing in mind: we have two train sets, 1) the normal train set and 2) the SMOTE train set.

Upon running Boruta on the normal train set, Boruta confirmed the variables International_Plan, Voice_Mail_Plan, No_Vmail_Messages, Total_Day_minutes, Total_Day_charge, Total_Eve_Minutes, Total_Eve_Charge, Total_Night_Minutes, Total_Night_Charge, Total_Intl_Minutes, Total_Intl_Calls, Total_Intl_Charge and No_CS_Calls as important.

Upon running Boruta on the SMOTE data set, Boruta confirmed all the variables as important. You can find the Boruta code below.

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Boruta_Imp_FE.R
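For reference, a typical Boruta run looks something like the sketch below, assuming a training data frame train with a factor target column named Churn; the full code with the real column names is in the repository above.

library(Boruta)

set.seed(123)
boruta_out <- Boruta(Churn ~ ., data = train, doTrace = 2)

print(boruta_out)                   # which attributes were confirmed or rejected
getSelectedAttributes(boruta_out)   # the confirmed important variables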

Churn Analysis On Telecom Data

One of the major problems telecom operators face is customer retention. Because of this, most telecom operators want to know which customers are most likely to leave them, so that they can immediately take action, such as offering a discount or a customised plan, to retain those customers.

However, the accuracy required of a churn model needs to be very high. Imagine our model has an accuracy of just 75% and only 5% of customers actually want to leave: up to 20% of customers could be wrongly classified as customers who will leave the operator. If an operator has 10,000 customers and 2,500 of them are predicted to leave, and the operator releases, say, a $1 credit to each of them, that is a cost of $2,500, whereas credits were really needed only for the 5% of customers who actually intend to leave, a cost of $500. The operator has therefore spent $2,000 for no reason. With a large customer base this quickly becomes a huge loss.
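In R, the same back-of-the-envelope arithmetic looks like this:

customers       <- 10000
flagged         <- 2500                 # customers the model predicts will leave
actual_churners <- 0.05 * customers     # customers who really intend to leave
credit_per_head <- 1                    # $1 retention credit

cost_all_flagged <- flagged * credit_per_head           # $2500
cost_needed      <- actual_churners * credit_per_head   # $500
wasted           <- cost_all_flagged - cost_needed      # $2000 spent for no reason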

Coming to the data: as far as I know there is no freely available real telecom data, however the website https://www.sgi.com/tech/mlc/db/ provides a data set for churn analysis. The data is not real, but it represents real-world scenarios and is good for understanding and learning.

The data on the website is already split into train and test sets and has no NAs, which means there is not much feature engineering to be done before running models on it.

Now comes the question of which models to run on it. Some would say that since we need very high accuracy we should run XGBoost or random forest; however, the downside is that we cannot easily explain to the operator on what basis XGBoost or random forest decides that a customer will leave. Even if we manage to explain it, it is very complicated and will not be accepted.

Because of this we have to rely on models that can be explained easily to the customer. This leaves us with two models for the classification (customer leaves -> 0, customer is retained -> 1): logistic regression and the decision tree.

Why logistic regression? Because, thanks to the logit equation, we can explain to the operator why a customer is leaving.

Why a decision tree? Because there is a neat flow of how the tree makes decisions, splitting on variables and deciding yes or no based on entropy and impurity. A small sketch of both follows.
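Here is a minimal sketch of both, assuming a train data frame with a binary target column named Churn (the actual column names are in the repository linked below):

# logistic regression: the logit coefficients make the model explainable
logit_fit <- glm(Churn ~ ., family = binomial(logit), data = train)
summary(logit_fit)     # which variables significantly drive churn
exp(coef(logit_fit))   # odds ratios, e.g. 1.3 means a one-unit increase in that
                       # variable multiplies the odds of churning by 1.3

# decision tree: the fitted tree can be drawn and walked through with the business
library(rpart)
tree_fit <- rpart(Churn ~ ., data = train, method = "class")
plot(tree_fit); text(tree_fit)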

Further on in this post category I will go from feature engineering to running the models to interpreting them.

The data available from the website is a bit awkward to save to a CSV file, so if you need it you can download the train and test data from the repository below.

An explanation of the variables is not provided, as they are fairly self-explanatory.

https://github.com/mmd52/Telecom_Churn_Analysis

Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model

machine-learning-on-uci-adult-data-set-using-various-classifier-algorithms-and-scaling-up-the-accuracy-using-extreme-gradient-boosting

Extreme Gradient Boosting

The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using the following criteria. If:

  1. The email has only one image file (a promotional image), it’s SPAM
  2. The email body contains a sentence like “You won a prize money of $ xxxxxx”, it’s SPAM
  3. The email is from a known source, it’s not SPAM

Above, we’ve defined multiple rules to classify an email as ‘spam’ or ‘not spam’. But do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners. To convert weak learners into a strong learner, we combine the prediction of each weak learner to form one definitive strong learner.
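As a minimal illustration of how boosting looks in practice, here is a small sketch using the xgboost package in R, where X is assumed to be a numeric feature matrix and y a 0/1 spam label vector (both hypothetical):

library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y)

# each boosting round adds a shallow tree (a weak learner);
# together the rounds form one strong learner
bst <- xgboost(data = dtrain,
               max_depth = 2,                 # weak, shallow trees
               eta = 0.3,                     # learning rate
               nrounds = 50,
               objective = "binary:logistic",
               verbose = 0)

pred <- predict(bst, X)                       # probability that an email is spam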


Logistic Regression, Random Forest, SVM on Numerical Data Set

So it's been a long time. We have finally got the data just the way we want it.

Great, so the data is ready and we already have a bit of knowledge of logistic regression and random forest.

Going ahead first with logistic regression:

logit.fit=glm(income~.,family=binomial(logit),data=training_data)

On executing this magic line I land at an accuracy of 80%. Naaaaah, not what we wanted.

So going ahead with random forest:

# tune mtry first, then fit the forest with the best value found (4 here);
# the held-out test set is scored separately with predict()
bestmtry <- tuneRF(training_data[,-14], as.factor(training_data[,14]),
                   ntreeTry=100, stepFactor=1.5, improve=0.01,
                   trace=TRUE, plot=TRUE, doBest=FALSE)
rf.fit <- randomForest(income ~ ., data=training_data,
                       mtry=4, ntree=1000, keep.forest=TRUE, importance=TRUE)

Yes !!

This finally returned 86%; it looks like we are doing great. We finally did it!!!!!!

Trying out SVM now.

But wait, what is SVM (a support vector machine)?

Think of all the data points plotted in a space that we can't visualise. But if we had 2D data, then in very loose terms SVM would draw lines that clearly separate whether a data point belongs to the above-50K group or the below-50K group.

SVM works with separating hyperplanes; the hyperplane is chosen so that it is equidistant from the nearest points of both classes.

In SVM, a hyperplane with maximum margin is a good hyperplane and one with minimum margin is a bad one.
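A minimal SVM sketch with the e1071 package, using the same training_data as above (the kernel and cost values here are just illustrative defaults, not tuned choices):

library(e1071)

svm.fit <- svm(as.factor(income) ~ ., data = training_data,
               kernel = "radial", cost = 1)

svm.pred <- predict(svm.fit, newdata = training_data)
mean(svm.pred == training_data$income)   # training accuracy as a quick sanity check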

With that said you can find the code for random forest and logistic regression here ->

code

and for svm here->

SVM code

 

Feature Engineering / Data Pre-Processing Code Walkthrough

# @Author Mohammed 25-12-2016
source("Libraries.r")

#Downloading adult income data set from UCI
print("=================Downloading Data=========================")

data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                  sep=",", header=F,
                  col.names=c("age", "type_employer", "fnlwgt", "education",
                              "education_num", "marital", "occupation", "relationship", "race", "sex",
                              "capital_gain", "capital_loss", "hr_per_week", "country", "income"),
                  fill=FALSE, strip.white=T)

print("=====================Data Loaded===============================")

#Data Dictionary
View(head(data,20))

#So first things first let's understand each column
#1)Age-> Like I need to explain :p (continuous).
#2)workclass-> Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
#3)fnlwgt-> continuous (have not quite understood this).
#4)education-> Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
#5)education-num->continuous (A numeric representation of education).
#6)marital-status-> Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
#7)occupation-> Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
#8)relationship-> Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
#9)race-> White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
#10)sex-> Female, Male.
#11)capital-gain-> continuous.
#12)capital-loss-> continuous.
#13)hours-per-week-> continuous.
#14)native-country-> United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
#As we can clearly see from above education num is a numeric
#representation of education so we will go ahead and delete it
data[["education_num"]]=NULL
########### Binning
#What we will do here is convert our raw numeric fields into bins
#we'll do this for the fields where it makes sense, like age
#=========================binning for Age==================
#Age<=18=child 18<Age<=30=Young Adult
#30<Age<=60=adult Age>60=Senior
ndata<-data
for (i in 1:nrow(ndata)){
  # compare against the original numeric ages in `data`, since writing the first
  # character label coerces the age column of `ndata` to character
  if(data[i,1]<=18){
    ndata[i,1]="child"
  }else if (data[i,1]>18 && data[i,1]<=30){
    ndata[i,1]="young_adult"
  }else if (data[i,1]>30 && data[i,1]<=60){
    ndata[i,1]="adult"
  }else{
    ndata[i,1]="senior"
  }
}

#=========================binning for Hours worked per week=======
#HW<=25=Part Time 25<HW<=40=Full Time
#40<HW<=60=Over_Time HW>60=Time Too Much
npdata<-ndata
for (i in 1:nrow(npdata)){
  # column 12 is hr_per_week after dropping education_num; read the numeric
  # values from `ndata` and write the bin labels into `npdata`
  if(ndata[i,12]<=25){
    npdata[i,12]="Part_Time"
  }else if (ndata[i,12]>25 && ndata[i,12]<=40){
    npdata[i,12]="Full_Time"
  }else if (ndata[i,12]>40 && ndata[i,12]<=60){
    npdata[i,12]="Over_Time"
  }else{
    npdata[i,12]="TIME_TOMUCH"
  }
}

data<-npdata

######################### Factor to Character
#Convert factor variables to character
#We do this for ease of processing
fctr.cols <- sapply(data, is.factor)
data[, fctr.cols] <- sapply(data[, fctr.cols], as.character)
######################## Missing Value Treatment

is.na(data) = data=='?'
is.na(data) = data==' ?'
data = na.omit(data)

#######################################################################
train_test<-data

#Employer type has many options, which adds complexity,
#so let's reduce them by grouping similar values with gsub()
#to bring the complexity down
train_test$type_employer = gsub("^Federal-gov","Federal-Govt",train_test$type_employer)
train_test$type_employer = gsub("^Local-gov","Other-Govt",train_test$type_employer)
train_test$type_employer = gsub("^State-gov","Other-Govt",train_test$type_employer)
train_test$type_employer = gsub("^Private","Private",train_test$type_employer)
train_test$type_employer = gsub("^Self-emp-inc","Self-Employed",train_test$type_employer)
train_test$type_employer = gsub("^Self-emp-not-inc","Self-Employed",train_test$type_employer)
train_test$type_employer = gsub("^Without-pay","Not-Working",train_test$type_employer)
train_test$type_employer = gsub("^Never-worked","Not-Working",train_test$type_employer)

#Similarly here
train_test$occupation = gsub("^Adm-clerical","Admin",train_test$occupation)
train_test$occupation = gsub("^Armed-Forces","Military",train_test$occupation)
train_test$occupation = gsub("^Craft-repair","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Exec-managerial","White-Collar",train_test$occupation)
train_test$occupation = gsub("^Farming-fishing","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Handlers-cleaners","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Machine-op-inspct","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Other-service","Service",train_test$occupation)
train_test$occupation = gsub("^Priv-house-serv","Service",train_test$occupation)
train_test$occupation = gsub("^Prof-specialty","Professional",train_test$occupation)
train_test$occupation = gsub("^Protective-serv","Other-Occupations",train_test$occupation)
train_test$occupation = gsub("^Sales","Sales",train_test$occupation)
train_test$occupation = gsub("^Tech-support","Other-Occupations",train_test$occupation)
train_test$occupation = gsub("^Transport-moving","Blue-Collar",train_test$occupation)

#You're right, country too
train_test$country[train_test$country=="Cambodia"] = "SE-Asia"
train_test$country[train_test$country=="Canada"] = "British-Commonwealth"
train_test$country[train_test$country=="China"] = "China"
train_test$country[train_test$country=="Columbia"] = "South-America"
train_test$country[train_test$country=="Cuba"] = "Other"
train_test$country[train_test$country=="Dominican-Republic"] = "Latin-America"
train_test$country[train_test$country=="Ecuador"] = "South-America"
train_test$country[train_test$country=="El-Salvador"] = "South-America"
train_test$country[train_test$country=="England"] = "British-Commonwealth"
train_test$country[train_test$country=="France"] = "Euro_1"
train_test$country[train_test$country=="Germany"] = "Euro_1"
train_test$country[train_test$country=="Greece"] = "Euro_2"
train_test$country[train_test$country=="Guatemala"] = "Latin-America"
train_test$country[train_test$country=="Haiti"] = "Latin-America"
train_test$country[train_test$country=="Holand-Netherlands"] = "Euro_1"
train_test$country[train_test$country=="Honduras"] = "Latin-America"
train_test$country[train_test$country=="Hong"] = "China"
train_test$country[train_test$country=="Hungary"] = "Euro_2"
train_test$country[train_test$country=="India"] = "British-Commonwealth"
train_test$country[train_test$country=="Iran"] = "Other"
train_test$country[train_test$country=="Ireland"] = "British-Commonwealth"
train_test$country[train_test$country=="Italy"] = "Euro_1"
train_test$country[train_test$country=="Jamaica"] = "Latin-America"
train_test$country[train_test$country=="Japan"] = "Other"
train_test$country[train_test$country=="Laos"] = "SE-Asia"
train_test$country[train_test$country=="Mexico"] = "Latin-America"
train_test$country[train_test$country=="Nicaragua"] = "Latin-America"
train_test$country[train_test$country=="Outlying-US(Guam-USVI-etc)"] = "Latin-America"
train_test$country[train_test$country=="Peru"] = "South-America"
train_test$country[train_test$country=="Philippines"] = "SE-Asia"
train_test$country[train_test$country=="Poland"] = "Euro_2"
train_test$country[train_test$country=="Portugal"] = "Euro_2"
train_test$country[train_test$country=="Puerto-Rico"] = "Latin-America"
train_test$country[train_test$country=="Scotland"] = "British-Commonwealth"
train_test$country[train_test$country=="South"] = "Euro_2"
train_test$country[train_test$country=="Taiwan"] = "China"
train_test$country[train_test$country=="Thailand"] = "SE-Asia"
train_test$country[train_test$country=="Trinadad&Tobago"] = "Latin-America"
train_test$country[train_test$country=="United-States"] = "United-States"
train_test$country[train_test$country=="Vietnam"] = "SE-Asia"
train_test$country[train_test$country=="Yugoslavia"] = "Euro_2"
#Education is most important
train_test$education = gsub("^10th","Dropout",train_test$education)
train_test$education = gsub("^11th","Dropout",train_test$education)
train_test$education = gsub("^12th","Dropout",train_test$education)
train_test$education = gsub("^1st-4th","Dropout",train_test$education)
train_test$education = gsub("^5th-6th","Dropout",train_test$education)
train_test$education = gsub("^7th-8th","Dropout",train_test$education)
train_test$education = gsub("^9th","Dropout",train_test$education)
train_test$education = gsub("^Assoc-acdm","Associates",train_test$education)
train_test$education = gsub("^Assoc-voc","Associates",train_test$education)
train_test$education = gsub("^Bachelors","Bachelors",train_test$education)
train_test$education = gsub("^Doctorate","Doctorate",train_test$education)
train_test$education = gsub("^HS-grad","HS-Graduate",train_test$education)
train_test$education = gsub("^Masters","Masters",train_test$education)
train_test$education = gsub("^Preschool","Dropout",train_test$education)
train_test$education = gsub("^Prof-school","Prof-School",train_test$education)
train_test$education = gsub("^Some-college","HS-Graduate",train_test$education)

# Similarly marital
train_test$marital[train_test$marital=="Never-married"] = "Never-Married"
train_test$marital[train_test$marital=="Married-AF-spouse"] = "Married"
train_test$marital[train_test$marital=="Married-civ-spouse"] = "Married"
train_test$marital[train_test$marital=="Married-spouse-absent"] = "Not-Married"
train_test$marital[train_test$marital=="Separated"] = "Not-Married"
train_test$marital[train_test$marital=="Divorced"] = "Not-Married"
train_test$marital[train_test$marital=="Widowed"] = "Widowed"

#Leaving race behind is racist no? 😛
train_test$race[train_test$race=="White"] = "White"
train_test$race[train_test$race=="Black"] = "Black"
train_test$race[train_test$race=="Amer-Indian-Eskimo"] = "Amer-Indian"
train_test$race[train_test$race=="Asian-Pac-Islander"] = "Asian"
train_test$race[train_test$race=="Other"] = "Other"

#Getting income below or above 50K to High and low
train_test$income[train_test$income==">50K"]="High"
train_test$income[train_test$income=="<=50K"]="Low"

######################################################################
write.csv(x=train_test,file="ADULT_USI_FE_CATEGORICAL.csv")

######################################################################
#Doing Label Encoding
#This converts all categorical things to numeric
features = names(train_test[,-14])
for (f in features) {
  if (class(train_test[[f]])=="character") {
    #cat("VARIABLE : ",f,"\n")
    levels <- unique(train_test[[f]])
    train_test[[f]] <- as.numeric(as.integer(factor(train_test[[f]], levels=levels)))
  }
}
write.csv(x=train_test,file="ADULT_USI_FE_Numerical.csv")
print("=====================Loading datasets complete.=================")

 

You can find this project at->

https://github.com/mmd52/UCI_ADULT_DATSET_CATEGORICAL_PROJECT