
Running Various Models on the Pima Indian Diabetes Data Set

With EDA done and various inferences drawn, we will now run several models and verify whether their predictions match those inferences.

As I mentioned in the previous post, my focus is on the code and the inference, which you can find in the Python notebooks and R files.

R

Model               | Accuracy (%) | Precision (%) | Recall (%) | Kappa  | AUC
--------------------|--------------|---------------|------------|--------|-------
Decision Tree       | 73.48        | 75.33         | 82.48      | 0.4368 | 0.727
Naïve Bayes         | 75.22        | 82            | 80.39      | 0.4489 | 0.723
KNN                 | 73.91        | 86.67         | 76.47      | 0.3894 | 0.683
Logistic Regression | 76.09        | 82.67         | 81.05      | 0.4683 | 0.732
SVM Simple          | 73.91        | 86.67         | 76.47      | 0.3894 | 0.683
SVM 10 Folds        | 73.04        | 82.67         | 77.5       | 0.388  | 0.6883
SVM Linear 10 Folds | 78.26        | 88.67         | 80.12      | 0.4974 | 0.7371
Random Forest       | 76.52        | 84            | 80.77      | 0.4733 | 0.733
XGBoost             | 77.83        | 91.61         | 77.06      | 0.4981 | 0.843
Python

Model               | Accuracy (%) | Precision (%) | Recall (%) | Kappa  | AUC
--------------------|--------------|---------------|------------|--------|------
Decision Tree       | 72.73        | 73            | 73         | 0.388  | 0.70
Naïve Bayes         | 80.51        | 80            | 81         | 0.5689 | 0.78
KNN                 | 70.99        | 70            | 71         | 0.337  | 0.66
Logistic Regression | 74.45        | 74            | 74         | 0.3956 | 0.68
SVM Simple          | 73.16        | 73            | 73         | 0.4007 | 0.69
Random Forest       | 76.62        | 77            | 77         | 0.48   | 0.73
XGBoost             | 79.22        | 79            | 79         | 0.526  | 0.76

As we can see from the tables above, XGBoost was the clear winner in R; in Python it was a close second to Naïve Bayes, which edged it out on accuracy, kappa and AUC.
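
For reference, here is a minimal R sketch of how metrics like those in the tables can be computed. It assumes three objects that are not shown in this post: a factor of predicted classes pred, a vector of predicted probabilities prob, and a factor of true labels truth from any of the models above.

library(caret)   # confusionMatrix reports accuracy, kappa, precision, recall
library(pROC)    # roc/auc for the AUC column

cm <- confusionMatrix(pred, truth, positive = "1")
cm$overall[c("Accuracy", "Kappa")]    # accuracy and Cohen's kappa
cm$byClass[c("Precision", "Recall")]  # class-1 precision and recall

# AUC needs the predicted probabilities, not the hard class labels
auc(roc(truth, prob))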

You can find the Python code at -> https://github.com/mmd52/Pima_Python

You can find the R code at -> https://github.com/mmd52/Pima_R


Models on the UCI PIMA Data Set

The idea behind using this data set from the UCI repository is not just to run models, but to derive inferences that match the real world.

This makes our predictions all the more sensible and robust, especially when we have understood the data set and derived correct inferences from it that match our predictions.

Our approach to this data set will be to perform the following:

  1. Exploratory data analysis, deriving inferences along the way
  2. Techniques like PCA and checking the correlation between variables (see the sketch below)
  3. Running various models and drawing inferences from the predictions

We will do all of this in both R and Python.
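
As a quick preview of step 2, here is a minimal R sketch; it assumes the eight predictors are collected in a data frame X (the name is an assumption, not from the repository).

round(cor(X), 2)      # pairwise correlations between predictors

# PCA on standardised columns
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)          # proportion of variance explained by each component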

The data is provided by UCI ->

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Let us first understand the problem, and what better way to explain it than a short video, which you can view here -> https://youtu.be/pN4HqWRybwk

From the video we understand that the Pima Indian tribe carries a gene that is aggravated by eating food high in sugar. The UCI Pima Indian data set is a collection of data on females from the Pima tribe. Of the 768 rows in the data set, 268 have diabetes.

You can find the data set description here -> https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names

The problem statement is to correctly classify and predict whether a female has diabetes or not. Thus it is a classification problem.

Good news for us: the data set has no null or missing values, and the cherry on top is that it is completely numeric. Only the target variable Outcome and Pregnancies are factor variables; the remaining variables are continuous numeric variables.
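
For reference, a minimal R sketch of loading the data and confirming the above; the file URL and the column names are assumptions based on the data set description page, not something shown in this post.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
pima <- read.csv(url, header = FALSE)
names(pima) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigree", "Age", "Outcome")

sum(is.na(pima))      # 0 -> no missing values anywhere
table(pima$Outcome)   # should show 268 diabetic cases out of 768 rows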

Logistic Regression And Interpretation On Telecom Data

If you have read my previous posts, you will have understood how the feature engineering was done and why we are running a logistic regression on this data.

It is essential to understand that we have two train sets:

  1. The original train set
  2. The over sampled train set

Running logistic regression on the normal data set yielded the following results:

#Accuracy  87.53%
#Kappa     0.275
#Precision 0.22807
#Recall    0.59091
#AUC       0.602


Now, running logistic regression on the oversampled data yielded the following results:

#Accuracy  84.71%
#Kappa     0.3265
#Precision 0.40351
#Recall    0.42593
#AUC       0.660


Comparing the two models on AUC, the one trained on the oversampled data is clearly the winner. We will also rely more on the second model because its kappa value is higher and its precision and recall are closer together.

One massive problem, which the null deviance hints at, is that the accuracy of our best model is 84.71%, while the accuracy of building no model at all and always predicting "customer retained" is 85.8%. That means our model is not as effective as we might think, so we should either try more feature engineering or a different model.

As this data is artificial, it may be that our accuracy will always be poor. But let us assume logistic regression yielded a good result and try to understand the equation.

Coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) 8.318e+00 1.078e+00 7.713 1.23e-14 ***
StateAL 1.767e-02 5.858e-01 0.030 0.975929 
StateAR -6.454e-01 6.095e-01 -1.059 0.289611 
StateAZ 6.812e-01 6.998e-01 0.973 0.330355 
StateCA -2.015e+00 5.959e-01 -3.382 0.000719 ***
StateCO -5.204e-01 5.958e-01 -0.873 0.382438 
StateCT -9.923e-01 5.768e-01 -1.720 0.085364 . 
StateDC -7.558e-01 6.655e-01 -1.136 0.256046 
StateDE -8.036e-01 5.780e-01 -1.390 0.164413 
StateFL -5.576e-01 5.832e-01 -0.956 0.339026 
StateGA -8.233e-01 5.512e-01 -1.494 0.135263 
StateHI 3.252e-01 7.279e-01 0.447 0.655050 
StateIA 2.778e-02 7.161e-01 0.039 0.969051 
StateID -4.276e-01 5.647e-01 -0.757 0.448895 
StateIL -1.270e+00 5.720e-01 -2.220 0.026441 * 
StateIN -7.517e-01 5.884e-01 -1.278 0.201372 
StateKS -1.164e+00 5.425e-01 -2.145 0.031918 * 
StateKY -7.379e-01 6.068e-01 -1.216 0.223949 
StateLA -1.080e+00 5.963e-01 -1.811 0.070173 . 
StateMA -1.541e+00 5.610e-01 -2.746 0.006032 ** 
StateMD -1.164e+00 5.565e-01 -2.092 0.036455 * 
StateME -1.915e+00 5.471e-01 -3.500 0.000465 ***
StateMI -1.501e+00 5.746e-01 -2.612 0.009011 ** 
StateMN -8.528e-01 5.486e-01 -1.555 0.120064 
StateMO 2.519e-01 6.291e-01 0.400 0.688826 
StateMS -1.467e+00 5.614e-01 -2.613 0.008987 ** 
StateMT -1.447e+00 5.473e-01 -2.644 0.008181 ** 
StateNC -8.929e-01 5.817e-01 -1.535 0.124820 
StateND -6.750e-01 6.037e-01 -1.118 0.263526 
StateNE -6.011e-01 5.911e-01 -1.017 0.309221 
StateNH -8.939e-01 6.064e-01 -1.474 0.140435 
StateNJ -1.738e+00 5.556e-01 -3.128 0.001761 ** 
StateNM -1.151e+00 5.471e-01 -2.104 0.035366 * 
StateNV -1.757e+00 5.525e-01 -3.180 0.001473 ** 
StateNY -1.080e+00 5.650e-01 -1.912 0.055908 . 
StateOH -5.434e-01 5.577e-01 -0.974 0.329891 
StateOK -1.484e+00 5.837e-01 -2.543 0.011001 * 
StateOR -4.159e-01 5.561e-01 -0.748 0.454572 
StatePA -8.248e-01 6.262e-01 -1.317 0.187836 
StateRI 4.828e-01 6.553e-01 0.737 0.461271 
StateSC -1.327e+00 5.734e-01 -2.313 0.020695 * 
StateSD -1.419e+00 5.936e-01 -2.390 0.016838 * 
StateTN -2.747e-01 5.931e-01 -0.463 0.643201 
StateTX -2.148e+00 5.466e-01 -3.929 8.53e-05 ***
StateUT -7.398e-01 5.785e-01 -1.279 0.200914 
StateVA 7.518e-01 6.311e-01 1.191 0.233547 
StateVT -4.988e-01 5.869e-01 -0.850 0.395327 
StateWA -1.369e+00 5.698e-01 -2.402 0.016308 * 
StateWI -2.333e-01 5.906e-01 -0.395 0.692830 
StateWV -4.497e-01 5.560e-01 -0.809 0.418600 
StateWY -1.921e-01 5.780e-01 -0.332 0.739637 
Account_Length -1.719e-03 1.189e-03 -1.446 0.148198 
Area_Code 1.860e-03 1.085e-03 1.714 0.086489 . 
Phone_No -1.627e-07 1.687e-07 -0.964 0.334881 
International_Plan yes -2.516e+00 1.206e-01 -20.858 < 2e-16 ***
Voice_Mail_Plan yes -1.028e-01 1.447e-01 -0.710 0.477407 
No_Vmail_Messages -2.941e-03 5.303e-03 -0.555 0.579144 
Total_Day_minutes -4.437e+00 2.775e+00 -1.599 0.109815 
Total_Day_Calls 3.982e-05 2.389e-03 0.017 0.986701 
Total_Day_charge 2.603e+01 1.632e+01 1.595 0.110808 
Total_Eve_Minutes -1.862e+00 1.418e+00 -1.313 0.189311 
Total_Eve_Calls -4.211e-03 2.379e-03 -1.770 0.076674 . 
Total_Eve_Charge 2.182e+01 1.668e+01 1.308 0.190938 
Total_Night_Minutes 9.630e-01 7.453e-01 1.292 0.196293 
Total_Night_Calls -6.086e-04 2.392e-03 -0.254 0.799175 
Total_Night_Charge -2.143e+01 1.656e+01 -1.294 0.195715 
Total_Intl_Minutes 2.219e+00 4.482e+00 0.495 0.620579 
Total_Intl_Calls 1.075e-01 2.053e-02 5.233 1.67e-07 ***
Total_Intl_Charge -8.763e+00 1.660e+01 -0.528 0.597585 
No_CS_Calls -5.475e-01 3.540e-02 -15.466 < 2e-16 ***
---

Can't read it? Well, suppose you just built this model and your boss calls up and asks: there is a customer, his state is NV, his total calls, charges and durations are xyz. Will he leave the telecom operator? If yes, please explain.

What would you say? It's easy: you look at the above table and start. Every factor your boss gave you fits into the equation, and you can quantitatively justify your answer, all thanks to the historical data.
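
For instance, here is a minimal R sketch of plugging such a customer into the logit equation. The input values are invented for illustration, only a few of the coefficients above are used, and in practice you would let predict(model, newdata, type = "response") do all of this for you.

# log-odds for a hypothetical NV customer with an international plan
# and three customer-service calls (all other terms omitted for brevity)
eta <- 8.318 +          # intercept
  (-1.757) +            # StateNV
  (-2.516) * 1 +        # International_Plan = yes
  (-0.5475) * 3         # No_CS_Calls = 3

plogis(eta)             # 1 / (1 + exp(-eta)): the predicted probability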

For simplicity, let's consider the equation

y = 45 + 60*(age)

where y = salary and 45 = the intercept.

How would you interpret this equation? Obviously you would say that as age increases, so does salary. Right?

However, think again, and think hard this time: what if I told you age is 0? Now explain it to me. I'm sure you see that a newborn cannot have a salary of $45 without doing anything. This is where business understanding, or domain knowledge, comes into play.

We should usually avoid interpreting the intercept unless business understanding helps us explain it. This is a grey area, so it is better to avoid explaining it than to make a mess of it.

However, imagine the same kind of equation for a packet of wafers:

y = 45 + 0.1*(weight)

Here we could simply say that the base weight of a packet of wafers is 45 g; since that is not always exact, the variation around it is captured by the coefficient term.

That is why the intercept can be explained in some cases and not in others.
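
To make this concrete, a toy R sketch with simulated data (every number here is invented purely for illustration):

set.seed(1)
age    <- sample(22:60, 100, replace = TRUE)
salary <- 45 + 60 * age + rnorm(100, sd = 50)   # simulated, not real data

fit <- lm(salary ~ age)
coef(fit)   # intercept near 45, slope near 60; here only the slope
            # has a sensible real-world reading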

You can find the code for logistic regression Here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Logistic_Regression.R

Determining Feature Importance For Telecom Data

We have a complete data set  -> Check

Feature engineering done -> Check

How many variables do we have? 20 variables.

How many should we ideally use? Ideally not more than 10.

How do we determine which variables to include and which to exclude? It's simple: run Boruta!

What's Boruta?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around random forest. You can read about it here -> https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/ Analytics Vidhya has given a pretty good explanation of it there.

Now keep one important thing in mind: we have two train sets, 1) the normal train set and 2) the SMOTE train set.

So upon running Boruta on the normal train set, Boruta confirmed the variables International_Plan, Voice_Mail_Plan, No_Vmail_Messages, Total_Day_minutes, Total_Day_charge, Total_Eve_Minutes, Total_Eve_Charge, Total_Night_Minutes, Total_Night_Charge, Total_Intl_Minutes, Total_Intl_Calls, Total_Intl_Charge and No_CS_Calls as important.

And upon running Boruta on the SMOTE data set, Boruta confirmed all the variables as important. You can find the Boruta code below:

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Boruta_Imp_FE.R
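
For reference, a minimal sketch of the Boruta call itself, assuming the train data frame from the feature engineering step:

library(Boruta)

set.seed(123)
b <- Boruta(Customer_Left ~ ., data = train)
print(b)                                         # confirmed / tentative / rejected counts
getSelectedAttributes(b, withTentative = FALSE)  # the variables confirmed important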

Feature Engineering On Telecom Data

Although the telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values, there is a landslide of class imbalance.

That is why the only thing we will concentrate on in our feature engineering is eliminating the class imbalance.

> summary(train$Customer_Left)
False True 
 2850 483

It is visible that our training set has 2850 retained customers and 483 customers who left. Because of this, I will oversample the customers who left to balance the data set.

Suppose I did not oversample: then, without building any model at all, I could simply always predict "customer retained" and still be right 85.8% of the time. To break this bias I use a package implementing SMOTE (Synthetic Minority Oversampling Technique); you can read the research paper published in the Journal of Artificial Intelligence Research 16 (2002) here -> https://www.jair.org/media/953/live-953-2037-jair.pdf

> train$Customer_Left<-as.numeric(train$Customer_Left)
> summary(as.factor(train$Customer_Left))
 1 2 
 483 2850 
> train$Customer_Left[train$Customer_Left==2]<-0
> summary(as.factor(train$Customer_Left))
 0 1 
2850 483  
# here False (retained) -> 0, True (left) -> 1
> train$Customer_Left<-as.factor(train$Customer_Left)
> ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
> ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
> summary(ntrain$Customer_Left)
 0 1 
1932 1449

Now we have undersampled retained customers from 2850 to 1932 and oversampled customers who left from 483 to 1449. With perc.over=200, SMOTE creates two synthetic cases per original minority case (483 + 966 = 1449), and the default perc.under=200 keeps 200% of those 966 new cases, i.e. 1932 majority cases.

Since the train set has been manipulated, I also had to manipulate the test set accordingly.

You can see the complete code here ->  https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R

NOTE -> The libararies.R file contains code that loads the required packages, downloading and installing any that are missing from your machine.

Churn Analysis On Telecom Data

One of the major problems telecom operators face is customer retention. That is why most operators want to know which customers are most likely to leave them, so that they can immediately take action, such as offering a discount or a customised plan, to retain those customers.

However, the accuracy required of a churn analysis model is very high. Imagine our model has an accuracy of just 75% and only 5% of customers actually want to leave: that leaves a margin of 20% of customers wrongly classified as likely to leave. If an operator has 10000 customers and 2500 are predicted to leave, the operator might release, say, a $1 credit to each of them, at a cost of $2500, whereas credits were really only needed for the 5% who would actually leave, a cost of $500. The operator has thus spent $2000 for no reason, and with a large customer base this adds up to a huge loss.

Coming to the data: as far as I know there is no freely available real telecom data, but the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis. This data is not real, but it represents real-world scenarios and is good for understanding and learning.

The data on the website is already split into train and test and has no NAs, meaning there is no feature engineering as such to be done before running models on it.

Now comes the question of which models to run. Some would say that since we need very high accuracy we should run XGBoost or random forest. The downside is that we cannot easily explain to the operator on what basis XGBoost or random forest decides why a customer will leave; even if we managed to explain it, the explanation would be complicated and would not be accepted.

Because of this we have to rely on models that can be easily explained to the customer. This leaves us with two models for our classification task (customer leaves -> 0, customer retained -> 1): logistic regression and the decision tree.

Why logistic regression? Because, thanks to the logit equation, we can explain to the operator why a customer is leaving.

Why a decision tree? Because there is a neat flow to how the tree makes decisions, splitting on variables and deciding yes or no based on entropy and impurity.
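
For illustration, a minimal rpart sketch of such a tree, assuming the same train data frame used later in this series (rpart splits on Gini impurity by default; the parms argument switches it to entropy):

library(rpart)

tree <- rpart(Customer_Left ~ ., data = train, method = "class",
              parms = list(split = "information"))  # entropy-based splits

printcp(tree)            # cross-validated error at each tree size, for pruning
plot(tree); text(tree)   # the yes/no flow we can show the operator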

Further posts in this category will cover everything from feature engineering to running the models to interpretation.

The data available from the website is a bit awkward to save to a CSV file, so if you need it you can download the train and test data from below.

Also, an explanation of the variables is not provided, as they are fairly self-explanatory.

https://github.com/mmd52/Telecom_Churn_Analysis

Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model

machine-learning-on-uci-adult-data-set-using-various-classifier-algorithms-and-scaling-up-the-accuracy-using-extreme-gradient-boosting

Revised Approach To UCI ADULT DATA SET

If you have seen the posts in the UCI Adult data set section, you may have noticed I was not getting above 86% accuracy.

An important thing I learnt the hard way was to never eliminate rows in a data set. It is fine to eliminate columns with more than 30% NA values, but never eliminate rows.

Because of this I had to redo my feature engineering. So how did I fix the missing NA values? I opened the data set in Excel and converted all '?' values to 'NA'.

This makes the feature engineering simpler. The next step is to identify the columns with missing values and check whether their missing values exceed 30% in totality.

In our case, type_employer had 1836 missing values, occupation a further 1843, and country 583.

So what I did was predict the missing values with the help of the other independent variables (no, I did not use income to predict them). Once the model was built, I used it to replace the missing values in each column. Thus I had a clean data set with no missing values.

I admit the predictions were not that great, but they were tolerable.
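
A minimal sketch of this imputation idea for one column, assuming the data sits in a data frame adult (name assumed) with the '?' values already converted to NA, and with income deliberately excluded from the predictors:

library(rpart)

known   <- adult[!is.na(adult$occupation), ]
unknown <- adult[ is.na(adult$occupation), ]

# predict occupation from the other independents, excluding income
imp <- rpart(occupation ~ . - income, data = known, method = "class")
adult$occupation[is.na(adult$occupation)] <- predict(imp, unknown, type = "class")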

Because of this, when I ran the following models my accuracy skyrocketed:

  1. Logistic Regression -> 85.38%
  2. Random Forest (excluding the variable country) -> 87.11%
  3. SVM -> 85.8%
  4. XGBoost with 10 folds -> 87.08%


Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyper-plane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyper-plane which categorises new examples.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

When data are not labelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups and then maps new data to these formed groups. The clustering algorithm that provides an improvement to support vector machines is called support vector clustering and is often used in industrial applications, either when data are not labelled or when only some data are labelled, as a preprocessing step for a classification pass.

SVM can make some amazing predictions. For example, when you use the tune function you specify a range of cost and epsilon values, and tune automatically picks the SVM model with the least possible error for us.
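
For example, a minimal sketch with e1071's tune; the grid values here are placeholders, the data frame train and target y are assumptions, and note that for classification the usual knobs are cost and gamma (epsilon applies to SVM regression):

library(e1071)

tuned <- tune(svm, y ~ ., data = train,
              ranges = list(cost = 2^(0:4), gamma = 2^(-3:0)))

summary(tuned)            # cross-validated error for every parameter combination
best <- tuned$best.model  # the SVM refit with the best parameters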

Support vector machine algorithms can be very computationally intensive, and in our case they are, given the large number of data rows. It took my machine 10 hours to fit the model completely.

 

Extreme Gradient Boosting

The term 'boosting' refers to a family of algorithms that convert weak learners into strong learners.

How would you classify an email as spam or not? Like everyone else, our initial approach would be to identify 'spam' and 'not spam' emails using the following criteria. If:

  1. The email has only one image file (a promotional image): it's SPAM
  2. The email body contains a sentence like "You won a prize money of $ xxxxxx": it's SPAM
  3. The email is from a known source: not SPAM

Above, we've defined multiple rules to classify an email as 'spam' or 'not spam'. But do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as 'spam' or 'not spam'; therefore they are called weak learners. To convert weak learners into a strong learner, we combine the prediction of each weak learner to form one definitive strong learner.
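
As a code-level sketch of this idea, here is a minimal xgboost call in R, where each boosting round fits a shallow tree (a weak learner) to the shortcomings of the ensemble so far. X and y are assumptions: a numeric feature matrix and 0/1 labels.

library(xgboost)

bst <- xgboost(data = as.matrix(X), label = y,
               max_depth = 2,                  # shallow trees = weak learners
               eta = 0.1,                      # shrink each learner's contribution
               nrounds = 100,                  # how many weak learners to combine
               objective = "binary:logistic",
               verbose = 0)

head(predict(bst, as.matrix(X)))               # the ensemble's combined prediction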
