Tag: dataset

R Shiny Apps for Time Series

Like the name suggests: shiny!

Shiny is a new package from RStudio that makes it incredibly easy to build interactive web applications with R. For an introduction and live examples, visit the Shiny homepage.

Why Shiny? Well, for starters it's free and simple to use and deploy. If you are planning to use Shiny commercially, you will have to pay for hosting your apps, but otherwise it's free, and you can easily deploy your Shiny apps online at https://www.shinyapps.io

So why is Shiny useful to us? It can be used to build an interactive dashboard, or to let a person interact with R through a GUI with no coding involved for the end user. Shiny apps are also highly dynamic and can be customised so the end user can tweak settings as they like.

My problem: for several time series data sets I kept repeating the same few checks. Is the data stationary? What does the data look like? Does it require a transformation? And, most importantly from a lazy man's perspective, will auto.arima do the trick? 😛

I decided to automate this manual task via Shiny and demonstrate a small example. So what does my Shiny app do?

  1. It accepts single-column input from any text file that you feed in
  2. It asks the user whether the file has a header
  3. It asks for the start year, start month, and frequency of the data
  4. Using this information it plots the PACF and ACF graphs
  5. It also runs auto.arima and plots the plain time series, to give an understanding of the data
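To make the ACF step a little more concrete, here is a minimal sketch in plain Python of what a sample autocorrelation function computes. This is not the code of the app (which is written in R); the function name `sample_acf` and the example values are made up for illustration.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Lag-0 sum of squares; dividing by it makes r_0 equal to 1
    denom = sum(c * c for c in centered)
    acf = []
    for lag in range(max_lag + 1):
        # Sum of products of each centered value with the value `lag` steps earlier
        num = sum(centered[t] * centered[t - lag] for t in range(lag, n))
        acf.append(num / denom)
    return acf


# Hypothetical series that repeats every 4 steps (illustration only)
values = [10, 12, 14, 12, 10, 12, 14, 12, 10, 12, 14, 12]
acf_values = sample_acf(values, max_lag=4)
for lag, r in enumerate(acf_values):
    print(f"lag {lag}: {r:.3f}")
```

A series with a repeating pattern shows a positive spike at the lag of the repeat, which is the kind of thing one looks for in the ACF plot.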

This is a small example and hence it is simple; we could build far more complicated things. Still, for anyone working with time series, this app saves precious time otherwise spent on repetitive work.

To run a Shiny app we need to code two files, one for the UI and one for the back-end processing, i.e. ui.R and server.R

Both of these files begin by loading the package:

library(shiny)

Shiny offers an extensive tutorial at -> https://shiny.rstudio.com/

You can view my app at -> https://mohammedtopiwalla.shinyapps.io/arima_shiny/

My code is relatively simple too; you can view it on GitHub at -> https://github.com/mmd52/Arima_Shiny

To run the app you will need a text file with time series data; you can download a sample from -> https://github.com/mmd52/Arima_Shiny/blob/master/data.txt

 


Running Various Models on Pima Indian Diabetes Data Set

The EDA is done and various inferences were found; now we will run various models and verify whether the predictions match the inferences.

As I mentioned in the previous post, my focus is on the code and inferences, which you can find in the Python notebooks or R files.

R

Model                 Accuracy  Precision  Recall  Kappa   AUC
Decision Tree         73.48     75.33      82.48   0.4368  0.727
Naïve Bayes           75.22     82         80.39   0.4489  0.723
KNN                   73.91     86.67      76.47   0.3894  0.683
Logistic Regression   76.09     82.67      81.05   0.4683  0.732
SVM Simple            73.91     86.67      76.47   0.3894  0.683
SVM 10 Folds          73.04     82.67      77.5    0.388   0.6883
SVM Linear 10 Folds   78.26     88.67      80.12   0.4974  0.7371
Random Forest         76.52     84         80.77   0.4733  0.733
XGBOOST               77.83     91.61      77.06   0.4981  0.843

Python

Model                 Accuracy  Precision  Recall  Kappa   AUC
Decision Tree         72.73     73         73      0.388   0.7
Naïve Bayes           80.51     80         81      0.5689  0.78
KNN                   70.99     70         71      0.337   0.66
Logistic Regression   74.45     74         74      0.3956  0.68
SVM Simple            73.16     73         73      0.4007  0.69
Random Forest         76.62     77         77      0.48    0.73
XGBOOST               79.22     79         79      0.526   0.76

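As we can see from the above tables, XGBOOST came out on top in R on Kappa and AUC, with SVM Linear 10 Folds slightly ahead on accuracy. In Python, Naïve Bayes scored highest on every metric, with XGBOOST second.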
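For readers who want to see how the columns of such a table relate to each other, here is a minimal sketch that derives accuracy, precision, recall and Cohen's Kappa from the four cells of a binary confusion matrix. The counts are hypothetical and are not taken from any model in this post; AUC is left out because it needs the predicted scores rather than just the counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and Cohen's Kappa from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "kappa": kappa,
    }


# Hypothetical confusion matrix counts (illustration only)
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is useful next to accuracy because it discounts the agreement a classifier would get by chance alone.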

You can find the Python code at -> https://github.com/mmd52/Pima_Python

You can find the R code at -> https://github.com/mmd52/Pima_R

Exploratory Data Analysis

We have a classification problem. Our data set has a total of 8 independent variables, of which one is a factor and 7 are continuous. This means we should have at least 8 plots.

The target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned.

If we need to plot two factor variables, we should preferably use a stacked bar chart or a mosaic plot.

For one numeric and one factor variable, bar plots seem like a good option.

And for two numeric variables we have our faithful scatter plot to the rescue.
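The rule of thumb above can be written down as a tiny lookup. This is only a sketch of the reasoning, not code from the EDA files; the function name `suggest_plot` is made up, and `Glucose` is an assumed column name used for illustration.

```python
def suggest_plot(x_kind, y_kind):
    """Suggest a plot type for a pair of variables, each 'factor' or 'numeric'."""
    kinds = {x_kind, y_kind}
    if not kinds <= {"factor", "numeric"}:
        raise ValueError("each kind must be 'factor' or 'numeric'")
    if kinds == {"factor"}:
        return "stacked bar chart or mosaic plot"
    if kinds == {"numeric"}:
        return "scatter plot"
    # One factor and one numeric variable
    return "bar plot"


# Variable kinds; 'Glucose' is an assumed column name (illustration only)
variable_kinds = {"Outcome": "factor", "Pregnancies": "factor", "Glucose": "numeric"}
for name in ("Pregnancies", "Glucose"):
    plot = suggest_plot(variable_kinds[name], variable_kinds["Outcome"])
    print(f"{name} vs Outcome: {plot}")
```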

In this blog post I will not dwell much on words but rather on the code and the inferences made, which are explained and documented in my code.

I strongly suggest you view the code below, which has the inferences and a well documented structure.

You can download the data from

DATA -> https://github.com/mmd52/Pima_R (the file named diabetes.csv is the one)

R Code -> https://github.com/mmd52/Pima_R/blob/master/EDA.R (A fair warning: to execute the EDA code in R you will first need to execute https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code -> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (it's a Jupyter Notebook)

Models on UCI PIMA DataSet

The idea behind using this data set from the UCI repository is not just running models, but deriving inferences that match the real world.

This makes the predictions we make all the more sensible and strong, especially when we have understood the data set and derived correct inferences from it that match our predictions.

Our approach to this data set will be to perform the following

  1. Exploratory data analysis while deriving inferences from it
  2. Using techniques like PCA and checking the correlation between variables
  3. Running various models and making inferences from the predictions

We will do all of this in R and in Python.

The data is provided by UCI ->

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Sc

Let us begin by understanding the problem, and what better way to explain the problem than a short video, which you can view here -> https://youtu.be/pN4HqWRybwk

From the video we understand that the Pima Indian tribe has a gene which gets aggravated by eating food high in sugar. The UCI Pima Indian data set is a collection of data on females from the Pima tribe. In the data set of 768 rows, 268 of them have diabetes.

You can find the data set description here -> https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names

The problem statement is to correctly classify and predict whether a female has diabetes or not. Thus it's a classification problem.

The good news for us is that the data set has no null or missing values and, as the cherry on top, it is completely numeric. Only the target variable Outcome and Pregnancies are factor variables. The remaining variables are continuous numeric variables.

Decision Tree and Interpretation on Telecom Data

We saw that logistic regression was a bad model for our telecom churn analysis, which leaves us with the decision tree.

Again we have two data sets: the original data and the over sampled data. We run the decision tree model on both of them and compare our results.

Running the decision tree on the normal data set yielded better results than running it on the over sampled data set.

                   Accuracy  Kappa   Precision  Recall   AUC
Data               0.9482    0.7513  0.68421    0.90698  0.837
Over Sampled Data  0.8894    0.5274  0.5965     0.5862   0.7656

Unfortunately the decision tree plot was too big for me to put in this post.

As the decision tree gives the highest accuracy, we will select it as the clear winner for our telecom churn analysis problem.

Another major advantage of a decision tree is that it can be shown graphically, which makes it very easy to explain to the end business user why a particular choice is being made.

You can find the code for the decision tree here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Decision_Tree.R

This was a dummy database and may not have yielded the best results, but it is a perfect exercise for practice.

Logistic Regression And Interpretation On Telecom Data

If you have read my previous posts, you will have understood how the feature engineering was done and why we are running a logistic regression on this data.

It is essential to understand that we have two train sets

  1. The original train set
  2. The over sampled train set

Running logistic regression on the normal data set yielded the following results

#Accuracy 87.53%
#Kappa 0.275
#Precision 22.807%
#Recall 59.091%
#AUC 0.602


Now running logistic regression on the over sampled data yielded the following results

#Accuracy 84.71%
#Kappa 0.3265
#Precision 40.351%
#Recall 42.593%
#AUC 0.660


From both models we can see that when we use AUC as our metric, the over sampled data is clearly the winner. We will also rely on the second model more because its Kappa value is higher and its precision and recall values are closer to each other.

One massive problem becomes visible when we compare against the null model: the accuracy of our best model is 84.71%, while running no model at all and simply stating "customer retained" gives 85.8%. This means our model is not as effective as we would think, so we should either try more feature engineering or a different model.

As this data is not real, it could be that our accuracy will always be bad. But let's assume logistic regression yielded a good result and try to understand the equation:

Coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) 8.318e+00 1.078e+00 7.713 1.23e-14 ***
StateAL 1.767e-02 5.858e-01 0.030 0.975929 
StateAR -6.454e-01 6.095e-01 -1.059 0.289611 
StateAZ 6.812e-01 6.998e-01 0.973 0.330355 
StateCA -2.015e+00 5.959e-01 -3.382 0.000719 ***
StateCO -5.204e-01 5.958e-01 -0.873 0.382438 
StateCT -9.923e-01 5.768e-01 -1.720 0.085364 . 
StateDC -7.558e-01 6.655e-01 -1.136 0.256046 
StateDE -8.036e-01 5.780e-01 -1.390 0.164413 
StateFL -5.576e-01 5.832e-01 -0.956 0.339026 
StateGA -8.233e-01 5.512e-01 -1.494 0.135263 
StateHI 3.252e-01 7.279e-01 0.447 0.655050 
StateIA 2.778e-02 7.161e-01 0.039 0.969051 
StateID -4.276e-01 5.647e-01 -0.757 0.448895 
StateIL -1.270e+00 5.720e-01 -2.220 0.026441 * 
StateIN -7.517e-01 5.884e-01 -1.278 0.201372 
StateKS -1.164e+00 5.425e-01 -2.145 0.031918 * 
StateKY -7.379e-01 6.068e-01 -1.216 0.223949 
StateLA -1.080e+00 5.963e-01 -1.811 0.070173 . 
StateMA -1.541e+00 5.610e-01 -2.746 0.006032 ** 
StateMD -1.164e+00 5.565e-01 -2.092 0.036455 * 
StateME -1.915e+00 5.471e-01 -3.500 0.000465 ***
StateMI -1.501e+00 5.746e-01 -2.612 0.009011 ** 
StateMN -8.528e-01 5.486e-01 -1.555 0.120064 
StateMO 2.519e-01 6.291e-01 0.400 0.688826 
StateMS -1.467e+00 5.614e-01 -2.613 0.008987 ** 
StateMT -1.447e+00 5.473e-01 -2.644 0.008181 ** 
StateNC -8.929e-01 5.817e-01 -1.535 0.124820 
StateND -6.750e-01 6.037e-01 -1.118 0.263526 
StateNE -6.011e-01 5.911e-01 -1.017 0.309221 
StateNH -8.939e-01 6.064e-01 -1.474 0.140435 
StateNJ -1.738e+00 5.556e-01 -3.128 0.001761 ** 
StateNM -1.151e+00 5.471e-01 -2.104 0.035366 * 
StateNV -1.757e+00 5.525e-01 -3.180 0.001473 ** 
StateNY -1.080e+00 5.650e-01 -1.912 0.055908 . 
StateOH -5.434e-01 5.577e-01 -0.974 0.329891 
StateOK -1.484e+00 5.837e-01 -2.543 0.011001 * 
StateOR -4.159e-01 5.561e-01 -0.748 0.454572 
StatePA -8.248e-01 6.262e-01 -1.317 0.187836 
StateRI 4.828e-01 6.553e-01 0.737 0.461271 
StateSC -1.327e+00 5.734e-01 -2.313 0.020695 * 
StateSD -1.419e+00 5.936e-01 -2.390 0.016838 * 
StateTN -2.747e-01 5.931e-01 -0.463 0.643201 
StateTX -2.148e+00 5.466e-01 -3.929 8.53e-05 ***
StateUT -7.398e-01 5.785e-01 -1.279 0.200914 
StateVA 7.518e-01 6.311e-01 1.191 0.233547 
StateVT -4.988e-01 5.869e-01 -0.850 0.395327 
StateWA -1.369e+00 5.698e-01 -2.402 0.016308 * 
StateWI -2.333e-01 5.906e-01 -0.395 0.692830 
StateWV -4.497e-01 5.560e-01 -0.809 0.418600 
StateWY -1.921e-01 5.780e-01 -0.332 0.739637 
Account_Length -1.719e-03 1.189e-03 -1.446 0.148198 
Area_Code 1.860e-03 1.085e-03 1.714 0.086489 . 
Phone_No -1.627e-07 1.687e-07 -0.964 0.334881 
International_Plan yes -2.516e+00 1.206e-01 -20.858 < 2e-16 ***
Voice_Mail_Plan yes -1.028e-01 1.447e-01 -0.710 0.477407 
No_Vmail_Messages -2.941e-03 5.303e-03 -0.555 0.579144 
Total_Day_minutes -4.437e+00 2.775e+00 -1.599 0.109815 
Total_Day_Calls 3.982e-05 2.389e-03 0.017 0.986701 
Total_Day_charge 2.603e+01 1.632e+01 1.595 0.110808 
Total_Eve_Minutes -1.862e+00 1.418e+00 -1.313 0.189311 
Total_Eve_Calls -4.211e-03 2.379e-03 -1.770 0.076674 . 
Total_Eve_Charge 2.182e+01 1.668e+01 1.308 0.190938 
Total_Night_Minutes 9.630e-01 7.453e-01 1.292 0.196293 
Total_Night_Calls -6.086e-04 2.392e-03 -0.254 0.799175 
Total_Night_Charge -2.143e+01 1.656e+01 -1.294 0.195715 
Total_Intl_Minutes 2.219e+00 4.482e+00 0.495 0.620579 
Total_Intl_Calls 1.075e-01 2.053e-02 5.233 1.67e-07 ***
Total_Intl_Charge -8.763e+00 1.660e+01 -0.528 0.597585 
No_CS_Calls -5.475e-01 3.540e-02 -15.466 < 2e-16 ***
---

Can't read it? Well, imagine you just built this model and your boss calls and asks: there is a customer whose state is NV and whose total calls, charges and duration are xyz. Will he leave the telecom operator? If yes, please explain why.

What will you say? Well, it's easy: you look at the above table and start. Every factor that your boss gave fits into the equation, and you can quantitatively justify your answer. All of this thanks to the historical data.
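To show what "fits into the equation" means in practice, here is a minimal sketch that plugs a few of the coefficients from the table above into the logit equation. It is a deliberate simplification: only five terms are used and every other term is ignored, the customer values are hypothetical, and which class the resulting probability refers to depends on how the response factor was coded, which the output above does not tell us.

```python
import math

# A handful of estimates copied from the coefficient table above
coefficients = {
    "(Intercept)": 8.318,
    "StateNV": -1.757,
    "International_Plan yes": -2.516,
    "Total_Intl_Calls": 0.1075,
    "No_CS_Calls": -0.5475,
}

# Hypothetical customer (illustration only); dummy variables are 1 when true
customer = {
    "StateNV": 1,
    "International_Plan yes": 1,
    "Total_Intl_Calls": 3,
    "No_CS_Calls": 4,
}


def sigmoid(z):
    """Map log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Linear predictor: intercept plus coefficient times value for each term used
log_odds = coefficients["(Intercept)"] + sum(
    coefficients[name] * value for name, value in customer.items()
)
probability = sigmoid(log_odds)

print(f"log-odds: {log_odds:.4f}")
print(f"probability of the modelled class: {probability:.3f}")
```

Each coefficient shifts the log-odds by a fixed amount per unit of its variable, which is what lets you justify an answer term by term.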

For simplicity let's consider the equation

y = 45 + 60*(age)
where y = salary
45 = intercept

How would you interpret this equation? Obviously you would say that as age increases, salary increases too. Right?

However, think again, and think hard this time: what if I told you age is 0? Now explain it to me. I'm sure you see that a newborn cannot have a salary of 45$ without doing anything. This is where business understanding or domain knowledge comes into play.

We should usually avoid explaining the intercept unless business understanding helps us to explain it. This is a gray area, so it's better to avoid explaining it than to make a mess of it.

However, imagine this same kind of equation for a packet of wafers

y = 45 + 0.1(weight)

Here we could simply say that the intercept, 45 g, is the mean weight a packet of wafers should have; since that is not always exactly true, the coefficient term adds the variation.

That is why the intercept can be explained in some places and cannot be in others.

You can find the code for logistic regression Here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Logistic_Regression.R

Determining Feature Importance For Telecom Data

We have a complete data set  -> Check

Feature engineering done -> Check

How many variables do we have?    20 variables

How many should we ideally use ?   Not more than 10 ideally

How to determine which variables to include and which not to ?   It's simple, do Boruta!!

What's Boruta?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. You can read about it here -> https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/   Analytics Vidhya has given a pretty good explanation of it there.

Now keep one important thing in mind: we have two train sets, 1) the normal train set and 2) the SMOTE train set.

Upon running Boruta on the normal train set, Boruta confirmed the variables International_Plan, Voice_Mail_Plan, No_Vmail_Messages, Total_Day_minutes, Total_Day_charge, Total_Eve_Minutes, Total_Eve_Charge, Total_Night_Minutes, Total_Night_Charge, Total_Intl_Minutes, Total_Intl_Calls, Total_Intl_Charge and No_CS_Calls as important.

And upon running Boruta on the SMOTE data set, Boruta confirmed all the variables as important. You can find the Boruta code below

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Boruta_Imp_FE.R

Feature Engineering On Telecom Data

Although the telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values, there is a severe class imbalance.

That is why the only thing we will concentrate on in our feature engineering is eliminating the class imbalance.

> summary(train$Customer_Left)
False True 
 2850 483

It's visible that our training set has 2850 retained customers and 483 customers who left. Because of this I will oversample the customers who left to balance the data set.

Let us assume that I do not oversample; then, without building any model at all, I can simply say "customer retained" and still be right 85.8% of the time. In order to break this bias I use SMOTE (Synthetic Minority Oversampling Technique). You can read the research paper, published in the Journal of Artificial Intelligence Research 16 (2002), here -> https://www.jair.org/media/953/live-953-2037-jair.pdf

> train$Customer_Left<-as.numeric(train$Customer_Left)
> summary(as.factor(train$Customer_Left))
 1 2 
 483 2850 
> train$Customer_Left[train$Customer_Left==2]<-0
> summary(as.factor(train$Customer_Left))
 0 1 
2850 483  
# here 0 -> False (retained, 2850 rows)
#      1 -> True (left, 483 rows)
> train$Customer_Left<-as.factor(train$Customer_Left)
> ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
> ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
> summary(ntrain$Customer_Left)
 0 1 
1932 1449

We have now under sampled the retained customers from 2850 to 1932 and over sampled the customers who left the operator from 483 to 1449.
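The two numbers can be reproduced by hand. The sketch below is plain arithmetic, not the SMOTE algorithm itself, and it assumes the under-sampling percentage was left at 200 since the call above only sets perc.over; the function name `smote_sizes` is made up for illustration.

```python
def smote_sizes(n_minority, perc_over, perc_under=200):
    """Return (minority, majority) class sizes after over and under sampling.

    perc_over:  synthetic minority cases generated, as a percentage of the minority
    perc_under: majority cases kept, as a percentage of the synthetic cases
                (200 is an assumption, the call in the post does not set it)
    """
    synthetic = n_minority * perc_over // 100
    minority_after = n_minority + synthetic
    majority_after = synthetic * perc_under // 100
    return minority_after, majority_after


# Training counts from the summary output above
retained_before, left_before = 2850, 483
left_after, retained_after = smote_sizes(left_before, perc_over=200)

share_before = retained_before / (retained_before + left_before)
share_after = retained_after / (retained_after + left_after)

print(f"left: {left_before} -> {left_after}")
print(f"retained: {retained_before} -> {retained_after}")
print(f"majority share in train: {share_before:.3f} -> {share_after:.3f}")
```

The drop in the majority share is the point of the exercise: after resampling, always answering "retained" is no longer an easy way to look accurate on the training data.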

Since the train set has been transformed, I also had to apply the same transformation to the test set.

You can see the complete code here -> https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R

NOTE -> The libararies.R file contains code that loads the packages needed; if they are not installed on your machine it will download and install them.

Churn Analysis On Telecom Data

One of the major problems that telecom operators face is customer retention. Because of this, the majority of telecom operators want to know which customers are most likely to leave them, so that they can immediately take action, like providing a discount or a customised plan, to retain the customer.

However, the accuracy required of a churn analysis model needs to be very high. Imagine our model has an accuracy of just 75% and only 5% of customers actually want to leave; that can leave a margin of 20% of customers wrongly classified as about to leave. If an operator has 10000 customers and 2500 are predicted to leave, the operator may have to release, let's assume, a 1$ credit to each of them, a cost of 2500$, whereas credits were only required for the 5% of customers who really wanted to leave, a cost of 500$. Hence the operator spent 2000$ for no reason. With a larger customer base, this would lead to a huge loss.
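The cost argument above is simple arithmetic, sketched here with the same numbers. Like the example in the text, it assumes every customer who really wants to leave is among those flagged by the model.

```python
customers = 10_000
flagged_to_leave = 2_500      # customers the model predicts will leave
true_leaver_percent = 5       # share of customers who actually want to leave
credit_per_customer = 1       # retention credit in dollars (assumed in the text)

# Credit is released to everyone the model flags
spent = flagged_to_leave * credit_per_customer

# Credit that was actually needed, for the real leavers only
true_leavers = customers * true_leaver_percent // 100
needed = true_leavers * credit_per_customer

wasted = spent - needed

print(f"spent:  {spent}$")
print(f"needed: {needed}$")
print(f"wasted: {wasted}$")
```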

Coming to the data: as far as I know there is no freely available real telecom data. However, the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis. This data is not real, but it represents real world scenarios and is good for understanding and learning.

The data on the website is already split into train and test and has no NAs, which means there is no feature engineering as such to be done before running models on it.

Now comes the question of which models to run on it. Some would say that since we need very high accuracy we should run XGBOOST or random forest. The downside is that we cannot explain to the operator on what basis XGBOOST or random forest decides that a customer will leave. Even if we manage to explain it, the explanation is very complicated and will not be accepted.

Because of this we have to rely on models that can be easily explained to the operator. This leaves us with two models for our classification, i.e. customer leaves -> 0 or customer is retained -> 1: logistic regression and decision tree.

Why logistic regression? Well, because thanks to the logit equation we can explain to the operator why a customer is leaving.

Why decision tree? Well, because there is a neat flow showing how the tree makes its decisions, splitting on variables and deciding yes or no based on entropy and impurity.
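As a rough illustration of the two measures a tree can use when choosing a split, here is a minimal sketch of entropy and Gini impurity for a node described by its class counts. It shows the formulas only, not how any particular R package implements its splits; the function names are made up for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node, given the count of each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Root node of the training data: 2850 retained, 483 left
root = [2850, 483]
print(f"entropy of the root node: {entropy(root):.3f}")
print(f"gini of the root node:    {gini(root):.3f}")

# A perfectly mixed node is as impure as it gets; a pure node scores 0
print(entropy([50, 50]), gini([50, 50]))
print(entropy([100, 0]), gini([100, 0]))
```

A split is attractive when the child nodes it produces are purer, i.e. score lower on these measures, than the node being split.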

Further posts in this category will go from feature engineering, to running models, to interpretation.

The data available from the website is a bit complex to save to a CSV file, so if you need it you can download the train and test data from the link below.

An explanation of the variables is not provided, as they are fairly simple.

https://github.com/mmd52/Telecom_Churn_Analysis

Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model

machine-learning-on-uci-adult-data-set-using-various-classifier-algorithms-and-scaling-up-the-accuracy-using-extreme-gradient-boosting