
R Shiny Apps for Time Series

As the name suggests, it is shiny.

Shiny is a new package from RStudio that makes it incredibly easy to build interactive web applications with R. For an introduction and live examples, visit the Shiny homepage.

Why Shiny? Well, for starters it is free and simple to use and deploy. If you are planning to use Shiny commercially you will have to pay for hosting your apps, but otherwise it is free, and you can easily deploy your Shiny apps online at https://www.shinyapps.io

So why is Shiny useful to us? It can be used to build an interactive dashboard or to let a person interact with R through a GUI, with no coding required from the end user. Shiny apps are also highly dynamic and can be customised so the end user can tweak settings as they like.

My problem: for several time series data sets I found myself repetitively checking a few common things. Is the data stationary? What does the data look like? Does it require a transformation? And, most importantly from a lazy man's perspective, will auto.arima do the trick? 😛

I decided to automate this manual task via Shiny and demonstrate a small example. So what does my Shiny app do?

  1. It accepts single-column input from any text file that you feed in
  2. It asks the user whether the file has a header
  3. It asks for the start year, start month and frequency of the data
  4. Using this information it plots the ACF and PACF graphs
  5. It also executes auto.arima and plots the raw time series, to get an understanding of the data

This is a small example and hence it is simple; we could build much more complicated things. But for anyone working with time series, this app saves precious time otherwise spent on routine checks.

To build a Shiny app we need to write two files, one for the UI and one for the back-end processing, i.e. ui.R and server.R.

Both these files start by loading the Shiny library:

library(shiny)
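For illustration, here is a minimal single-file sketch of such an app. The actual repo splits the code into ui.R and server.R, and the input names and layout below are my assumptions rather than the exact code:

# Minimal sketch (not the repo's exact code): read a one-column text file,
# build a ts object, then plot the series, ACF, PACF and an auto.arima summary.
library(shiny)
library(forecast)

ui <- fluidPage(
  titlePanel("Quick ARIMA check"),
  sidebarLayout(
    sidebarPanel(
      fileInput("file", "Upload a single-column text file"),
      checkboxInput("header", "File has a header", value = FALSE),
      numericInput("year", "Start year", value = 2000),
      numericInput("month", "Start month", value = 1, min = 1, max = 12),
      numericInput("freq", "Frequency", value = 12)
    ),
    mainPanel(
      plotOutput("tsPlot"),
      plotOutput("acfPlot"),
      plotOutput("pacfPlot"),
      verbatimTextOutput("arimaFit")
    )
  )
)

server <- function(input, output) {
  series <- reactive({
    req(input$file)
    x <- read.table(input$file$datapath, header = input$header)[, 1]
    ts(x, start = c(input$year, input$month), frequency = input$freq)
  })
  output$tsPlot   <- renderPlot(plot(series(), main = "Time series"))
  output$acfPlot  <- renderPlot(acf(series(), main = "ACF"))
  output$pacfPlot <- renderPlot(pacf(series(), main = "PACF"))
  output$arimaFit <- renderPrint(forecast::auto.arima(series()))
}

shinyApp(ui = ui, server = server)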

Shiny offers an extensive tutorial here -> https://shiny.rstudio.com/

You can view my app here -> https://mohammedtopiwalla.shinyapps.io/arima_shiny/

My code is relatively simple too; you can view it on GitHub here -> https://github.com/mmd52/Arima_Shiny

To run the app you will need a text file with time series data; you can download a sample from -> https://github.com/mmd52/Arima_Shiny/blob/master/data.txt

 

Running Various Models on the Pima Indian Diabetes Data Set

EDA was done and various inferences were found; now we will run various models and verify whether the predictions match the inferences.

As I mentioned in the previous post, my focus is on the code and inference, which you can find in the Python notebooks or R files.

R

Model                 Accuracy  Precision  Recall  Kappa   AUC
Decision Tree         73.48     75.33      82.48   0.4368  0.727
Naïve Bayes           75.22     82         80.39   0.4489  0.723
KNN                   73.91     86.67      76.47   0.3894  0.683
Logistic Regression   76.09     82.67      81.05   0.4683  0.732
SVM Simple            73.91     86.67      76.47   0.3894  0.683
SVM 10 Folds          73.04     82.67      77.5    0.388   0.6883
SVM Linear 10 Folds   78.26     88.67      80.12   0.4974  0.7371
Random Forest         76.52     84         80.77   0.4733  0.733
XGBOOST               77.83     91.61      77.06   0.4981  0.843

Python

Model                 Accuracy  Precision  Recall  Kappa   AUC
Decision Tree         72.73     73         73      0.388   0.7
Naïve Bayes           80.51     80         81      0.5689  0.78
KNN                   70.99     70         71      0.337   0.66
Logistic Regression   74.45     74         74      0.3956  0.68
SVM Simple            73.16     73         73      0.4007  0.69
Random Forest         76.62     77         77      0.48    0.73
XGBOOST               79.22     79         79      0.526   0.76

As we can see from the tables above, XGBOOST was the clear winner in R and among the top performers in Python (where Naïve Bayes had the highest accuracy).

The Code for Python you can find at -> https://github.com/mmd52/Pima_Python

The code for R you can find at -> https://github.com/mmd52/Pima_R

Exploratory Data Analysis

We have a classification problem. Our data set has 8 independent variables in total, out of which one is a factor and 7 are continuous. This means we should have at least 8 plots.

The target variable Outcome should be plotted against each independent variable if we want to derive any inferences and leave no stone unturned.

So if we need to plot 2 factor variables, we should preferably use a stacked bar chart or mosaic plot.

For one numeric and one factor variable, bar plots seem like a good option.

And for two numeric variables we have our faithful scatter plot to the rescue.
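To make those choices concrete, here is a small hedged sketch of such plots in base R, assuming the standard diabetes.csv column names (Pregnancies, Glucose, BMI, Outcome); the repo's EDA code may differ:

# Minimal EDA sketch; column names follow the usual Pima diabetes.csv layout
diabetes <- read.csv("diabetes.csv")
diabetes$Outcome <- factor(diabetes$Outcome, labels = c("No", "Yes"))

# Two factors: mosaic plot (Pregnancies binned into a factor for illustration)
preg_bin <- cut(diabetes$Pregnancies, breaks = c(-1, 2, 6, Inf),
                labels = c("0-2", "3-6", "7+"))
mosaicplot(table(preg_bin, diabetes$Outcome), main = "Pregnancies vs Outcome")

# One numeric, one factor: bar plot of mean Glucose per Outcome group
barplot(tapply(diabetes$Glucose, diabetes$Outcome, mean),
        ylab = "Mean Glucose", main = "Glucose by Outcome")

# Two numeric variables: scatter plot, coloured by Outcome
plot(diabetes$BMI, diabetes$Glucose, col = diabetes$Outcome,
     xlab = "BMI", ylab = "Glucose", main = "BMI vs Glucose")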

In this blog post I will not stress much on words but more on the code and the inferences made, which are well explained and documented in the code itself.

I strongly suggest you view the code below, which contains the inferences and has a well-documented structure.

You can download the data from

DATA -> https://github.com/mmd52/Pima_R (the file named diabetes.csv is the one)

R Code -> https://github.com/mmd52/Pima_R/blob/master/EDA.R (a fair warning: to execute the EDA code in R you will first need to run https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code -> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (it's a Jupyter Notebook)

Decision Tree and Interpretation on Telecom Data

We saw that logistic regression was a bad model for our telecom churn analysis, which leaves us with the decision tree.

Again we have two data sets, the original data and the over-sampled data. We run the decision tree model on both of them and compare our results.

Running the decision tree on the normal data set yielded better results than running it on the over-sampled data set:

                    Accuracy  Kappa   Precision  Recall   AUC
Data                0.9482    0.7513  0.68421    0.90698  0.837
Over Sampled Data   0.8894    0.5274  0.5965     0.5862   0.7656
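For reference, a hedged sketch of what such a run looks like with rpart; the file names, the churn data frame and the Churn column name below are assumptions, not the repo's exact code:

# Sketch of a decision-tree churn run (rpart); data frame and column names assumed
library(rpart)
library(rpart.plot)
library(caret)
library(pROC)

train <- read.csv("train.csv")   # assumed file names, not the repo's
test  <- read.csv("test.csv")
train$Churn <- factor(train$Churn)
test$Churn  <- factor(test$Churn)

fit <- rpart(Churn ~ ., data = train, method = "class")
rpart.plot(fit)                                   # the tree itself

pred_class <- predict(fit, test, type = "class")
pred_prob  <- predict(fit, test, type = "prob")[, 2]

confusionMatrix(pred_class, test$Churn)           # accuracy, kappa, precision, recall
auc(roc(test$Churn, pred_prob))                   # AUC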

Unfortunately the decision tree plot was too big for me to put it in this post.

As the decision tree gives the highest level of accuracy, we will select it as the clear winner for our telecom churn analysis problem.

Another major advantage of the decision tree is that it can be explained graphically, very easily, to the end business user, showing why a particular choice is being made.

You can find the code for decision tree here->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Decision_Tree.R

This was a dummy data set and may not have yielded the best results, but it is a perfect exercise for practice.

Determining Feature Importance For Telecom Data

We have a complete data set  -> Check

Feature engineering done -> Check

How many variables do we have?    20 variables

How many should we ideally use?   Not more than 10, ideally.

How to determine which variables to include and which to leave out?   It's simple: run Boruta!

What's Boruta?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around random forest. Analytics Vidhya has a pretty good explanation of it here -> https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/

Now keep one important thing in mind: we have two train sets, 1) the normal train set and 2) the SMOTE train set.

So upon running Boruta on the normal train set, Boruta confirmed the following variables as important: International_Plan, Voice_Mail_Plan, No_Vmail_Messages, Total_Day_minutes, Total_Day_charge, Total_Eve_Minutes, Total_Eve_Charge, Total_Night_Minutes, Total_Night_Charge, Total_Intl_Minutes, Total_Intl_Calls, Total_Intl_Charge and No_CS_Calls.

Upon running Boruta on the SMOTE data set, Boruta confirmed all the variables as important. You can find the Boruta code below:

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Boruta_Imp_FE.R
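For a sense of what that code does, a minimal Boruta call looks roughly like this (a sketch only; the train data frame and the Churn target are my assumptions):

# Minimal Boruta sketch; data frame and target column are assumed, not the repo's objects
library(Boruta)

set.seed(123)
boruta_out <- Boruta(Churn ~ ., data = train, doTrace = 1)
print(boruta_out)                          # Confirmed / Tentative / Rejected attributes
getSelectedAttributes(boruta_out, withTentative = FALSE)
plot(boruta_out, las = 2, cex.axis = 0.7)  # importance box plots per variable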

Churn Analysis On Telecom Data

One of the major problems telecom operators face is customer retention. Because of this, the majority of telecom operators want to know which customers are most likely to leave them, so that they can immediately take action, such as offering a discount or a customised plan, to retain the customer.

However, the accuracy required of a churn analysis model needs to be very high. Imagine our model has an accuracy of just 75% while only 5% of customers actually want to leave; that leaves a margin of around 20% of customers who are wrongly classified as likely to leave the operator. If an operator has 10,000 customers and 2,500 of them are predicted to leave, the operator may release, let's assume, a $1 credit to each of them, a cost of $2,500, whereas credits were really only needed for the 5% of customers who actually churn, a cost of $500. The operator has therefore spent $2,000 for no reason, and with a larger customer base this would lead to a huge loss.

Coming to the data, there is no freely available real telecom data as far as I know; however, the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis. This data is not real, but it represents real-world scenarios and is good from the perspective of understanding and learning.

The data on the website is already split into train and test sets and has no NAs, which means there is no feature engineering as such to be done before running models on it.

Now comes the question of which models to run on it. Some would say that since we need very high accuracy we should run XGBOOST or random forest; however, the downside is that we cannot easily explain to the operator on what basis XGBOOST or random forest decides that a customer will leave. Even if we manage to explain it, it is very complicated and will not be accepted.

Because of this we have to rely on models that can be easily explained to the customer. This leaves us with two models for classification, i.e. customer leaves -> 0 or customer is retained -> 1: logistic regression and the decision tree.

Why logistic regression? Well, because we can explain to the operator why a customer is leaving, thanks to the logit equation.
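As a hedged illustration of that explainability (not the repo's code; the train data frame is assumed and the column names are taken from the Boruta section above):

# Logistic regression on the churn training set; train and its coding are assumptions
churn.fit <- glm(Churn ~ International_Plan + Total_Day_minutes + No_CS_Calls,
                 family = binomial(link = "logit"), data = train)

summary(churn.fit)     # coefficients of the logit equation
exp(coef(churn.fit))   # odds ratios for each variable

Each exponentiated coefficient reads as "the odds of the modelled outcome multiply by this factor per unit change in the variable", which is the kind of statement an operator can act on.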

Why the decision tree? Well, because there is a neat flow to how the tree makes decisions, splitting on variables and deciding yes or no based on entropy and impurity.

Further on in this post category I will show everything from feature engineering to running the models to interpretation.

The data available from the website is a bit complex to save to a CSV file, so if you need it you can download the train and test data from the link below.

An explanation of the variables is not provided, as they are fairly self-explanatory.

https://github.com/mmd52/Telecom_Churn_Analysis

Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model

Machine Learning on UCI Adult Data Set Using Various Classifier Algorithms and Scaling Up the Accuracy Using Extreme Gradient Boosting

Revised Approach To UCI ADULT DATA SET

If you have seen the posts in the UCI Adult data set section, you may have realised I am not getting above 86% accuracy.

An important thing I learnt the hard way was to never eliminate rows in a data set. It's fine to eliminate columns with more than 30% NA values, but never eliminate rows.

Because of this I had to redo my feature engineering. So how did I fix my missing NA values? Well, what I did was open my data set in Excel and convert all '?' values to 'NA'.

This made the feature engineering simpler. The next step was to identify the columns with missing values and check whether their missing values exceeded 30% in totality.

In our case type_employer had 1836 missing values

occupation had a further 1843 missing values

and country had 583 missing values.

So what I did was predict the missing values with the help of the other independent variables (no, I did not use income to predict them). Once each model was built, I used it to replace the missing values in that column. Thus I had a clean data set with no missing values.
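A hedged sketch of that idea is below; the file name, the column names and the use of rpart are my assumptions, and the repo's imputation code may well differ:

# Model-based imputation sketch for the categorical columns with missing values
library(rpart)

adult <- read.csv("adult.csv", na.strings = "NA", stringsAsFactors = TRUE)

impute_column <- function(df, target) {
  known   <- df[!is.na(df[[target]]), ]
  unknown <- df[is.na(df[[target]]), ]
  # Predict the missing factor from the other predictors, excluding income
  fit <- rpart(reformulate(setdiff(names(df), c(target, "income")), target),
               data = known, method = "class")
  df[[target]][is.na(df[[target]])] <-
    predict(fit, newdata = unknown, type = "class")
  df
}

for (col in c("type_employer", "occupation", "country")) {
  adult <- impute_column(adult, col)
}
colSums(is.na(adult))   # those columns should now have zero missing values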

I admit the predictions were not that great , but they were tolerable.

Because of this, when I ran the following models my accuracy skyrocketed:

  1. Logistic Regression -> 85.38%
  2. Random Forest(Excluding variable country)  -> 87.11%
  3. SVM -> 85.8%
  4. XGBOOST with 10 folds -> 87.08%


Extreme Gradient Boosting

The term 'boosting' refers to a family of algorithms which convert weak learners into strong learners.

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify 'spam' and 'not spam' emails using the following criteria. If:

  1. The email has only one image file (a promotional image), it's SPAM
  2. The email body consists of a sentence like "You won a prize money of $ xxxxxx", it's SPAM
  3. The email is from a known source, it's not SPAM

Above, we've defined multiple rules to classify an email as 'spam' or 'not spam'. But do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as 'spam' or 'not spam'; therefore, they are called weak learners. To convert a weak learner into a strong learner, we combine the predictions of each weak learner to form one definitive strong learner.
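For a concrete feel, here is a minimal XGBoost sketch in R (not from the post itself): gradient boosting of shallow trees, each new tree correcting the errors of the ensemble built so far.

# Minimal XGBoost example on the demo data shipped with the package
library(xgboost)

data(agaricus.train, package = "xgboost")   # built-in binary classification data
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

params <- list(objective = "binary:logistic",
               max_depth = 2,      # weak (shallow) learners
               eta = 0.3)          # learning rate

bst <- xgb.train(params, dtrain, nrounds = 50,
                 watchlist = list(train = dtrain), verbose = 0)
head(predict(bst, agaricus.train$data))     # predicted class probabilities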


Logistic Regression, Random Forest and SVM on a Numerical Data Set

So it's been a long time. We have finally got the data just how we want it.

Great, so the data is ready and we already have a bit of knowledge of logistic regression and random forest.

So, going ahead first with logistic regression:

# Logistic regression of income (<=50K / >50K) against all other variables
logit.fit <- glm(income ~ ., family = binomial(link = "logit"), data = training_data)

On executing this magic line I land at an accuracy of 80%. Naaaaah, not what we wanted.

So, going ahead with random forest:

library(randomForest)

# Tune mtry first, then fit the forest with the chosen value
bestmtry <- tuneRF(training_data[, -14], as.factor(training_data[, 14]),
                   ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                   trace = TRUE, plot = TRUE, doBest = FALSE)
rf.fit <- randomForest(income ~ ., data = training_data, mtry = 4, ntree = 1000,
                       keep.forest = TRUE, importance = TRUE)  # predict on x_test with predict()

Yes !!

This finally returned 86%; it looks like we are doing great. We finally did it!

Trying out SVM now.

But wait, what is SVM (support vector machines)?

Think of all the data points plotted in a space that we can't visualise. But imagine we had 2D data; then, in very vague terms, SVM would draw lines for us that help us clearly classify whether a data point belongs to the 50K-and-above group or the 50K-and-below group.

SVM finds hyperplanes, and these planes are calculated in such a way that they are equidistant from both classes.

In SVM, a plane with maximum margin is a good plane and a plane with minimum margin is a bad plane.
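For completeness, a hedged SVM sketch with e1071 (the repo's code may differ; test_data is an assumed held-out data frame):

# Linear-kernel SVM on the same training data; test_data is an assumption
library(e1071)

svm.fit  <- svm(income ~ ., data = training_data, kernel = "linear", cost = 1)
svm.pred <- predict(svm.fit, newdata = test_data)
mean(svm.pred == test_data$income)   # accuracy on the held-out set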

With that said, you can find the code for random forest and logistic regression here ->

code

and for SVM here ->

SVM code