Linear Regression using TensorFlow

The best way to start something new is to start with something simple.

In our case, let's do linear regression, in which we will try to predict the price of a house from its size. Yes, we will use made-up data, but that's fine.

First things first: everything in TensorFlow is in the form of an array, so we begin by initialising our data as arrays.

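# For linear regression
import numpy as np

# House areas (x) and prices (y), each as a column vector of shape (13, 1)
Area = np.array([[987], [452], [876], [201], [349], [195], [1000], [1501], [555], [724],
                 [652], [328], [895]])
price = np.array([[1974], [904], [1752], [402], [698], [390], [2000], [3002], [1110],
                  [1448], [1304], [656], [1790]])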

Okay, so we have areas and prices, that is, our x and y, both in the form of NumPy arrays.

The next step is a crucial one. In it we will decide on

  1. Number of iterations
  2. Learning rate
  3. Cost function

Why these three? We need them to find the smallest error, which we do using gradient descent: the cost function measures the error, the learning rate sets the size of each update, and the number of iterations sets how many updates we make.

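import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

learning_rate = 0.01
training_epochs = 1000
# Start with an empty history; np.empty(shape=[1]) would leave one uninitialised value in it
cost_history = np.empty(shape=[0], dtype=float)

# Number of input features; assumes the training array has the same columns as Area
n_dim = Area.shape[1]

# Placeholders receive their data only when the graph is run
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.ones([n_dim, 1]))

# Replaces the deprecated tf.initialize_all_variables()
init = tf.global_variables_initializer()

y_ = tf.matmul(X, W)                      # predictions
cost = tf.reduce_mean(tf.square(y_ - Y))  # mean squared error
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)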

An important thing to note is that nothing here has actually been executed. TensorFlow operations only run when they are explicitly called inside a session. Until then, we define placeholders for the data that will be fed in and variables for the parameters to be learned.

So, for example, we need x, y and W for

y=W*x+b

Note that the code above computes only W*x. It has no separate bias term b, so b is learned only if the input X carries an extra column of ones.
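To make the "define first, run later" idea concrete, here is a toy sketch in plain Python. It is only an analogy and not how TensorFlow is implemented; every class name in it (Placeholder, Constant, Multiply, Add) and the run function are made up for this illustration.

```python
class Placeholder:
    """A named slot whose value is supplied only when the graph is run."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed_dict):
        return feed_dict[self.name]


class Constant:
    """A fixed value stored in the graph."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed_dict):
        return self.value


class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed_dict):
        return self.left.evaluate(feed_dict) * self.right.evaluate(feed_dict)


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed_dict):
        return self.left.evaluate(feed_dict) + self.right.evaluate(feed_dict)


def run(node, feed_dict):
    """Plays the role of a session run: only now is anything computed."""
    return node.evaluate(feed_dict)


x = Placeholder("x")
W = Constant(2.0)
b = Constant(0.5)

# Building the expression computes nothing; y is just a description of W*x + b
y = Add(Multiply(W, x), b)

# The value of x is fed in at run time, much like feed_dict in the post
result = run(y, {"x": 3.0})
print(result)  # 6.5
```

The same graph can be run again with a different feed, which is why placeholders are declared without data.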

Finally, let us execute the TensorFlow graph.

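sess = tf.Session()
sess.run(init)

# train_x and train_y are the prepared feature and target arrays (built from Area and price)
for epoch in range(training_epochs):
    # One gradient descent update, then record the cost after it
    sess.run(training_step, feed_dict={X: train_x, Y: train_y})
    cost_history = np.append(cost_history,
                             sess.run(cost, feed_dict={X: train_x, Y: train_y}))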

This trains the model and records the value of the cost function after every epoch.
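If you want to see what the optimiser is doing without TensorFlow, the same gradient descent can be sketched in plain NumPy. This is a minimal sketch under my own assumptions, not the code from the repository: the area is scaled to the range 0 to 1 and the learning rate is raised to 0.1, because with raw areas in the hundreds a learning rate of 0.01 makes each update overshoot and the cost blows up.

```python
import numpy as np

# Data from the post: every price is exactly 2 * area
area = np.array([987, 452, 876, 201, 349, 195, 1000, 1501, 555, 724,
                 652, 328, 895], dtype=float)
price = np.array([1974, 904, 1752, 402, 698, 390, 2000, 3002, 1110,
                  1448, 1304, 656, 1790], dtype=float)

# Assumption: scale the feature to [0, 1] so the updates stay stable
scale = area.max()
x = area / scale

learning_rate = 0.1     # assumption: chosen for the scaled feature
training_epochs = 1000
w = 1.0                 # same starting value as tf.ones
cost_history = []

for epoch in range(training_epochs):
    error = w * x - price
    gradient = 2.0 * np.mean(error * x)   # derivative of the mean squared error w.r.t. w
    w -= learning_rate * gradient
    cost_history.append(np.mean((w * x - price) ** 2))

# Convert the weight back to "price per unit of area"
slope = w / scale
print(round(slope, 4))
```

Because the made-up prices are exactly twice the areas, the recovered slope should settle at 2, and cost_history should shrink towards zero.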

You can find the code for this on GitHub here.

If you are looking for something with a bigger data set, you can find the code for regression on the Boston data set using TensorFlow here.

 

Introduction to TensorFlow

There is a certain hype about TensorFlow, as we all know. And why wouldn't there be hype about a package that Google decided to release for free?

What is TensorFlow and why did it come into existence? TensorFlow is a computational package used for machine learning. In rough terms, imagine a place where everything is in the form of a matrix, and you perform computations on those matrices to get your result.

It came into existence especially to deal with media content: training a machine to learn from images, audio or video requires a faster mechanism. Another amazing thing about TensorFlow is that it can make use of your GPU for training.

You can read more about TensorFlow here.

We will make use of TensorFlow to do some complex things, but as always we begin from scratch.

The first thing is the installation of TensorFlow.

For Python ->

You can read installation instructions here. Note: if you are using Windows, you may run into a problem. If you do, try using Docker.

For R ->

You can read R installation instructions here.

All the TensorFlow examples in this blog will be in Python. However, if you understand TensorFlow in Python, you can easily implement it in R.

You can validate your installation of TensorFlow by using the following code:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!',name="My_Tensor",shape=(2,2))
sess = tf.Session()
print(sess.run(hello))

R Shiny Apps for Time Series

As the name suggests, it is shiny.

Shiny is a new package from RStudio that makes it incredibly easy to build interactive web applications with R. For an introduction and live examples, visit the Shiny homepage.

Why Shiny? Well, for starters it's free and simple to use and deploy. If you are planning to use Shiny commercially, you will have to pay for hosting your apps, but otherwise it's free and you can easily deploy your Shiny apps online on https://www.shinyapps.io

So why is Shiny useful to us? It can be used to build an interactive dashboard, or to let a person interact with R through a GUI, with no coding involved for the end user. Shiny apps are also highly dynamic and can be customised so that the end user can tweak settings as they like.

My problem: for several time series data sets, I kept having to check the same few things. Is the data stationary? What does the data look like? Does it require a transformation? And, most importantly from a lazy man's perspective, will auto.arima do the trick? 😛

I decided to automate this manual task via Shiny and demonstrate a small example. So what does my Shiny app do?

  1. It accepts single-column input from any text file that you feed in
  2. It asks the user whether there is a header
  3. It asks for the start year, month and frequency of the data
  4. Using this information, it plots the ACF and PACF graphs
  5. It also runs auto.arima and plots the time series itself, to give an understanding of the data
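For readers wondering what the ACF plot in step 4 actually shows, here is a rough sketch of the sample autocorrelation in Python with NumPy. The app itself relies on R's built-in functions; the function name sample_acf and the tiny alternating series are my own, purely for illustration.

```python
import numpy as np


def sample_acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                 # centre the series
    denom = np.dot(x, x)             # lag-0 sum of squares
    return np.array([
        np.dot(x[:len(x) - lag], x[lag:]) / denom
        for lag in range(max_lag + 1)
    ])


# A series that flips sign at every step: strongly negative at lag 1, positive at lag 2
series = [1, -1, 1, -1, 1, -1, 1, -1]
acf = sample_acf(series, max_lag=2)
print(acf)
```

Autocorrelations that die out slowly as the lag grows are the usual visual hint that a series is not stationary and may need differencing.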

This is a small example and hence it is simple; we could build far more complicated things. Still, for anyone working with time series, this app saves precious time otherwise spent on repetitive work.

So, to run a Shiny app, we need to write two files, one for the UI and one for the back-end processing, i.e. ui.R and server.R

Both these files begin by loading the package:

library(shiny)

Shiny offers an extensive tutorial at -> https://shiny.rstudio.com/

You can view my app at -> https://mohammedtopiwalla.shinyapps.io/arima_shiny/

My code is relatively simple too; you can view it on GitHub at -> https://github.com/mmd52/Arima_Shiny

To run the app you will need a text file with time series data; you can download a sample from -> https://github.com/mmd52/Arima_Shiny/blob/master/data.txt

 

Running Various Models on Pima Indian Diabetes Data Set

The EDA was done and various inferences were found; now we will run various models and verify whether the predictions match the inferences.

As I mentioned in the previous post, my focus is on the code and the inferences, which you can find in the Python notebooks or R files.

R
Model                  Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree          73.48         75.33          82.48       0.4368  0.727
Naïve Bayes            75.22         82             80.39       0.4489  0.723
KNN                    73.91         86.67          76.47       0.3894  0.683
Logistic Regression    76.09         82.67          81.05       0.4683  0.732
SVM Simple             73.91         86.67          76.47       0.3894  0.683
SVM 10 Folds           73.04         82.67          77.5        0.388   0.6883
SVM Linear 10 Folds    78.26         88.67          80.12       0.4974  0.7371
Random Forest          76.52         84             80.77       0.4733  0.733
XGBOOST                77.83         91.61          77.06       0.4981  0.843

Python
Model                  Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree          72.73         73             73          0.388   0.7
Naïve Bayes            80.51         80             81          0.5689  0.78
KNN                    70.99         70             71          0.337   0.66
Logistic Regression    74.45         74             74          0.3956  0.68
SVM Simple             73.16         73             73          0.4007  0.69
Random Forest          76.62         77             77          0.48    0.73
XGBOOST                79.22         79             79          0.526   0.76

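As we can see from the above tables, XGBOOST is a strong performer in both languages. In R it has the highest Kappa and AUC, although SVM Linear 10 Folds has slightly higher accuracy. In Python it comes second to Naïve Bayes, which leads on every metric.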
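If the columns in these tables are unfamiliar, the sketch below shows how accuracy, precision, recall and Cohen's Kappa fall out of a 2x2 confusion matrix. The counts are invented for illustration and are not taken from any of the models above; the function name is my own.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Kappa corrects accuracy for the agreement you would get by chance, which is why two models with similar accuracy can have quite different Kappa values.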

The Code for Python you can find at -> https://github.com/mmd52/Pima_Python

The code for R you can find at -> https://github.com/mmd52/Pima_R

Exploratory Data Analysis

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and 7 are continuous. This means we should have at least 8 plots.

The target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned.

If we need to plot two factor variables, we should preferably use a stacked bar chart or a mosaic plot.

For one numeric and one factor variable, bar plots seem like a good option.

And for two numeric variables, we have our faithful scatter plot to the rescue.

In this blog post I will not dwell on words but on the code and the inferences made, which are explained and documented in my code.

I strongly suggest you view the code below, which has the inferences and a well-documented structure.

You can download the data from

DATA -> https://github.com/mmd52/Pima_R (the file named diabetes.csv)

R Code -> https://github.com/mmd52/Pima_R/blob/master/EDA.R (A fair warning: to execute the EDA code in R, you will first need to execute https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code -> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (it's a Jupyter Notebook)

Models on UCI PIMA DataSet

The idea behind using this data set from the UCI repository is not just to run models, but to derive inferences that match the real world.

This makes our predictions all the more sensible and strong, especially when we have understood the data set and derived correct inferences from it that match our predictions.

Our approach to this data set will be to perform the following

  1. Exploratory data analysis, deriving inferences along the way
  2. Using techniques like PCA and checking the correlation between variables
  3. Running various models and making inferences from the predictions

We will do all of this in R and in Python.

The data now provided by UCI ->

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Sc

Let us begin by understanding the problem, and what better way to explain it than a short video, which you can view here -> https://youtu.be/pN4HqWRybwk

From the video we understand that the Pima Indian tribe has a gene which gets aggravated on eating food high in sugar. The UCI Pima Indian data set is a collection of data on females from the Pima tribe. Of the 768 rows in the data set, 268 have diabetes.
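Those two counts already give a useful yardstick for the models that follow. A quick sketch of the arithmetic (the variable names are mine):

```python
total_rows = 768
diabetic_rows = 268

# Share of rows with diabetes
positive_rate = diabetic_rows / total_rows

# Accuracy of a "model" that always predicts the majority class (no diabetes)
baseline_accuracy = (total_rows - diabetic_rows) / total_rows

print(round(positive_rate, 3), round(baseline_accuracy, 3))
```

Any classifier worth keeping should beat the accuracy of always predicting "no diabetes", which works out to roughly 65%.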

You can find the data set description here – > https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names

The problem statement is to correctly classify and predict whether a female has diabetes or not. Thus it's a classification problem.

The good news for us is that the data set has no null or missing values and, as the cherry on top, is completely numeric. Only the target variable Outcome and Pregnancies are factor variables. The remaining variables are continuous numeric variables.

Decision Tree and Interpretation on Telecom Data

We saw that logistic regression was a bad model for our telecom churn analysis, which leaves us with the decision tree.

Again we have two data sets: the original data and the over sampled data. We run a decision tree model on both of them and compare the results.

Running the decision tree on the normal data set yielded better results than running it on the over sampled data set:

Dataset            Accuracy  Kappa   Precision  Recall   AUC
Data               0.9482    0.7513  0.68421    0.90698  0.837
Over Sampled Data  0.8894    0.5274  0.5965     0.5862   0.7656

Unfortunately the decision tree plot was too big for me to put in this post.

As the decision tree gives the highest accuracy, we will select it as the clear winner for our telecom churn analysis problem.

Another major advantage of a decision tree is that it can be explained graphically, very easily, to the end business user, showing why a particular choice is being made.

You can find the code for the decision tree here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Decision_Tree.R

This was a dummy data set and may not have yielded the best results, but it is a perfect exercise for practice.

Logistic Regression And Interpretation On Telecom Data

If you have read my previous posts, you will have understood how feature engineering was done and why we are running a logistic regression on this data.

It is essential to understand that we have two train sets

  1. The original train set
  2. The over sampled train set

Running logistic regression on the normal data set yielded the following results

#Accuracy 87.53%
#Kappa 0.275
#Precision 22.807%
#Recall 59.091%
#AUC 0.602

Running logistic regression on the over sampled data yielded the following results

#Accuracy 84.71%
#Kappa 0.3265
#Precision 40.351%
#Recall 42.593%
#AUC 0.660

Comparing the two models with AUC as our metric, the over sampled data is clearly the winner. We will also rely on the second model more because its Kappa value is higher and its precision and recall values are closer to each other.

One massive problem we face is that the accuracy of our best model is 84.71%, while the accuracy of running no model at all and simply stating "customer retained" is 85.8%. This means our model is not as effective as we would think, so we should either try more feature engineering or a different model.

As this data is made up, it could be that our accuracy will always be bad. But let's assume logistic regression yielded a good result, and try to understand the equation.

Coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) 8.318e+00 1.078e+00 7.713 1.23e-14 ***
StateAL 1.767e-02 5.858e-01 0.030 0.975929 
StateAR -6.454e-01 6.095e-01 -1.059 0.289611 
StateAZ 6.812e-01 6.998e-01 0.973 0.330355 
StateCA -2.015e+00 5.959e-01 -3.382 0.000719 ***
StateCO -5.204e-01 5.958e-01 -0.873 0.382438 
StateCT -9.923e-01 5.768e-01 -1.720 0.085364 . 
StateDC -7.558e-01 6.655e-01 -1.136 0.256046 
StateDE -8.036e-01 5.780e-01 -1.390 0.164413 
StateFL -5.576e-01 5.832e-01 -0.956 0.339026 
StateGA -8.233e-01 5.512e-01 -1.494 0.135263 
StateHI 3.252e-01 7.279e-01 0.447 0.655050 
StateIA 2.778e-02 7.161e-01 0.039 0.969051 
StateID -4.276e-01 5.647e-01 -0.757 0.448895 
StateIL -1.270e+00 5.720e-01 -2.220 0.026441 * 
StateIN -7.517e-01 5.884e-01 -1.278 0.201372 
StateKS -1.164e+00 5.425e-01 -2.145 0.031918 * 
StateKY -7.379e-01 6.068e-01 -1.216 0.223949 
StateLA -1.080e+00 5.963e-01 -1.811 0.070173 . 
StateMA -1.541e+00 5.610e-01 -2.746 0.006032 ** 
StateMD -1.164e+00 5.565e-01 -2.092 0.036455 * 
StateME -1.915e+00 5.471e-01 -3.500 0.000465 ***
StateMI -1.501e+00 5.746e-01 -2.612 0.009011 ** 
StateMN -8.528e-01 5.486e-01 -1.555 0.120064 
StateMO 2.519e-01 6.291e-01 0.400 0.688826 
StateMS -1.467e+00 5.614e-01 -2.613 0.008987 ** 
StateMT -1.447e+00 5.473e-01 -2.644 0.008181 ** 
StateNC -8.929e-01 5.817e-01 -1.535 0.124820 
StateND -6.750e-01 6.037e-01 -1.118 0.263526 
StateNE -6.011e-01 5.911e-01 -1.017 0.309221 
StateNH -8.939e-01 6.064e-01 -1.474 0.140435 
StateNJ -1.738e+00 5.556e-01 -3.128 0.001761 ** 
StateNM -1.151e+00 5.471e-01 -2.104 0.035366 * 
StateNV -1.757e+00 5.525e-01 -3.180 0.001473 ** 
StateNY -1.080e+00 5.650e-01 -1.912 0.055908 . 
StateOH -5.434e-01 5.577e-01 -0.974 0.329891 
StateOK -1.484e+00 5.837e-01 -2.543 0.011001 * 
StateOR -4.159e-01 5.561e-01 -0.748 0.454572 
StatePA -8.248e-01 6.262e-01 -1.317 0.187836 
StateRI 4.828e-01 6.553e-01 0.737 0.461271 
StateSC -1.327e+00 5.734e-01 -2.313 0.020695 * 
StateSD -1.419e+00 5.936e-01 -2.390 0.016838 * 
StateTN -2.747e-01 5.931e-01 -0.463 0.643201 
StateTX -2.148e+00 5.466e-01 -3.929 8.53e-05 ***
StateUT -7.398e-01 5.785e-01 -1.279 0.200914 
StateVA 7.518e-01 6.311e-01 1.191 0.233547 
StateVT -4.988e-01 5.869e-01 -0.850 0.395327 
StateWA -1.369e+00 5.698e-01 -2.402 0.016308 * 
StateWI -2.333e-01 5.906e-01 -0.395 0.692830 
StateWV -4.497e-01 5.560e-01 -0.809 0.418600 
StateWY -1.921e-01 5.780e-01 -0.332 0.739637 
Account_Length -1.719e-03 1.189e-03 -1.446 0.148198 
Area_Code 1.860e-03 1.085e-03 1.714 0.086489 . 
Phone_No -1.627e-07 1.687e-07 -0.964 0.334881 
International_Plan yes -2.516e+00 1.206e-01 -20.858 < 2e-16 ***
Voice_Mail_Plan yes -1.028e-01 1.447e-01 -0.710 0.477407 
No_Vmail_Messages -2.941e-03 5.303e-03 -0.555 0.579144 
Total_Day_minutes -4.437e+00 2.775e+00 -1.599 0.109815 
Total_Day_Calls 3.982e-05 2.389e-03 0.017 0.986701 
Total_Day_charge 2.603e+01 1.632e+01 1.595 0.110808 
Total_Eve_Minutes -1.862e+00 1.418e+00 -1.313 0.189311 
Total_Eve_Calls -4.211e-03 2.379e-03 -1.770 0.076674 . 
Total_Eve_Charge 2.182e+01 1.668e+01 1.308 0.190938 
Total_Night_Minutes 9.630e-01 7.453e-01 1.292 0.196293 
Total_Night_Calls -6.086e-04 2.392e-03 -0.254 0.799175 
Total_Night_Charge -2.143e+01 1.656e+01 -1.294 0.195715 
Total_Intl_Minutes 2.219e+00 4.482e+00 0.495 0.620579 
Total_Intl_Calls 1.075e-01 2.053e-02 5.233 1.67e-07 ***
Total_Intl_Charge -8.763e+00 1.660e+01 -0.528 0.597585 
No_CS_Calls -5.475e-01 3.540e-02 -15.466 < 2e-16 ***
---

Can't read it? Well, imagine you just made this model and your boss calls up and asks: there is a customer whose state is NV and whose total calls, charges and duration are xyz. Will he leave the telecom operator? If yes, please explain.

What will you say? It's easy: you look at the above table and start. Every factor that your boss gave fits into the equation, and you can quantitatively justify your answer. All of this is thanks to the historical data.
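To make that phone call a little more concrete, here is a rough sketch of how a few rows of the table could be turned into a number. The coefficients are copied from the output above; the customer (lives in NV, has an international plan, made 4 customer service calls) is hypothetical, and every other predictor is left out to keep the sketch short, so the result is illustrative only. Which outcome the probability refers to depends on how Customer_Left was coded when the model was fitted.

```python
import math

# Coefficients copied from the table above (they are on the log-odds scale)
coef = {
    "(Intercept)": 8.318,
    "StateNV": -1.757,
    "International_Plan yes": -2.516,
    "No_CS_Calls": -0.5475,
}


def sigmoid(z):
    """Convert log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical customer: NV, international plan, 4 customer service calls.
# All other predictors are ignored here, which a real calculation must not do.
cs_calls = 4
log_odds = (coef["(Intercept)"]
            + coef["StateNV"]
            + coef["International_Plan yes"]
            + cs_calls * coef["No_CS_Calls"])
probability = sigmoid(log_odds)

# Odds ratio: how the odds change for each additional customer service call
odds_ratio_cs_call = math.exp(coef["No_CS_Calls"])

print(round(log_odds, 3), round(probability, 3), round(odds_ratio_cs_call, 3))
```

Reading the odds ratio is often the easiest way to answer the boss: each extra customer service call multiplies the odds of the modelled outcome by about 0.58, holding everything else fixed.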

For simplicity, let's consider the equation

y = 45 + 60*(age)
where y = salary
45 = intercept

How would you interpret this equation? Obviously, you would say that as age increases, so does salary. Right?

However, think again, and think hard this time. What if I told you age is 0? Now explain it to me. I'm sure you see that a newborn cannot have a salary of $45 without doing anything. This is where business understanding or domain knowledge comes into play.

We should usually avoid explaining the intercept unless business understanding helps us to explain it. This is a grey area, so it's better to avoid explaining it than to make a mess of it.

However, imagine if this same equation was for a packet of wafers

y = 45 + 0.1*(weight)

Here we could simply say that the mean weight that should be in a packet of wafers is 45 g; however, that is not always true, so a variance factor in the form of the coefficients is added.

That is why the intercept can be explained in some places and cannot be in others.

You can find the code for logistic regression here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Logistic_Regression.R

Determining Feature Importance For Telecom Data

We have a complete data set  -> Check

Feature engineering done -> Check

How many variables do we have?    20 variables

How many should we ideally use?   Not more than 10, ideally

How do we determine which variables to include and which not to?   It's simple: do Boruta!!

What's Boruta?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. Analytics Vidhya has given a pretty good explanation of it here -> https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/

Now keep one important thing in mind: we have two train sets, 1) the normal train set and 2) the SMOTE train set.

Upon running Boruta on the normal train set, Boruta confirmed the variables International_Plan, Voice_Mail_Plan, No_Vmail_Messages, Total_Day_minutes, Total_Day_charge, Total_Eve_Minutes, Total_Eve_Charge, Total_Night_Minutes, Total_Night_Charge, Total_Intl_Minutes, Total_Intl_Calls, Total_Intl_Charge and No_CS_Calls as important.

Upon running Boruta on the SMOTE data set, Boruta confirmed all the variables as important. You can find the Boruta code below:

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Boruta_Imp_FE.R
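For intuition, the core trick in Boruta is to compare each real variable against "shadow" copies whose values have been shuffled. The sketch below is a heavy simplification in Python and not the Boruta package: it uses absolute correlation as the importance score instead of Random Forest importance, a plain 90% hit threshold instead of Boruta's statistical test, and synthetic data that I made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: column 0 carries the signal, column 1 is pure noise
target = rng.normal(size=n)
features = np.column_stack([
    target + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])


def importance(column, y):
    """Stand-in importance score: absolute correlation with the target."""
    return abs(np.corrcoef(column, y)[0, 1])


n_rounds = 50
hits = np.zeros(features.shape[1])

for _ in range(n_rounds):
    # Shadow features: each column shuffled, which destroys any link to the target
    shadows = np.column_stack([rng.permutation(col) for col in features.T])
    best_shadow = max(importance(s, target) for s in shadows.T)
    # A real feature scores a hit when it beats the best shadow of this round
    for j, col in enumerate(features.T):
        if importance(col, target) > best_shadow:
            hits[j] += 1

# Simplified decision rule: confirm a feature that wins in at least 90% of rounds
confirmed = hits / n_rounds >= 0.9
print(hits, confirmed)
```

The informative column beats every shuffled copy in every round, while a noise column usually wins only now and then, which is the behaviour Boruta exploits to separate useful variables from the rest.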

Feature Engineering On Telecom Data

Although the telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values, there is a severe class imbalance.

That is why the only thing we will concentrate on in our feature engineering is eliminating class imbalance.

> summary(train$Customer_Left)
False True 
 2850 483

It's visible that there are 2850 retained customers in our training set and 483 customers who left. Because of this, I will oversample the customers who left to balance the data set.

Let us assume that I do not oversample. Then, even without building any model, I can simply say "customer retained" and still be right 85.8% of the time. To break this bias I use SMOTE (Synthetic Minority Over-sampling Technique). You can read the research paper, published in the Journal of Artificial Intelligence Research 16 (2002), here -> https://www.jair.org/media/953/live-953-2037-jair.pdf

> train$Customer_Left<-as.numeric(train$Customer_Left)
> summary(as.factor(train$Customer_Left))
 1 2 
 483 2850 
> train$Customer_Left[train$Customer_Left==2]<-0
> summary(as.factor(train$Customer_Left))
 0 1 
2850 483  
# here 0 -> customer retained (2850 rows)
# 1 -> customer left (483 rows)
> train$Customer_Left<-as.factor(train$Customer_Left)
> ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
> ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
> summary(ntrain$Customer_Left)
 0 1 
1932 1449

We have now under sampled the retained customers from 2850 to 1932 and over sampled the customers who left the operator from 483 to 1449.
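Where do 1932 and 1449 come from? The sketch below reproduces the arithmetic, assuming the SMOTE call comes from the DMwR package, where perc.under defaults to 200 and is applied to the number of synthetic rows. That default is my assumption; the call above only sets perc.over and k.

```python
minority = 483      # customers who left
majority = 2850     # customers retained

perc_over = 200     # from the SMOTE call above
perc_under = 200    # assumption: default value, not set in the call above

# perc.over = 200 creates 2 synthetic rows for every original minority row
synthetic = minority * perc_over // 100
minority_after = minority + synthetic

# perc.under = 200 keeps 2 majority rows for every synthetic row created
majority_after = synthetic * perc_under // 100

print(majority_after, minority_after)
```

Under these assumptions the numbers match the summary output above, so raising perc.under would keep more of the retained customers if the resulting split looks too aggressive.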

Since train has been manipulated, I also had to apply the same manipulation to test.

You can see the complete code here ->  https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R

NOTE -> The libraries.R file consists of code that loads the packages needed; if they are not installed on your machine, it will download and install them.