Implementation using Shiny

Now that we understand the business problem and the approach to solving it, let's implement it as a GUI with the help of Shiny.

Shiny is a great platform in R for building neat dashboards, and with the introduction of the shinydashboard package things are even neater.

On the modelling side, because of speed constraints I have used only a simple linear regression and plot its output. If we were to turn this into a business application, we could implement all the models within the framework of this code.

A real user might have to wait more than an hour to see the results, but the value and simplicity this offers the user is tremendous.

You can view the Shiny app here, and you can download the data to run this app from here.

I have uploaded the code to Git, which you can view here.

 

 


Data – Cash Forecasting

The data in question is usually highly confidential, which is one of the major reasons we will be working with dummy data.

The data we have is for a single ATM over various time periods.

Our fields are: Holiday (binary), where 1 indicates a holiday and 0 indicates a normal working day.

Up time, which records how long the ATM was up. At times an ATM may not be functional because of a power outage, network connectivity problems or physical issues.

Peak period (binary), where 1 indicates a peak period and 0 indicates a non-peak period.

Dispense, the cash dispensed on a particular day.


In general, for a normal ATM the weekly trend shows a spike in dispense at the start of the week, mostly on Monday, and a drop at the end of the week, i.e. on Saturdays and Sundays.

For the monthly trend, the spike comes at the beginning and end of the month, with a drop somewhere in between. This seems logical too. Think of salaried people, for example: they get their salaries at the end of the month and probably use them to plan the month ahead. Likewise, business owners pay their dues, or their employees' salaries, at the end of the month and therefore withdraw cash.
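To make the weekly trend concrete, here is a minimal sketch of how you might average dispense by weekday. It is written in plain Python, and the two weeks of records are made up for illustration; they are not taken from the dummy data set, and the function name is my own.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def mean_dispense_by_weekday(records):
    """Average dispensed cash per weekday from (date, amount) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, amount in records:
        name = WEEKDAYS[day.weekday()]  # weekday() returns 0 for Monday
        totals[name] += amount
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Hypothetical data: two weeks starting on Monday 2 January 2017, with a
# spike on Mondays and a drop at the weekend.
start = date(2017, 1, 2)
weekly_pattern = [900, 700, 650, 600, 620, 400, 350]  # Monday to Sunday
records = [(start + timedelta(days=i), weekly_pattern[i % 7]) for i in range(14)]

averages = mean_dispense_by_weekday(records)
busiest_day = max(averages, key=averages.get)
print(busiest_day, averages[busiest_day])
```

The same grouping idea works for the monthly trend if you group by day of month instead of weekday.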

Above you can see a few basic insights regarding cash dispense.

You can download the dummy data set that I have created from here

Cash Forecasting – Understanding difficulty associated with

Every ATM, wherever it is placed, will have dispensation trends that differ from the others. An ATM in a rural area will have a different, smaller dispensation trend than one in a busy suburban area. This means a different model for every ATM. WAIT, WHAT?? That is so much overhead that there is no way anyone is doing that, right? Well yes, no one is going to do that.

The solution to this problem is to classify ATMs by locality. Imagine creating various bands, where Band 1 is a metropolitan area where users withdraw cash frequently, and Band 5 is the lowest, where cash dispense is lowest.

That solves only a minor part of the problem: cash dispense trends still vary between ATMs in the same band. Hmm, problem still not solved. So we further categorise ATMs by their age and by how much they dispense on average.

So for our case study we will only consider ATMs that are more than 6 months old and that dispense between $0 and $100,000 on average.
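As a rough sketch of this selection step, the filter below keeps only ATMs that meet the age and average-dispense criteria. The Atm record, its field names and the four example ATMs are all hypothetical; only the two thresholds come from the text above.

```python
from dataclasses import dataclass

@dataclass
class Atm:
    atm_id: str
    band: int            # 1 = busiest locality, 5 = quietest
    age_months: int
    avg_dispense: float  # average cash dispensed, in dollars

def in_scope(atm, min_age_months=6, max_avg_dispense=100_000):
    """True if the ATM is older than min_age_months and dispenses within range."""
    return atm.age_months > min_age_months and 0 <= atm.avg_dispense <= max_avg_dispense

# Made-up ATMs for illustration.
atms = [
    Atm("A1", band=1, age_months=24, avg_dispense=80_000),
    Atm("A2", band=1, age_months=3, avg_dispense=60_000),    # too new
    Atm("A3", band=1, age_months=12, avg_dispense=150_000),  # dispenses too much
    Atm("A4", band=5, age_months=9, avg_dispense=20_000),
]

selected = [atm.atm_id for atm in atms if in_scope(atm)]
print(selected)
```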

Cash Forecasting – Overview

How do ATMs work in general? That is a great question to ask. Banks at times prefer not to manage their own ATMs, as this involves a lot of overhead: transportation of cash, maintenance of the machines, rent and, most importantly, security.

To avoid this overhead, a lot of banks outsource the task. The companies that take over this responsibility earn revenue on every transaction. Say that for every non-cash transaction at an ATM they manage they get $x, and for every cash transaction they get $y, where y > x.

So why do we need to predict cash? These companies rent a place, install their ATMs there, keep a service engineer to maintain the machines and provide enough security, but where they need to be careful is interest cost. What interest cost? Let's say that today I decide to keep $100 in my ATM. I borrow this money from a bank, and I pay the bank interest every day on the cash that customers have not withdrawn.

The obvious solution is to load ATMs with the smallest amount of money possible. However, this leads to two problems: the first is loss of revenue from potential customers who find the ATM empty, and the second is brand loss, and brand loss is very bad.

That means we do not want to load too much money, to avoid paying interest on idle cash, nor do we want to load too little, to avoid loss of revenue and brand loss. To find this balance we need a forecasting model for how much money to load into the ATMs, so that the business stays profitable.
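The trade-off can be sketched as a tiny cost function. This is only an illustration under my own assumptions: the interest rate, the revenue lost per dollar of unmet demand, the demand scenarios and the candidate loads are all made-up numbers, and brand loss is not modelled at all.

```python
def daily_cost(loaded, demand, daily_interest_rate, lost_revenue_per_dollar):
    """Cost of one day: interest on idle cash plus revenue lost to unmet demand."""
    idle_cash = max(loaded - demand, 0)
    unmet_demand = max(demand - loaded, 0)
    return idle_cash * daily_interest_rate + unmet_demand * lost_revenue_per_dollar

def best_load(candidate_loads, demand_scenarios,
              daily_interest_rate, lost_revenue_per_dollar):
    """Pick the candidate load with the lowest average cost over the scenarios."""
    def average_cost(loaded):
        costs = [daily_cost(loaded, demand, daily_interest_rate, lost_revenue_per_dollar)
                 for demand in demand_scenarios]
        return sum(costs) / len(costs)
    return min(candidate_loads, key=average_cost)

# All numbers below are hypothetical.
choice = best_load(
    candidate_loads=[800, 1000, 1200, 1400],
    demand_scenarios=[800, 1000, 1200],
    daily_interest_rate=0.0003,
    lost_revenue_per_dollar=0.01,
)
print(choice)
```

With these numbers an empty ATM costs far more per dollar than idle cash does, so the cheapest choice is to cover the highest demand scenario without going beyond it. A better forecast narrows the scenarios and therefore the amount of idle cash.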

One underlying constraint is transportation. To keep transportation costs down we cannot transport and load money into ATMs daily, which is why loading happens only once every two to three days.

Logistic Regression

If you have followed my previous post, you will know there are some common things to set up before running any kind of model in TensorFlow.

  1. Number of iterations
  2. Learning rate
  3. Cost Function

Just as with simple linear regression, we first want to understand how logistic regression works in TensorFlow, so we will take a very simple data set with, say, 2 independent variables and one dependent variable (1 or 0).

Now let's accept one complication. Some variables can have much larger values than others, so it is important to tackle this head on by normalising the entire data set.
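Here is a small worked example of what normalising does, written in plain Python so it runs without any libraries. The two columns are made up and differ in scale by a factor of 1000; after z-scoring they become identical. The function name is my own, and statistics.pstdev is used because it is the population standard deviation, which is also what np.std computes by default.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(column) for column in columns]
    sigmas = [pstdev(column) for column in columns]
    return [[(value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas)]
            for row in rows]

# Hypothetical data: the second column is 1000 times the first.
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = zscore_columns(rows)
for row in scaled:
    print(row)
```

After scaling, neither variable dominates the other just because of its units.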

Now we look at the problem systematically and define a few functions to get it up and running.

# These snippets use the TensorFlow 1.x graph API. On TensorFlow 2.x, use
# "import tensorflow.compat.v1 as tf" followed by "tf.disable_v2_behavior()".
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

def read_dataset(filePath, delimiter=','):
    # The last column is the 0/1 label; every other column is a feature.
    data = np.genfromtxt(filePath, delimiter=delimiter)
    features = np.array(data[:, 0:-1], dtype=float)
    labels = np.array(data[:, -1], dtype=int)
    return features, labels

def feature_normalize(features):
    # Z-score each column: subtract its mean and divide by its standard deviation.
    mu = np.mean(features, axis=0)
    sigma = np.std(features, axis=0)
    return (features - mu) / sigma

def append_bias_reshape(features):
    # Prepend a column of ones so the bias is learned as part of W.
    n_training_samples = features.shape[0]
    return np.c_[np.ones(n_training_samples), features]

def one_hot_encode(labels):
    # Turn integer labels 0..k-1 into rows with a single 1 in the label's column.
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    encoded = np.zeros((n_labels, n_unique_labels))
    encoded[np.arange(n_labels), labels] = 1
    return encoded

def plot_points(features, labels):
    # Plot the first two feature columns: blue crosses for label 0, red dots for label 1.
    normal = labels == 0
    outliers = labels == 1
    plt.figure(figsize=(10, 8))
    plt.plot(features[normal, 0], features[normal, 1], 'bx')
    plt.plot(features[outliers, 0], features[outliers, 1], 'ro')
    plt.xlabel('Latency (ms)')
    plt.ylabel('Throughput (mb/s)')
    plt.show()

Now that our basic framework for building a model is set, the next step is to define the learning rate, the number of iterations and the cost function.

learning_rate = 0.00001
training_epochs = 100

# n_dim is the number of feature columns after the bias column has been
# appended, e.g. n_dim = features.shape[1].
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, 2])  # one-hot labels for the 2 classes
W = tf.Variable(tf.ones([n_dim, 2]))
# tf.initialize_all_variables() is the older, deprecated name for this call.
init = tf.global_variables_initializer()

y_ = tf.nn.sigmoid(tf.matmul(X, W))
cost_function = tf.nn.l2_loss(y_ - Y, name="Squared_Error_Cost")
# Alternative cross-entropy cost:
# cost_function = tf.reduce_mean(tf.reduce_sum((-Y * tf.log(y_)) - ((1 - Y) * tf.log(1 - y_)), reduction_indices=[1]))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
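The snippet above uses a squared-error cost and keeps a cross-entropy cost as a commented-out alternative. To see how the two behave, here is a plain Python sketch (not TensorFlow code) that evaluates both for a single sigmoid output. The helper names are mine, and the cross-entropy is simplified to one output per sample.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l2_loss(predictions, targets):
    # Same definition as tf.nn.l2_loss applied to the difference: sum(t ** 2) / 2
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / 2.0

def cross_entropy(predictions, targets):
    # Mean over samples of -t*log(p) - (1 - t)*log(1 - p)
    losses = [-t * math.log(p) - (1 - t) * math.log(1 - p)
              for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# A confidently wrong prediction: the model outputs about 0.01 but the label is 1.
prediction = [sigmoid(-4.6)]
target = [1.0]

squared = l2_loss(prediction, target)
entropy = cross_entropy(prediction, target)
print(squared, entropy)
```

For a 0/1 label and a sigmoid output the squared-error term can never exceed 0.5, while the cross-entropy keeps growing as the prediction gets more confidently wrong, so it pushes harder against bad mistakes.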

After this we just need to run the model.

You can view my complete code for simple Logistic Regression here

Data set for the above code here

and logistic regression using the iris data set here

 

 

Linear Regression using TensorFlow

The best way to start something new is to start with something simple.

In our case let's do a linear regression in which we try to predict the price of a house from its size. Yes, we will use made-up data, but that's fine.

First things first: everything in TensorFlow is in the form of an array, so we begin by initialising our data as arrays.

#FOR LR
Area=np.array([[987],[452],[876],[201],[349],[195],[1000],[1501],[555],[724],
[652],[328],[895]])
price=np.array([[1974],[904],[1752],[402],[698],[390],[2000],[3002],[1110],
[1448],[1304],[656],[1790]])

Okay, so we have area and price, our x and y, both as NumPy arrays.

The next step is a crucial one. In it we determine:

  1. Number of iterations
  2. Learning rate
  3. Cost Function

Why these 3 things? We need them to find the smallest error, which we do using gradient descent.

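learning_rate = 0.01
training_epochs = 1000
# Start with an empty history; np.empty(shape=[1]) would leave one uninitialised value in it.
cost_history = np.empty(shape=[0], dtype=float)

# n_dim is the number of columns in the feature matrix train_x.
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.ones([n_dim, 1]))

# tf.initialize_all_variables() is the older, deprecated name for this call.
init = tf.global_variables_initializer()

y_ = tf.matmul(X, W)
cost = tf.reduce_mean(tf.square(y_ - Y))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)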

An important thing to note is that nothing above has actually been executed. TensorFlow objects are only evaluated when they are explicitly run in a session, so we need to do that ourselves. Until then, we define placeholders for the values that will be fed in at run time.

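So, for example, we need x, y and W for

y = W*x + b

Note that the code above has no separate b variable; the bias can be folded into W by adding a column of ones to x.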
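If the idea of defining now and executing later feels abstract, the toy sketch below mimics it in plain Python. This is only an analogy: none of these classes are TensorFlow APIs; they are hypothetical helpers that build an expression first and compute it only when evaluate is called with the fed values.

```python
class Placeholder:
    """A named input whose value is supplied only at evaluation time."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]

class Constant:
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value

class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)

x = Placeholder("x")
W = Constant(2.0)
b = Constant(1.0)
y = Add(Multiply(W, x), b)  # builds the expression only; nothing is computed yet

result = y.evaluate({"x": 3.0})  # the computation happens here
print(result)
```

The dictionary passed to evaluate plays the same role as feed_dict in a session run: it supplies the placeholder values at the moment of execution.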

Finally let us execute tensor flow

sess = tf.Session()
sess.run(init)

for epoch in range(training_epochs):
    sess.run(training_step, feed_dict={X: train_x, Y: train_y})
    cost_history = np.append(cost_history,
                             sess.run(cost, feed_dict={X: train_x, Y: train_y}))

This trains the model and records the value of the cost function after every epoch.
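To see what the optimiser is doing on each epoch, here is a plain Python sketch of gradient descent on the same area and price figures, without TensorFlow. The rescaling by 1000, the learning rate of 0.1, the 2000 epochs and the separate bias b are my own choices for this illustration, not settings from the original code.

```python
areas = [987, 452, 876, 201, 349, 195, 1000, 1501, 555, 724, 652, 328, 895]
prices = [1974, 904, 1752, 402, 698, 390, 2000, 3002, 1110, 1448, 1304, 656, 1790]

# Rescale both variables so gradient descent is stable with a moderate learning rate.
xs = [area / 1000.0 for area in areas]
ys = [price / 1000.0 for price in prices]

def train(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 1.0, 0.0
    n = len(xs)
    cost_history = []
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        cost_history.append(sum(e * e for e in errors) / n)
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, cost_history

w, b, cost_history = train(xs, ys)
print(w, b, cost_history[0], cost_history[-1])
```

In this made-up data every price is exactly twice the area, so the fitted slope approaches 2 and the bias approaches 0, while the recorded cost shrinks towards zero.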

You can find the code for this on GitHub here.

If you are looking for something with a bigger data set, you can find the code for regression on the Boston data set using TensorFlow here.

 

Introduction to TensorFlow

As we all know, there is a certain hype around TensorFlow. And if you think about it, why wouldn't there be hype about a package that Google decides to release freely?

What is TensorFlow and why did it come into existence? TensorFlow is a computational package used for machine learning. In loose terms, imagine a place where everything is in the form of a matrix, and you perform computations on these matrices to get your result.

It came into existence especially to deal with media content: training a machine to learn from images, audio or video requires a special, faster mechanism. Another amazing thing about TensorFlow is that it can make use of your GPU for training.

You can read more about TensorFlow here.

We will use TensorFlow to do some complex things but, as always, we begin from scratch.

The first thing is installing TensorFlow.

For Python ->

You can read the installation instructions here. Note: if you are using Windows, you may run into problems. If you do, try using Docker.

For R->

You can read R installation instructions here.

All the TensorFlow examples in this blog will be in Python. However, if you understand TensorFlow in Python you can easily implement it in R.

You can validate your TensorFlow installation with the following code:

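import tensorflow as tf  # TensorFlow 1.x API (tf.Session)
# With shape=(2,2) the constant is a 2x2 tensor filled with the same string.
hello = tf.constant('Hello, TensorFlow!', name="My_Tensor", shape=(2,2))
sess = tf.Session()
print(sess.run(hello))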

R Shiny Apps for Time Series

Like the name suggests, it's shiny.

Shiny is a new package from RStudio that makes it incredibly easy to build interactive web applications with R. For an introduction and live examples, visit the Shiny homepage.

Why Shiny? Well, for starters it's free and simple to use and deploy. If you plan to use Shiny commercially you will have to pay for hosting your apps, but otherwise it's free, and you can easily deploy your Shiny apps online at https://www.shinyapps.io

So why is Shiny useful to us? It can be used to build an interactive dashboard, or to let a person interact with R from a GUI, with no coding involved for the end user. Shiny apps are also highly dynamic and can be customised so the end user can tweak settings as they like.

My problem: for several time series data sets I found myself repeatedly checking a few common things. Is the data stationary? What does the data look like? Does it require a transformation? And, most importantly from a lazy man's perspective, will auto.arima do the trick 😛

I decided to automate this manual task with Shiny and demonstrate it with a small example. So what does my Shiny app do?

  1. It accepts single-column input from any text file that you feed in
  2. It asks the user whether there is a header
  3. It asks for the start year, month and frequency of the data
  4. Using this information it plots the ACF and PACF graphs
  5. It also runs auto.arima and plots the original time series, to give an understanding of the data.
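The app itself is written in R, but to show what the ACF plot in step 4 is actually measuring, here is a plain Python sketch of the usual sample autocorrelation estimator. The alternating series is made up, and the function name is my own; the band of 1.96 divided by the square root of n is the approximate 95% limit commonly drawn on ACF plots.

```python
from math import sqrt

def acf(series, max_lag):
    """Sample autocorrelation of the series for lags 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    denominator = sum((value - mu) ** 2 for value in series)
    return [
        sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag)) / denominator
        for lag in range(max_lag + 1)
    ]

# Hypothetical series that alternates between two levels.
series = [10, 20, 10, 20, 10, 20, 10, 20, 10, 20, 10, 20]
values = acf(series, 3)
band = 1.96 / sqrt(len(series))

for lag, value in enumerate(values):
    flag = "significant" if abs(value) > band else "within band"
    print(lag, round(value, 4), flag)
```

Lag 0 is always 1. For this alternating series lag 1 is strongly negative and lag 2 strongly positive, which is the kind of pattern you read off the plot before choosing a model.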

This is a small example and hence simple, but we could build much more complicated things. Even so, for anyone working with time series this app saves precious time otherwise spent on repetitive work.

To run a Shiny app we need to code two files, one for the UI and one for the back-end processing, i.e. ui.R and server.R

Both these files start by loading the shiny package:

library(shiny)

Shiny offers an extensive tutorial at -> https://shiny.rstudio.com/

You can view my app at -> https://mohammedtopiwalla.shinyapps.io/arima_shiny/

My code is relatively simple too. You can view it on GitHub at -> https://github.com/mmd52/Arima_Shiny

To run the app you will need a text file with time series data. You can download a sample from -> https://github.com/mmd52/Arima_Shiny/blob/master/data.txt

 

Running Various Models on Pima Indian Diabetes data set

The EDA is done and various inferences have been drawn. Now we will run various models and verify whether the predictions match those inferences.

As I mentioned in the previous post, my focus is on the code and inferences, which you can find in the Python notebooks or R files.

R

Model                Accuracy  Precision  Recall  Kappa   AUC
Decision Tree        73.48     75.33      82.48   0.4368  0.727
Naïve Bayes          75.22     82         80.39   0.4489  0.723
KNN                  73.91     86.67      76.47   0.3894  0.683
Logistic Regression  76.09     82.67      81.05   0.4683  0.732
SVM Simple           73.91     86.67      76.47   0.3894  0.683
SVM 10 Folds         73.04     82.67      77.5    0.388   0.6883
SVM Linear 10 Folds  78.26     88.67      80.12   0.4974  0.7371
Random Forest        76.52     84         80.77   0.4733  0.733
XGBOOST              77.83     91.61      77.06   0.4981  0.843

Python

Model                Accuracy  Precision  Recall  Kappa   AUC
Decision Tree        72.73     73         73      0.388   0.7
Naïve Bayes          80.51     80         81      0.5689  0.78
KNN                  70.99     70         71      0.337   0.66
Logistic Regression  74.45     74         74      0.3956  0.68
SVM Simple           73.16     73         73      0.4007  0.69
Random Forest        76.62     77         77      0.48    0.73
XGBOOST              79.22     79         79      0.526   0.76

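As we can see from the above tables, XGBOOST was among the strongest models in both languages. In R it had the best precision, kappa and AUC, although SVM Linear 10 Folds was slightly more accurate (78.26 vs 77.83). In Python it came second to Naïve Bayes on every metric.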
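For readers less familiar with these columns, here is a plain Python sketch of how accuracy, precision, recall and Cohen's kappa follow from a confusion matrix. The counts are made up for illustration and have nothing to do with the results above; the function name is my own.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}

# Hypothetical counts: 40 true positives, 10 false positives,
# 5 false negatives and 45 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Kappa corrects accuracy for the agreement you would get by chance, which is why it can rank models differently from raw accuracy.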

The Code for Python you can find at -> https://github.com/mmd52/Pima_Python

The code for R you can find at -> https://github.com/mmd52/Pima_R

Exploratory Data Analysis

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and 7 are continuous. This means we should have at least 8 plots.

The target variable, Outcome, should be plotted against each independent variable if we want to derive inferences and leave no stone unturned.

If we need to plot 2 factor variables, we should preferably use a stacked bar chart or mosaic plot.

For one numeric and one factor variable, bar plots seem like a good option.

And for two numeric variables we have our faithful scatter plot to the rescue.

In this blog post I will not dwell much on words but more on the code and the inferences made, which are well explained and documented in my code.

I strongly suggest you view the code below, which contains the inferences and has a well-documented structure.

You can download the data from

DATA -> https://github.com/mmd52/Pima_R (the file named diabetes.csv is the one)

R Code -> https://github.com/mmd52/Pima_R/blob/master/EDA.R (A fair warning: to execute the EDA code in R you will first need to execute https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code -> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (It's a Jupyter Notebook)