
Image Classification – Pastas

What do you do when you are new in Italy and unable to determine what kind of pasta is being served to you?

The solution is simple – put on your nerd cap and let the machine distinguish it for you!

So what data do we have? We have images of 4 different kinds of pasta, with around 1000 images per type:

  • Ragu
  • Carbonara
  • Lasagna
  • Gnocchi


Our task is to load the images, convert them into matrices of numbers (possibly reshaping the matrices with some feature engineering), and classify the pastas.

First of all you can download the data from here

The complete code is here

So what do we need to do?

  1. First we need to read all the images into Python; to do this we iterate over the food folders (see the sketch after this list)
  2. Once the images are loaded we convert them into numerical matrices (after all, an image is just numeric pixel values representing colours)
  3. We also reshape the data, dropping some unnecessary pixel values
  4. Great, so now we have our data – time to split it into training and test sets
  5. Finally we run different kinds of SVM models, however we cannot exceed 48% accuracy 😦
  6. But no reason to be upset – artificial neural networks to the rescue
  7. What are ANNs? Artificial neural networks are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems which are intended to replicate the way that we humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize.
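Here is a minimal sketch of steps 1–5, assuming the images sit in one sub-folder per pasta type and using Pillow and scikit-learn; the folder layout, image size and SVM kernel are illustrative assumptions, not the exact settings from the notebook.

    # A minimal sketch, assuming one sub-folder per pasta type
    # (e.g. data/ragu, data/carbonara, ...) and roughly equal image counts.
    import os
    import numpy as np
    from PIL import Image
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    DATA_DIR = "data"      # hypothetical folder holding one sub-folder per class
    IMG_SIZE = (64, 64)    # shrink images so the flattened vectors stay manageable

    X, y = [], []
    for label in os.listdir(DATA_DIR):
        class_dir = os.path.join(DATA_DIR, label)
        if not os.path.isdir(class_dir):
            continue
        for fname in os.listdir(class_dir):
            img = Image.open(os.path.join(class_dir, fname)).convert("RGB").resize(IMG_SIZE)
            X.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)  # flatten to 1-D
            y.append(label)

    X, y = np.array(X), np.array(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    svm = SVC(kernel="rbf")    # try different kernels here
    svm.fit(X_train, y_train)
    print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))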

ANNs were able to give us 60% accuracy, which is a significant increase over the SVMs.
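As a rough illustration of the ANN step, here is a small Keras multi-layer perceptron trained on the same flattened features; the architecture and training settings are assumptions and may differ from the original notebook.

    # A minimal ANN sketch with Keras (assumed to be installed), reusing
    # X_train / X_test / y_train / y_test from the previous sketch.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.utils import to_categorical
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    y_train_enc = to_categorical(encoder.fit_transform(y_train))
    y_test_enc = to_categorical(encoder.transform(y_test))

    model = Sequential([
        Dense(256, activation="relu", input_shape=(X_train.shape[1],)),
        Dense(128, activation="relu"),
        Dense(y_train_enc.shape[1], activation="softmax"),  # one output per pasta type
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train_enc, epochs=20, batch_size=32, validation_split=0.1)
    print("ANN accuracy:", model.evaluate(X_test, y_test_enc, verbose=0)[1])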

However, in order to boost our accuracy, we now try converting our images from colour to grayscale and highlighting distinctive shapes or features in each image. This technique is known as Histogram of Oriented Gradients (HOG).
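A sketch of that feature extraction with scikit-image, reusing the flattened arrays from the earlier sketch; the HOG parameters shown are illustrative defaults, not necessarily the ones used here.

    # Grayscale + HOG features with scikit-image (assumed available).
    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog

    def hog_features(flat_image, img_size=(64, 64)):
        """Reshape a flattened RGB image to 2-D, grayscale it, and extract HOG features."""
        img = flat_image.reshape(img_size[0], img_size[1], 3)
        gray = rgb2gray(img)
        return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

    X_train_hog = np.array([hog_features(x) for x in X_train])
    X_test_hog = np.array([hog_features(x) for x in X_test])
    # These HOG features can then be fed to the same SVM / ANN models as before.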

However this didn’t help us get better results.

One shortcut we could have used is a pre-trained neural network such as VGG16: fine-tune it on our data and get better results.
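For completeness, here is a sketch of that transfer-learning idea with VGG16 in Keras; this is not the approach actually used in the post, and it requires keeping the images as 2-D arrays rather than flattening them.

    # Transfer learning sketch: frozen VGG16 base plus a small classifier head.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, Flatten

    base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
    base.trainable = False                   # freeze the pre-trained convolutional layers

    x = Flatten()(base.output)
    x = Dense(128, activation="relu")(x)
    out = Dense(4, activation="softmax")(x)  # 4 pasta classes

    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(...) on images kept in shape (n, 64, 64, 3)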

So what are you waiting for – Buon Appetito!!!!


Text Mining – What's Cooking?

What's cooking?
An interesting data set from Kaggle where each row is a unique dish belonging to one cuisine, together with that dish's set of ingredients.

The complete code is here

For example –
{
"id": 10259,
"cuisine": "greek",
"ingredients": [
"romaine lettuce",
"black olives",
"grape tomatoes",
"garlic",
"pepper",
"purple onion",
"seasoning",
"garbanzo beans",
"feta cheese crumbles"
]
}

The data set covers 20 different cuisines, so based on the ingredients can we predict the cuisine? Yes we can, but unlike other classification problems we have just one feature column: ingredients (a text column).


It's time to draw the weapon of text mining. In this example there are 2 techniques I would like to highlight:

  1. Count Vectorizer – This method essentially builds a gigantic matrix of all the words, recording their counts for every dish (yes, you guessed it, a sparse matrix) – see the sketch after this list
  2. Term Frequency – Inverse Document Frequency (TF-IDF) – This works like the count vectorizer, but rather than using raw counts within a single dish it also looks at how often an ingredient appears across all the dishes, down-weighting the common ones. This highlights a dish that uses a unique or rare ingredient
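A minimal sketch of both vectorizers with scikit-learn; joining each ingredient list into one string and fitting a quick logistic regression on top are my assumptions, not the exact notebook code.

    # Count vectorizer vs. TF-IDF on the What's Cooking data.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_json("train.json")                  # columns: id, cuisine, ingredients
    df["text"] = df["ingredients"].apply(" ".join)   # flatten each ingredient list into one string

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["cuisine"], test_size=0.2, random_state=42)

    for name, vec in [("counts", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
        Xtr = vec.fit_transform(X_train)             # sparse matrix: dishes x words
        Xte = vec.transform(X_test)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        print(name, "accuracy:", clf.score(Xte, y_test))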

That being done we simply need to pass this into our machine learning models and view the results.

But wait! Not so quickly:

  1. Do we have any missing values? – NO!
  2. Do we have any strange data types? – YES, JSON (we need to flatten it – see the sketch after this list)
  3. Do we understand the data and get insights from it? – NOPE, we need to do EDA
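One possible way to flatten the JSON for EDA with pandas; this is an assumption about the approach, and the notebook may do it differently.

    # Flatten the nested ingredient lists into one row per (dish, ingredient) pair.
    import pandas as pd

    raw = pd.read_json("train.json")      # one row per dish, ingredients stored as a list
    flat = raw.explode("ingredients")     # one row per (dish, ingredient) pair

    # Quick EDA-style summaries on the flattened frame.
    print(flat.groupby("cuisine")["ingredients"].nunique().sort_values(ascending=False))
    print(flat["ingredients"].value_counts().head(10))   # most common ingredients overall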

Some EDA Results-

 


Finally we run the following algorithms (click the link for the code):

  1. Decision tree
  2. Random Forests
  3. Logistic Regression
  4. XGBOOST
  5. SVM
  6. Neural Networks

We get the highest accuracy from SVM – however it's quite time consuming. Logistic regression gave good accuracy in the shortest time.

The complete code is here

Classification Models – Employee Attrition

Modeling for prediction

In order to find a model which could help with the prediction process, we ran several data mining models.


 

From the previous results it's clear that the decision tree stole the show!

However, let's think practically:

  • It is often required to explain to the business why we think a person could leave, so we need a model whose output we can explain – in our case a decision tree or logistic regression
  • Sometimes HR would just like to run our model on arbitrary data sets, so it is not always possible to balance our data using techniques like SMOTE
  • Our model should do more than just beat random guessing, but imagine the cost of entertaining an employee who was not going to leave yet was flagged by our system – reducing such false positives is a future improvement for our model
  • The XGBoost model created a nice ensemble of trees for us, whose accuracy could overtake the decision tree if we get more data

 

We successfully created an early warning system which immediately tells the Human Resources department whether an employee is prone to leave or not.

We built this early warning system on several data mining techniques in order to be as accurate as possible at supervised classification modelling.

EDA and Data Cleaning

Well the data is here

So we first start with EDA

  • The data is imbalanced by class: 83% of employees have not left the company and 17% have
  • The ages of the IBM employees in this data set are concentrated between 25 and 45 years
  • Attrition is more common in the younger age groups and more likely among females. As expected, it is also more common amongst single employees
  • People who leave the company got fewer opportunities to travel for the company
  • People with very high education levels tend to have lower attrition
  • The correlation plot was as expected
  • The link to the EDA workbook in Python is here
  • From the Tableau plots we can conclude that the below-mentioned categories have higher attrition rates:
    • The Sales department, among all the departments
    • Human Resources and Technical Degree in the education field
    • Singles in marital status (will not use this due to GDPR)
    • Males in comparison to females in gender (will not use this due to GDPR)
    • Employees with job satisfaction value 1
    • Job level 1
    • Work-life balance value 1
    • Employees staying far from the workplace
    • Environment satisfaction value 1

 

First of all, we have categorical data, and if we want to run machine learning algorithms in Python we need to convert the nominal categorical variables into dummy variables and the ordinal ones into integer values.
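A minimal pandas sketch of that conversion; the file name, column names and ordinal mapping below are illustrative picks from the IBM attrition data, not necessarily the ones used in the notebook.

    # Nominal -> dummy variables, ordinal -> integer codes.
    import pandas as pd

    df = pd.read_csv("attrition.csv")        # hypothetical file name

    # Nominal columns: one 0/1 dummy column per category
    df = pd.get_dummies(df, columns=["Department", "JobRole"], drop_first=True)

    # Ordinal column: integer codes that preserve the natural ordering
    travel_order = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
    df["BusinessTravel"] = df["BusinessTravel"].map(travel_order)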

Once we are done with that, we need to embrace the fact that our data is imbalanced, so in order to equalize the class balance we make use of the Synthetic Minority Oversampling Technique (SMOTE). You can Google it for more details.
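A sketch of SMOTE with the imbalanced-learn package, applied to the training split only (the usual practice); the target column name and split settings are assumptions – the notebook linked below has the actual code.

    # Oversample the minority class on the training data with SMOTE.
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    # df from the previous sketch; make sure every remaining categorical column is numeric
    X = pd.get_dummies(df.drop("Attrition", axis=1))
    y = df["Attrition"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
    print(y_train.value_counts(), y_train_bal.value_counts(), sep="\n")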

The code file is located here for your reference ->   https://github.com/mmd52/3XDataMining/blob/master/DataCleaning_And_Smote.ipynb

Implementation using Shiny

Well, we have understood the business problem and the approach to solve it, so now let's implement it in a GUI way with the help of Shiny.

Shiny is a great platform in R for making neat dashboards, and with the introduction of shinydashboard things are even neater.

Coming to the modelling end, due to speed constraints I have used only a simple linear regression and am plotting its output; but if we were to make this a business application we could implement all the models using the framework of this code.

If a real user were to use it, they might have to wait more than an hour to see their result, but the value and simplicity it delivers for the user are tremendous.

You can view the Shiny app here, and you can download the data to run this app from here.

I have uploaded the code to GitHub, which you can view here.

 

 

Data – Cash Forecasting

Now, the data we are talking about is usually highly confidential, which is one of the major reasons we will be working with dummy data.

The data we have is that of a single ATM, for various time periods.

Our fields are: Holiday (binary), where 1 indicates a holiday and 0 indicates a normal working day.

Uptime defines how long the ATM was up. At times ATMs may not be functional because of power outages, network connectivity or physical issues.

Peak period (binary), where 1 indicates a peak period and 0 indicates a non-peak period.

Dispensed cash is the cash dispensed on a particular day.


Now, in general, for a normal ATM the weekly trend would be a spike in dispensation at the start of the week, mostly on Monday, and a drop towards the end of the week, i.e. Saturdays and Sundays.

For the monthly trend, the spike would be at the beginning and end of the month, with a drop somewhere in between. And this seems logical too: think of salaried people as an example – they get their salaries at the end of the month and probably use them to plan the month ahead; or businessmen, who would like to pay their dues or their employees' salaries at the end of the month and therefore withdraw cash.

These trends already give us a few basic insights regarding cash dispensation.

You can download the dummy data set that I have created from here

Cash Forecasting – Understanding the Difficulties Involved

Every ATM, wherever it is placed, will have dispensation trends different from the others. An ATM in a rural area would have a different and smaller dispensation trend compared to one in a busy suburban area. This means a different model for every ATM. WAIT, WHAT?? So much overhead – no way anyone's doing that, right? Well yes, no one is going to do that.

The solution to this problem is classifying ATMs based on their locality. Imagine creating various bands, where Band 1 is a metropolitan area from which users dispense cash frequently, and Band 5 is the lowest band, where cash dispensation is lowest.

Now that solves a very minor problem, but cash dispensation trends would still waver even within the same band. HMM... problem still not solved. So we further categorise ATMs based on their age and how much they dispense on average.

So for our case study we will only be considering ATMs that are older than 6 months and on average dispense between $0 and $100,000.

Cash Forecasting – Overview

How do ATMs work in general? That is a great question to ask. Well, banks at times prefer not to manage their ATMs, as it involves a lot of overhead such as transportation of cash, maintenance of the machines, rent and, most importantly, security.

In order to avoid this overhead, a lot of banks outsource the task. The companies who take over this responsibility make their revenue from every transaction made. Say, for every non-cash transaction at an ATM managed by them they get $x, and for every cash transaction they get $y, where y > x.

So why do we need to predict cash?? Well, these companies rent a place, install their ATMs there, keep a service engineer to maintain each machine and provide enough security, but where they need to be careful is the interest cost. What interest cost? Let's say for today I decided to keep $100 in my ATM. I would borrow this money from a bank, to whom I would pay interest every day on the cash that is not withdrawn by customers.

The obvious solution is to load ATMs with the smallest amount of money possible; however, this leads to two problems. The first is loss of revenue from a potential customer, and the second is brand damage – and brand damage is very bad.

That means we do not want to load too much money, to avoid paying interest on idle cash, and neither do we want to load too little, to avoid loss of revenue and brand damage. To find this balance we need to create a forecasting model for how much money to load into the ATMs, in order to make the business profitable.

One underlying constraint is transportation. We cannot transport and load money into ATMs on a daily basis without incurring heavy transportation costs, which is why replenishment happens only once every two to three days.

R Shiny Apps for Time Series

Like the name suggests – shinyyy!

Shiny is a new package from RStudio that makes it incredibly easy to build interactive web applications with R. For an introduction and live examples, visit the Shiny homepage.

Why Shiny? Well, for starters it's free and simple to use and deploy. If you are planning to use Shiny commercially you will have to pay for hosting your apps, but for the rest it's free and you can easily deploy your Shiny apps online at https://www.shinyapps.io

So why is Shiny useful to us? It can be used to build an interactive dashboard or to let a person interact with R from a GUI, with no coding required from the end user. Shiny apps are also highly dynamic and can be customised so that the end user can tweak settings as they like.

My problem: for several time series data sets I faced the problem of repetitively checking a few common things – is the data stationary or not? What does the data look like? Does it require a transformation? And, most importantly from a lazy man's perspective, will auto.arima do the trick 😛

I decided to automate this manual task via Shiny and demonstrate a small example. So what does my Shiny app do?

  1. It accepts a single-column input from any text file that you feed in
  2. It asks the user whether there is a header
  3. It asks for the start year, month and frequency of the data
  4. Using this information it plots ACF and PACF graphs
  5. It also runs auto.arima and plots the raw time series, to give a quick understanding of the data (a rough Python equivalent of these checks is sketched after this list)
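Purely as an illustration, here is a rough Python equivalent of the checks the app automates (the app itself is written in R with Shiny); it assumes statsmodels and pmdarima are installed and that data.txt holds one value per line.

    # Plot the series, its ACF/PACF, check stationarity, and fit an auto-ARIMA.
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.stattools import adfuller
    import pmdarima as pm

    series = pd.read_csv("data.txt", header=None).iloc[:, 0]

    fig, axes = plt.subplots(3, 1, figsize=(8, 9))
    axes[0].plot(series)              # the raw time series
    plot_acf(series, ax=axes[1])      # ACF
    plot_pacf(series, ax=axes[2])     # PACF
    plt.show()

    print("ADF p-value:", adfuller(series)[1])               # rough stationarity check
    print(pm.auto_arima(series, seasonal=False).summary())   # the auto.arima counterpart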

This is a small example and hence it is simple; however, we could build much more complicated things. Still, for anyone performing time series analysis this app saves precious time, freeing it up for the non-trivial work.

So to run a Shiny app we need to code two files, one for the UI and one for the back-end processing, i.e. ui.R and server.R.

Both these files begin by loading the Shiny package:

library(shiny)

Shiny offers an extensive tutorial at -> https://shiny.rstudio.com/

You can view my app at -> https://mohammedtopiwalla.shinyapps.io/arima_shiny/

My code is relatively simple too; you can view it on GitHub at -> https://github.com/mmd52/Arima_Shiny

To run the app you will need a text file with time series data; you can download a sample from -> https://github.com/mmd52/Arima_Shiny/blob/master/data.txt

 

Running Various Models on the Pima Indian Diabetes Data Set

EDA was done and various inferences were found; now we will run various models and verify whether the predictions match those inferences.

As I have mentioned in the previous post, my focus is on the code and inference, which you can find in the Python notebooks or R files.

R

Model                 Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree         73.48         75.33          82.48       0.4368  0.727
Naïve Bayes           75.22         82             80.39       0.4489  0.723
KNN                   73.91         86.67          76.47       0.3894  0.683
Logistic Regression   76.09         82.67          81.05       0.4683  0.732
SVM Simple            73.91         86.67          76.47       0.3894  0.683
SVM 10 Folds          73.04         82.67          77.5        0.388   0.6883
SVM Linear 10 Folds   78.26         88.67          80.12       0.4974  0.7371
Random Forest         76.52         84             80.77       0.4733  0.733
XGBoost               77.83         91.61          77.06       0.4981  0.843

Python

Model                 Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree         72.73         73             73          0.388   0.7
Naïve Bayes           80.51         80             81          0.5689  0.78
KNN                   70.99         70             71          0.337   0.66
Logistic Regression   74.45         74             74          0.3956  0.68
SVM Simple            73.16         73             73          0.4007  0.69
Random Forest         76.62         77             77          0.48    0.73
XGBoost               79.22         79             79          0.526   0.76

As we can see from the above tables, XGBoost was the clear winner in R, while in Python it came a close second on accuracy behind Naïve Bayes.
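For reference, here is a hedged sketch of how the Python-side XGBoost numbers could be reproduced; the file name, target column, split and hyper-parameters are assumptions rather than the exact notebook settings.

    # Fit XGBoost on the Pima data and report the metrics used in the tables above.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
    from xgboost import XGBClassifier

    df = pd.read_csv("pima-indians-diabetes.csv")     # hypothetical file name
    X, y = df.drop("Outcome", axis=1), df["Outcome"]  # "Outcome" assumed to be the target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    model.fit(X_train, y_train)

    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print("Accuracy:", accuracy_score(y_test, pred))
    print("Kappa   :", cohen_kappa_score(y_test, pred))
    print("AUC     :", roc_auc_score(y_test, proba))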

The code for Python can be found at -> https://github.com/mmd52/Pima_Python

The code for R can be found at -> https://github.com/mmd52/Pima_R