
Running Various Models on the Pima Indian Diabetes Data Set

EDA is done and various inferences have been found; now we will run several models and verify whether the predictions match those inferences.

As mentioned in the previous post, my focus is on the code and the inferences, which you can find in the Python notebooks and R files.

R
Model                  Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree          73.48         75.33          82.48       0.4368  0.727
Naïve Bayes            75.22         82             80.39       0.4489  0.723
KNN                    73.91         86.67          76.47       0.3894  0.683
Logistic Regression    76.09         82.67          81.05       0.4683  0.732
SVM Simple             73.91         86.67          76.47       0.3894  0.683
SVM 10 Folds           73.04         82.67          77.5        0.388   0.6883
SVM Linear 10 Folds    78.26         88.67          80.12       0.4974  0.7371
Random Forest          76.52         84             80.77       0.4733  0.733
XGBoost                77.83         91.61          77.06       0.4981  0.843
Python
Model                  Accuracy (%)  Precision (%)  Recall (%)  Kappa   AUC
Decision Tree          72.73         73             73          0.388   0.7
Naïve Bayes            80.51         80             81          0.5689  0.78
KNN                    70.99         70             71          0.337   0.66
Logistic Regression    74.45         74             74          0.3956  0.68
SVM Simple             73.16         73             73          0.4007  0.69
Random Forest          76.62         77             77          0.48    0.73
XGBoost                79.22         79             79          0.526   0.76

As we can see from the tables above, XGBoost was the clear winner for both languages.

The Python code can be found at -> https://github.com/mmd52/Pima_Python

The R code can be found at -> https://github.com/mmd52/Pima_R


Exploratory Data Analysis

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and seven are continuous. This means we should have at least 8 plots.

The target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned.

So if we need to plot two factor variables, we should preferably use a stacked bar chart or a mosaic plot.

For one numeric and one factor variable, bar plots seem like a good option.

And for two numeric variables we have our faithful scatter plot to the rescue.
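As a rough illustration, here is a minimal EDA sketch in R. It assumes diabetes.csv from the repo is in the working directory and uses the standard column names (Glucose, BMI, Outcome); the plots chosen here are just one option, the actual notebooks use their own.

diabetes <- read.csv("diabetes.csv")
diabetes$Outcome <- as.factor(diabetes$Outcome)

# The factor target on its own: a simple bar chart of its distribution
barplot(table(diabetes$Outcome), main = "Outcome distribution")

# Numeric vs factor: mean Glucose per Outcome level shown as a bar plot
barplot(tapply(diabetes$Glucose, diabetes$Outcome, mean),
        main = "Mean Glucose by Outcome")

# Numeric vs numeric: scatter plot, coloured by the target
plot(diabetes$Glucose, diabetes$BMI, col = diabetes$Outcome,
     xlab = "Glucose", ylab = "BMI", main = "Glucose vs BMI by Outcome")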

In this blog post I will not stress words so much as the code and the inferences made, which are explained and documented in the code itself.

I strongly suggest you view the code below, which has inferences and a well documented structure.

You can download the data from

DATA -> https://github.com/mmd52/Pima_R (the file named diabetes.csv)

R Code -> https://github.com/mmd52/Pima_R/blob/master/EDA.R (A fair warning: to execute the EDA code in R you will first need to run https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code -> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (It's a Jupyter notebook)

Paper On Using Various Classifier Algorithms and Scaling Up Accuracy Of A Model

Machine Learning on UCI Adult Data Set Using Various Classifier Algorithms and Scaling Up the Accuracy Using Extreme Gradient Boosting

Revised Approach To UCI ADULT DATA SET

If you have seen the posts in the UCI Adult data set section, you may have realised that I was not getting above 86% accuracy.

An important thing I learnt the hard way was to never eliminate rows from a data set. It's fine to eliminate columns with more than 30% NA values, but never eliminate rows.

Because of this I had to redo my feature engineering. So how did I fix the missing NA values? What I did was open my data set in Excel and convert all '?' values to 'NA'.

This made the feature engineering simpler. The next step was to identify the columns with missing values and check whether more than 30% of their values were missing in total (a short sketch of how to check this in R follows the counts below).

In our case type_employer had 1836 missing values

occupation had a further 1843 missing values

and country had 583 missing values.
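A minimal sketch of how those per-column counts and proportions can be checked in R, assuming the '?' entries have already been recoded to NA in a data frame called data:

colSums(is.na(data))                        # missing values per column
round(colMeans(is.na(data)) * 100, 2)       # percentage missing per column
names(data)[colMeans(is.na(data)) > 0.30]   # columns above the 30% threshold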

So what I did was predict the missing values with the help of the other independent variables (no, I did not use income to predict them). Once the model was built, I used it to replace the missing values in those columns. Thus I had a clean data set with no missing values.

I admit the predictions were not that great, but they were tolerable.
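A hedged sketch of that imputation idea, using rpart as a stand-in for whatever model was actually used (the column names are the ones defined in the read.table call further below; income is deliberately left out of the predictors):

library(rpart)

# Fit a classification tree on the rows where occupation is known
known <- data[!is.na(data$occupation), ]
known$occupation <- as.factor(known$occupation)
occ_model <- rpart(occupation ~ age + type_employer + education + marital +
                     relationship + race + sex + hr_per_week + country,
                   data = known, method = "class")

# Predict the missing rows and write the labels back
missing_idx <- is.na(data$occupation)
data$occupation[missing_idx] <- as.character(
  predict(occ_model, newdata = data[missing_idx, ], type = "class"))

The same pattern can be repeated for type_employer and country.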

Because of this, when I ran the following models my accuracy skyrocketed:

  1. Logistic Regression -> 85.38%
  2. Random Forest(Excluding variable country)  -> 87.11%
  3. SVM -> 85.8%
  4. XGBOOST with 10 folds -> 87.08%

Continue reading “Revised Approach To UCI ADULT DATA SET”

Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyper-plane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyper-plane which categorises new examples.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

When data are not labelled, supervised learning is not possible, and an unsupervised learning approach is required; this attempts to find a natural clustering of the data into groups, and then maps new data onto those groups. The clustering algorithm which provides an improvement to support vector machines is called support vector clustering, and it is often used in industrial applications either when data are not labelled or when only some data are labelled, as a preprocessing step for a classification pass.

SVMs can make some amazing predictions. For example, when you use the tune function you can specify a range of cost and epsilon values, and tune automatically picks the SVM model with the least possible error for us.
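A hedged sketch of what that looks like with the e1071 package (the names training_data and income, and the exact ranges, are assumptions, not the original code):

library(e1071)

# Grid search over cost and epsilon; tune() cross-validates each combination
# and keeps the model with the lowest error
tuned <- tune(svm, income ~ ., data = training_data,
              ranges = list(cost = 10^(-1:2), epsilon = seq(0, 1, 0.25)))
best_svm <- tuned$best.model
summary(tuned)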

Support vector machine algorithms can be very computationally intensive, and in our case they are, given the large number of data rows. It took my machine 10 hours to process the model completely.

 

Extreme Gradient Boosting

The term 'boosting' refers to a family of algorithms which convert weak learners into strong learners.

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify 'spam' and 'not spam' emails using criteria like the following. If:

  1. The email has only one image file (a promotional image), it's SPAM
  2. The email body consists of a sentence like "You won a prize money of $ xxxxxx", it's SPAM
  3. The email is from a known source, not SPAM

Above, we've defined multiple rules to classify an email as 'spam' or 'not spam'. But do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as 'spam' or 'not spam'. Therefore, these rules are called weak learners. To convert weak learners into a strong learner, we combine the prediction of each weak learner to form one definitive strong learner.
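A minimal sketch of that idea with the xgboost package in R: each boosting round adds another shallow tree (a weak learner) that concentrates on the mistakes of the trees before it, and the sum of all the trees acts as the strong learner. The names x_train and y_train are placeholders for a numeric feature matrix and 0/1 labels.

library(xgboost)

dtrain <- xgb.DMatrix(data = x_train, label = y_train)

# Many shallow trees (max_depth = 2) combined into one strong learner
bst <- xgboost(data = dtrain, objective = "binary:logistic",
               max_depth = 2, eta = 0.1, nrounds = 200, verbose = 0)

pred <- as.numeric(predict(bst, x_train) > 0.5)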

Continue reading “Extreme Gradient Boosting”

Stacking on Numeric Data Sets

As is human nature, we always want a better prediction; some would, if possible, pray for a full 100%.

Anyway, ignoring the hypothetical, we have run a number of common models:

1)Logistic Regression

2)Random Forest

3)SVM

So now the question arises: can we give them a little extra push for better accuracy?
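One way to do that push is stacking: take the predictions of the base models and feed them, as new features, to a simple meta-learner. A hedged sketch of the idea (training_data, stack_data and a two-level factor income are assumptions; the actual post may combine the models differently):

library(randomForest)
library(e1071)

# Base models fitted on the training split
glm_fit <- glm(income ~ ., family = binomial, data = training_data)
rf_fit  <- randomForest(income ~ ., data = training_data)
svm_fit <- svm(income ~ ., data = training_data, probability = TRUE)

# Their predictions on a held-out split become the meta-features
stack_df <- data.frame(
  p_glm = predict(glm_fit, newdata = stack_data, type = "response"),
  p_rf  = predict(rf_fit,  newdata = stack_data, type = "prob")[, 2],
  p_svm = attr(predict(svm_fit, newdata = stack_data, probability = TRUE),
               "probabilities")[, 2],
  income = stack_data$income)

# A simple logistic meta-learner on top of the base-model predictions
meta_fit <- glm(income ~ ., family = binomial, data = stack_df)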

Continue reading “Stacking on Numeric Data Sets”

Logistic Regression, Random Forest, SVM on Numerical Data Set

So it's been a long time. We have finally got the data just how we want it.

Great, so the data is ready and we already have a bit of knowledge of logistic regression and random forests.

So going ahead first with Logistic Regression-

logit.fit = glm(income ~ ., family = binomial(logit), data = training_data)

On executing this magic line I land at an accuracy of 80%. Naaaaah, not what we wanted.
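For reference, a sketch of how that 80% could be computed from the fitted model, assuming testing_data holds the held-out rows and income is a two-level factor:

# glm models the probability of the second factor level of income
pred_prob  <- predict(logit.fit, newdata = testing_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5,
                     levels(testing_data$income)[2],
                     levels(testing_data$income)[1])
mean(pred_class == testing_data$income)   # accuracy on the held-out data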

So, going ahead with random forest:

library(randomForest)
# Search for a good mtry value first, then fit the forest
bestmtry <- tuneRF(training_data[, -14], as.factor(training_data[, 14]),
                   ntreeTry = 100, stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE)
rf.fit <- randomForest(income ~ ., data = training_data,
                       mtry = 4, ntree = 1000, keep.forest = TRUE, importance = TRUE)

Yes !!

This finally returned 86%; it looks like we are doing great. We finally did it!

Trying out SVM now.

But wait, what is an SVM (support vector machine)?

Think of all the data points plotted in a space that we can't visualise. But imagine we had 2D data; then, in very vague terms, an SVM would draw lines for us that help us clearly classify whether a data point belongs to the group earning above 50K or below 50K.

An SVM finds hyperplanes; these planes are calculated in such a way that they are equidistant from both classes.

In SVM terms, a plane with maximum margin is a good plane and a plane with minimum margin is a bad one.
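A minimal sketch of fitting such a classifier with e1071, where the cost parameter trades margin width against training errors (training_data and testing_data are assumptions, as before):

library(e1071)

svm.fit  <- svm(income ~ ., data = training_data, kernel = "radial", cost = 1)
svm.pred <- predict(svm.fit, newdata = testing_data)
mean(svm.pred == testing_data$income)   # accuracy of the SVM on held-out data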

With that said you can find the code for random forest and logistic regression here ->

code

and for SVM here ->

SVM code

 

Feature Engineering / Data Pre-Processing Code Walk-Through

# @Author Mohammed 25-12-2016
source("Libraries.r")

#Downloading adult income data set from UCI
print("=================Downloading Data=========================")

data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
sep=",",header=F,
col.names=c("age", "type_employer", "fnlwgt", "education",
"education_num","marital", "occupation", "relationship", "race","sex",
"capital_gain", "capital_loss", "hr_per_week","country", "income"),
fill=FALSE,strip.white=T)

print("=====================Data Loaded===============================")

#Data Dictionary
View(head(data,20))

#So first things first lets understand each column
#1)Age-> Like I need to explain :p (continuous).
#2)workclass-> Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
#3)fnlwgt-> continuous (have not quite understood this).
#4)education-> Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
#5)education-num->continuous (A numeric representation of education).
#6)marital-status-> Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
#7)occupation-> Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
#8)relationship-> Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
#9)race-> White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
#10)sex-> Female, Male.
#11)capital-gain-> continuous.
#12)capital-loss-> continuous.
#13)hours-per-week-> continuous.
#14)native-country-> United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
#As we can clearly see from above education num is a numeric
#representation of education so we will go ahead and delete it
data[["education_num"]]=NULL
########### Binning
#What we will do here is convert our raw numeric fields into bins
#we'll do this for the ones where it makes sense, like age
#=========================binning for Age==================
#Age<18=child 18<Age<=30=Young Adult
#30<Age<=60=adult Age>60=Senior
ndata<-data
for (i in 1:nrow(ndata)){
  if(ndata[i,1]<=18){
    ndata[i,1]="child"
  }else if (ndata[i,1]>18 && ndata[i,1]<=30){
    ndata[i,1]="young_adult"
  }else if (ndata[i,1]>30 && ndata[i,1]<=60){
    ndata[i,1]="adult"
  }else if (ndata[i,1]>60){
    ndata[i,1]="senior"
  }
}
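
#(Sketch, not part of the original walk-through) The same binning can be done
#without the loop using cut(); this is an equivalent, vectorised alternative
ndata$age <- as.character(cut(data$age, breaks = c(-Inf, 18, 30, 60, Inf),
                              labels = c("child","young_adult","adult","senior")))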

#=========================binning for Hours worked per week=======
#HW<=25=Part_Time 25<HW<=40=Full_Time
#40<HW<=60=Over_Time HW>60=TIME_TOMUCH
npdata<-ndata
for (i in 1:nrow(npdata)){
  if(npdata[i,12]<=25){
    npdata[i,12]="Part_Time"
  }else if (npdata[i,12]>25 && npdata[i,12]<=40){
    npdata[i,12]="Full_Time"
  }else if (npdata[i,12]>40 && npdata[i,12]<=60){
    npdata[i,12]="Over_Time"
  }else if (npdata[i,12]>60){
    npdata[i,12]="TIME_TOMUCH"
  }
}

data<-npdata

######################### Factor to Character
#Convert factor variables to character
#We do this for ease of processing
fctr.cols <- sapply(data, is.factor)
data[, fctr.cols] <- sapply(data[, fctr.cols], as.character)
######################## Missing Value Treatment

is.na(data) = data=='?'
is.na(data) = data==' ?'
data = na.omit(data)

#######################################################################
train_test<-data

#Employer type has many levels, which adds complexity,
#so let's reduce them into broader groups using gsub()
#This keeps the complexity down
train_test$type_employer = gsub("^Federal-gov","Federal-Govt",train_test$type_employer)
train_test$type_employer = gsub("^Local-gov","Other-Govt",train_test$type_employer)
train_test$type_employer = gsub("^State-gov","Other-Govt",train_test$type_employer)
train_test$type_employer = gsub("^Private","Private",train_test$type_employer)
train_test$type_employer = gsub("^Self-emp-inc","Self-Employed",train_test$type_employer)
train_test$type_employer = gsub("^Self-emp-not-inc","Self-Employed",train_test$type_employer)
train_test$type_employer = gsub("^Without-pay","Not-Working",train_test$type_employer)
train_test$type_employer = gsub("^Never-worked","Not-Working",train_test$type_employer)

#Similarly here
train_test$occupation = gsub("^Adm-clerical","Admin",train_test$occupation)
train_test$occupation = gsub("^Armed-Forces","Military",train_test$occupation)
train_test$occupation = gsub("^Craft-repair","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Exec-managerial","White-Collar",train_test$occupation)
train_test$occupation = gsub("^Farming-fishing","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Handlers-cleaners","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Machine-op-inspct","Blue-Collar",train_test$occupation)
train_test$occupation = gsub("^Other-service","Service",train_test$occupation)
train_test$occupation = gsub("^Priv-house-serv","Service",train_test$occupation)
train_test$occupation = gsub("^Prof-specialty","Professional",train_test$occupation)
train_test$occupation = gsub("^Protective-serv","Other-Occupations",train_test$occupation)
train_test$occupation = gsub("^Sales","Sales",train_test$occupation)
train_test$occupation = gsub("^Tech-support","Other-Occupations",train_test$occupation)
train_test$occupation = gsub("^Transport-moving","Blue-Collar",train_test$occupation)

#You are right, country too
train_test$country[train_test$country=="Cambodia"] = "SE-Asia"
train_test$country[train_test$country=="Canada"] = "British-Commonwealth"
train_test$country[train_test$country=="China"] = "China"
train_test$country[train_test$country=="Columbia"] = "South-America"
train_test$country[train_test$country=="Cuba"] = "Other"
train_test$country[train_test$country=="Dominican-Republic"] = "Latin-America"
train_test$country[train_test$country=="Ecuador"] = "South-America"
train_test$country[train_test$country=="El-Salvador"] = "South-America"
train_test$country[train_test$country=="England"] = "British-Commonwealth"
train_test$country[train_test$country=="France"] = "Euro_1"
train_test$country[train_test$country=="Germany"] = "Euro_1"
train_test$country[train_test$country=="Greece"] = "Euro_2"
train_test$country[train_test$country=="Guatemala"] = "Latin-America"
train_test$country[train_test$country=="Haiti"] = "Latin-America"
train_test$country[train_test$country=="Holand-Netherlands"] = "Euro_1"
train_test$country[train_test$country=="Honduras"] = "Latin-America"
train_test$country[train_test$country=="Hong"] = "China"
train_test$country[train_test$country=="Hungary"] = "Euro_2"
train_test$country[train_test$country=="India"] = "British-Commonwealth"
train_test$country[train_test$country=="Iran"] = "Other"
train_test$country[train_test$country=="Ireland"] = "British-Commonwealth"
train_test$country[train_test$country=="Italy"] = "Euro_1"
train_test$country[train_test$country=="Jamaica"] = "Latin-America"
train_test$country[train_test$country=="Japan"] = "Other"
train_test$country[train_test$country=="Laos"] = "SE-Asia"
train_test$country[train_test$country=="Mexico"] = "Latin-America"
train_test$country[train_test$country=="Nicaragua"] = "Latin-America"
train_test$country[train_test$country=="Outlying-US(Guam-USVI-etc)"] = "Latin-America"
train_test$country[train_test$country=="Peru"] = "South-America"
train_test$country[train_test$country=="Philippines"] = "SE-Asia"
train_test$country[train_test$country=="Poland"] = "Euro_2"
train_test$country[train_test$country=="Portugal"] = "Euro_2"
train_test$country[train_test$country=="Puerto-Rico"] = "Latin-America"
train_test$country[train_test$country=="Scotland"] = "British-Commonwealth"
train_test$country[train_test$country=="South"] = "Euro_2"
train_test$country[train_test$country=="Taiwan"] = "China"
train_test$country[train_test$country=="Thailand"] = "SE-Asia"
train_test$country[train_test$country=="Trinadad&Tobago"] = "Latin-America"
train_test$country[train_test$country=="United-States"] = "United-States"
train_test$country[train_test$country=="Vietnam"] = "SE-Asia"
train_test$country[train_test$country=="Yugoslavia"] = "Euro_2"
#Education is most important
train_test$education = gsub("^10th","Dropout",train_test$education)
train_test$education = gsub("^11th","Dropout",train_test$education)
train_test$education = gsub("^12th","Dropout",train_test$education)
train_test$education = gsub("^1st-4th","Dropout",train_test$education)
train_test$education = gsub("^5th-6th","Dropout",train_test$education)
train_test$education = gsub("^7th-8th","Dropout",train_test$education)
train_test$education = gsub("^9th","Dropout",train_test$education)
train_test$education = gsub("^Assoc-acdm","Associates",train_test$education)
train_test$education = gsub("^Assoc-voc","Associates",train_test$education)
train_test$education = gsub("^Bachelors","Bachelors",train_test$education)
train_test$education = gsub("^Doctorate","Doctorate",train_test$education)
train_test$education = gsub("^HS-grad","HS-Graduate",train_test$education)
train_test$education = gsub("^Masters","Masters",train_test$education)
train_test$education = gsub("^Preschool","Dropout",train_test$education)
train_test$education = gsub("^Prof-school","Prof-School",train_test$education)
train_test$education = gsub("^Some-college","HS-Graduate",train_test$education)

# Similarly marital
train_test$marital[train_test$marital=="Never-married"] = "Never-Married"
train_test$marital[train_test$marital=="Married-AF-spouse"] = "Married"
train_test$marital[train_test$marital=="Married-civ-spouse"] = "Married"
train_test$marital[train_test$marital=="Married-spouse-absent"] = "Not-Married"
train_test$marital[train_test$marital=="Separated"] = "Not-Married"
train_test$marital[train_test$marital=="Divorced"] = "Not-Married"
train_test$marital[train_test$marital=="Widowed"] = "Widowed"

#Leaving race behind is racist no? 😛
train_test$race[train_test$race=="White"] = "White"
train_test$race[train_test$race=="Black"] = "Black"
train_test$race[train_test$race=="Amer-Indian-Eskimo"] = "Amer-Indian"
train_test$race[train_test$race=="Asian-Pac-Islander"] = "Asian"
train_test$race[train_test$race=="Other"] = "Other"

#Getting income below or above 50K to High and low
train_test$income[train_test$income==">50K"]="High"
train_test$income[train_test$income=="<=50K"]="Low"

######################################################################
write.csv(x=train_test,file="ADULT_USI_FE_CATEGORICAL.csv")

######################################################################
#Doing Label Encoding
#This converts all categorical things to numeric
features = names(train_test[,-14])
for (f in features) {
  if (class(train_test[[f]])=="character") {
    #cat("VARIABLE : ",f,"\n")
    levels <- unique(train_test[[f]])
    train_test[[f]] <- as.numeric(as.integer(factor(train_test[[f]], levels=levels)))
  }
}
write.csv(x=train_test,file="ADULT_USI_FE_Numerical.csv")
print("=====================Loading datasets complete.=================")

 

You can find this project at->

https://github.com/mmd52/UCI_ADULT_DATSET_CATEGORICAL_PROJECT

Feature Engineering / Data Pre-Processing

If you have seen my previous posts, you may have noticed that I wasn't able to achieve a really high accuracy with simple models.

The fault did not lie with any of the models. The main fault was with the data itself; it was not ready. That is why in any project we spend 80% of the time cleaning the data and making it sane and understandable. This is where domain knowledge turns out to be very useful.

Continue reading “Feature Engineering / Data Pre-Processing”