Revised Approach to the UCI Adult Data Set

If you have seen my earlier posts in the UCI Adult Data Set section, you may have noticed that my accuracy was not getting above 86%.

An important thing I learnt the hard way was to never eliminate rows from a data set. It's fine to eliminate columns in which more than 30% of the values are NA, but never eliminate rows.

Because of this I had to redo my feature engineering. So how did I fix my missing values? I opened my data set in Excel and converted all ‘?’ values to ‘NA’.
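The same conversion can also be done while reading the file, without a spreadsheet step. The following is a minimal Python sketch, not the R code from the repository; the function name `read_rows`, the column names and the three sample rows are made up for illustration.

```python
import csv
import io

def read_rows(handle, missing_token="?"):
    """Read CSV rows as dicts, turning the missing-value token into None."""
    # skipinitialspace drops the blank that follows each comma in the raw file.
    reader = csv.DictReader(handle, skipinitialspace=True)
    rows = []
    for row in reader:
        cleaned = {}
        for key, value in row.items():
            text = value.strip() if value is not None else None
            cleaned[key] = None if text in (None, missing_token) else text
        rows.append(cleaned)
    return rows

# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "age,type_employer,occupation,country\n"
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "31, Private, Sales, ?\n"
)
rows = read_rows(sample)
for row in rows:
    print(row)
```

Using `None` rather than the string ‘NA’ is only a choice made for this sketch; it keeps missing values distinct from any real category label.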

This makes feature engineering simpler. The next step is to identify the columns with missing values and check whether more than 30% of their values are missing.

In our case type_employer had 1836 missing values, occupation had 1843 missing values and country had 583 missing values, so none of these columns came anywhere near the 30% threshold.
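A per-column check of this kind could look roughly like the sketch below. It is written in Python rather than the R used in the repository, and it assumes the rows are dictionaries in which a missing value is stored as `None`. The helper name `missing_report` and the four sample rows are hypothetical.

```python
def missing_report(rows, threshold=0.30):
    """Return {column: (missing count, missing share, above threshold?)}."""
    total = len(rows)
    report = {}
    for column in rows[0]:
        missing = sum(1 for row in rows if row[column] is None)
        share = missing / total
        report[column] = (missing, share, share > threshold)
    return report

# Made-up rows; the real data set has far more rows and columns.
rows = [
    {"type_employer": "Private", "occupation": "Sales", "country": "United-States"},
    {"type_employer": None, "occupation": None, "country": "United-States"},
    {"type_employer": "Private", "occupation": "Sales", "country": None},
    {"type_employer": "State-gov", "occupation": "Adm-clerical", "country": "India"},
]
report = missing_report(rows)
for column, (missing, share, drop) in report.items():
    print(f"{column}: {missing} missing ({share:.0%}), drop column: {drop}")
```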

So what I did was predict the missing values from the other independent variables (no, I didn't use income to predict them). Once the model was built, I used it to replace the missing values in those columns. That gave me a clean data set with no missing values.

I admit the predictions were not that great, but they were tolerable.
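To make the idea concrete, here is a deliberately simplified Python sketch of filling a column from the other variables. It is not the random forest used in the actual feature engineering code. As a stand-in, it picks the most common value among complete rows that share the same predictor values, and falls back to the overall most common value. The function name `impute_column`, the choice of predictors and the sample rows are assumptions for illustration. Income is left out of the predictors, as described above.

```python
from collections import Counter, defaultdict

def impute_column(rows, target, predictors):
    """Fill None in `target` using rows that match on the predictor columns."""
    known = [row for row in rows if row[target] is not None]
    overall = Counter(row[target] for row in known).most_common(1)[0][0]

    # Count target values for every combination of predictor values.
    by_key = defaultdict(Counter)
    for row in known:
        by_key[tuple(row[p] for p in predictors)][row[target]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row[target] is None:
            counts = by_key.get(tuple(row[p] for p in predictors))
            new_row[target] = counts.most_common(1)[0][0] if counts else overall
        filled.append(new_row)
    return filled

# Made-up rows for illustration only.
rows = [
    {"education": "Bachelors", "sex": "Male", "occupation": "Prof-specialty"},
    {"education": "Bachelors", "sex": "Male", "occupation": "Prof-specialty"},
    {"education": "HS-grad", "sex": "Male", "occupation": "Craft-repair"},
    {"education": "HS-grad", "sex": "Male", "occupation": "Craft-repair"},
    {"education": "HS-grad", "sex": "Female", "occupation": "Craft-repair"},
    {"education": "Bachelors", "sex": "Male", "occupation": None},
    {"education": "Masters", "sex": "Female", "occupation": None},
]
filled = impute_column(rows, target="occupation", predictors=["education", "sex"])
for row in filled:
    print(row)
```

A real classifier generalises across predictor combinations it has not seen exactly, which this lookup cannot do. The sketch only shows the shape of the step: train on complete rows, then predict for the incomplete ones.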

With the missing values filled in, I ran the following models and my accuracy improved:

  1. Logistic Regression -> 85.38%
  2. Random Forest (excluding the variable country) -> 87.11%
  3. SVM -> 85.8%
  4. XGBoost with 10 folds -> 87.08%

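Here I would rely on the XGBoost model, even though its accuracy is slightly lower than that of Random Forest, because it was evaluated with k-fold cross-validation, which makes the estimate less dependent on a single train/test split.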
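For readers unfamiliar with k-fold cross-validation, the Python sketch below shows the mechanics. The rows are shuffled and split into k folds, each fold is held out once while the rest is used for training, and the k scores are averaged. To stay self-contained it uses a trivial majority-class "model" in place of XGBoost. The function names and the 75/25 label mix are made up for illustration.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(labels, k=10, seed=0):
    """Average held-out accuracy of a majority-class predictor over k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        # "Training" here is just finding the most common label.
        majority = Counter(train_labels).most_common(1)[0][0]
        correct = sum(1 for i in held_out if labels[i] == majority)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Made-up label vector: 75 rows of one class, 25 of the other.
labels = ["<=50K"] * 75 + [">50K"] * 25
accuracy = cross_validated_accuracy(labels, k=10)
print(f"mean accuracy over 10 folds: {accuracy:.2%}")
```

Because every row is held out exactly once, the averaged score reflects the whole data set rather than one particular split.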

You can find my feature engineering code and model code, along with their csv files, at my github link.

The comments in the R file explain the code quite well.

  1. The Libraries.R file helps load the libraries required for the functions. It contains a check-load method which installs a package if it is not already installed and then loads it. code->
  2. The FE file predicts the NA values using random forest and creates a csv file with no missing values, which we can use for our prediction purposes (i.e. to predict income). code->
  3. Logistic Regression is self-explanatory; see for yourself. code->
  4. SVM code->
  5. Random Forest is a bit tricky here. The problem was that when run with the variable country its accuracy fell to 85%, but without it the accuracy rose to 87.11%. Because of this I have excluded country from my prediction. code->
  6. XGBoost simply predicted like a dream, with k-fold cross-validation. For XGBoost I had to convert all values to numeric, and after making a matrix I simply split it into training and testing sets. Training had 70% of the values and testing had the remaining 30%. I made use of 10 folds in the function, leading to an accuracy of 87.02%. CODE->
  7. The last method I wanted to try was KNN. code->
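The preparation described for XGBoost in item 6 (convert everything to numeric, build a matrix, split 70/30) can be sketched in Python as below. This is an illustration only, not the R code behind the link. Integer codes are just one possible way of making categories numeric, and the post does not say which encoding was actually used. The helper names and the ten sample rows are hypothetical.

```python
import random

def encode_numeric(rows, columns):
    """Map each distinct category to an integer code, column by column."""
    codebooks = {column: {} for column in columns}
    matrix = []
    for row in rows:
        encoded = []
        for column in columns:
            book = codebooks[column]
            # The first time a value is seen it gets the next free code.
            encoded.append(book.setdefault(row[column], len(book)))
        matrix.append(encoded)
    return matrix, codebooks

def train_test_split(matrix, train_share=0.70, seed=0):
    """Shuffle the rows and cut them into training and testing parts."""
    indices = list(range(len(matrix)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_share * len(indices)))
    train = [matrix[i] for i in indices[:cut]]
    test = [matrix[i] for i in indices[cut:]]
    return train, test

# Ten made-up rows for illustration.
employers = ["Private", "State-gov", "Self-emp"]
sexes = ["Male", "Female"]
rows = [{"type_employer": employers[i % 3], "sex": sexes[i % 2]} for i in range(10)]

matrix, codebooks = encode_numeric(rows, ["type_employer", "sex"])
train, test = train_test_split(matrix, train_share=0.70)
print(codebooks)
print(len(train), "training rows,", len(test), "testing rows")
```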

With this I conclude the predictions on the UCI Adult Data Set. You could perhaps still increase the accuracy by stacking or by tuning an SVM model.

