Feature Engineering / Data Pre-Processing

If you have read my previous posts, you may have noticed that I was unable to achieve really high accuracy with simple models.

The fault did not lie with any of the models. The main fault lay with the data itself: it was not ready. That is why in any project we spend around 80% of the time cleaning the data, making it sane, and understanding it. This is where domain knowledge turns out to be very useful.

This is exactly the reason why I decided to do some data pre-processing.

What did I do:

1) Understood what role every column plays in determining a person's income

2) Decided whether or not to group some of a column's values together

3) Decided whether or not to bin a column's contents into groups

4) Saved each column as either categorical or numerical

That's all I did at this point, and my accuracy with random forest jumped greatly.
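To make the grouping and binning steps above concrete, here is a minimal sketch in Python. It is not the code from my repository (that one is in R). The group mapping, the age bin edges, the helper names and the three sample rows are all made up for illustration, so treat it as one possible way to do it under those assumptions.

```python
from bisect import bisect_right

# Hypothetical grouping: collapse detailed workclass values into broader groups.
WORKCLASS_GROUPS = {
    "Federal-gov": "Government",
    "State-gov": "Government",
    "Local-gov": "Government",
    "Self-emp-inc": "Self-employed",
    "Self-emp-not-inc": "Self-employed",
    "Private": "Private",
}

# Hypothetical age bins: each edge is the lower bound of the next bin.
AGE_EDGES = [25, 45, 65]
AGE_LABELS = ["<25", "25-44", "45-64", "65+"]


def group_workclass(value):
    """Map a raw workclass value to its group; unknown values become 'Other'."""
    return WORKCLASS_GROUPS.get(value.strip(), "Other")


def bin_age(age):
    """Return the label of the age bin that contains `age`."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def engineer(row):
    """Build a new row with two categorical columns and one numerical column."""
    return {
        "workclass_group": group_workclass(row["workclass"]),  # categorical
        "age_bin": bin_age(row["age"]),                        # categorical
        "hours_per_week": row["hours_per_week"],               # numerical, unchanged
    }


# Made-up sample rows, only to exercise the functions above.
rows = [
    {"age": 39, "workclass": "State-gov", "hours_per_week": 40},
    {"age": 23, "workclass": "Private", "hours_per_week": 30},
    {"age": 67, "workclass": "?", "hours_per_week": 20},
]

engineered = [engineer(r) for r in rows]
for r in engineered:
    print(r)
```

Whether a column is worth grouping or binning at all is still a judgment call that depends on the data, which is where the domain knowledge comes in.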

In my next post I will explain every line of my code in detail. You can find the code on GitHub at https://github.com/mmd52/UCI_ADULT_DATSET_CATEGORICAL_PROJECT/blob/master/NewData_And_FeatureEngineering.R

