If you have read my previous posts, you may have noticed that I wasn't able to achieve really high accuracy with simple models.
The fault did not lie with any of the models. The main fault lay with the data itself: it was not ready. That is why, in any project, we spend about 80% of the time cleaning the data, making it sane, and understanding it. This is where domain knowledge turns out to be very useful.
This is exactly why I decided to do some data pre-processing.
What did I do?
1) Understood what role every column plays in determining a person's income.
2) Decided whether or not to group some of a column's values together.
3) Decided whether or not to bin a column's contents into groups.
4) Saved each column as either categorical or numerical.
That's all I did for now, and my accuracy with random forest jumped greatly.
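To make the grouping and binning steps concrete, here is a minimal sketch in plain Python. It is not the code from my R script: the column names, the education groups, and the age cut-offs are all made up for illustration, and the real choices should come from looking at the data.

```python
# Hypothetical mapping from raw education levels to coarser groups.
# These groups are an assumption for illustration only.
EDUCATION_GROUPS = {
    "Preschool": "School",
    "1st-4th": "School",
    "HS-grad": "HighSchool",
    "Some-college": "HighSchool",
    "Bachelors": "Higher",
    "Masters": "Higher",
    "Doctorate": "Higher",
}


def group_category(value, mapping, default="Other"):
    """Collapse a raw categorical value into a coarser group."""
    return mapping.get(value.strip(), default)


def bin_age(age):
    """Bin a numeric age into a labelled range (cut-offs are assumed)."""
    if age < 25:
        return "Young"
    if age < 45:
        return "Middle"
    if age < 65:
        return "Senior"
    return "Retired"


def preprocess(row):
    """Return one row with grouped, binned, and numeric columns."""
    return {
        "age_group": bin_age(int(row["age"])),                  # categorical
        "education_group": group_category(row["education"],
                                          EDUCATION_GROUPS),    # categorical
        "hours_per_week": int(row["hours_per_week"]),           # numerical
    }


# Two made-up rows in the style of the Adult dataset.
rows = [
    {"age": "39", "education": " Bachelors", "hours_per_week": "40"},
    {"age": "19", "education": " 11th", "hours_per_week": "20"},
]
processed = [preprocess(r) for r in rows]
```

Values that are not in the mapping fall back to "Other", so rare levels end up in a single group instead of each getting its own category.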
In my next post I will explain every line of my code in detail. You can find the code on GitHub at https://github.com/mmd52/UCI_ADULT_DATSET_CATEGORICAL_PROJECT/blob/master/NewData_And_FeatureEngineering.R