Exploratory Data Analysis

We have a classification problem. Our data set has in total 8 independent variables, out of which one is a factor and 7 our continuous. This means we should have at-least 8 plots.

The target variable Outcome should be plotted against each independent variable if we want to derive any inferences and leave no stones unturned for it.

So if we need to plot 2 factor variables, we should preferably use a stacked bar chart or mosaic plot.

For one numeric and other factor bar plots seem like a good option.

And for two numeric variables we have out faithful scatter plot to the rescue.

In this blog I post I will not be stressing much on words but more on code and inferences made which is well explained and documented in my code.

I strongly suggest you view the code below, which has inferences and a well documented structure.

You can download the data from

DATA-> https://github.com/mmd52/Pima_R (A file named as diabetes.csv is the one)

R Code ->  https://github.com/mmd52/Pima_R/blob/master/EDA.R (A fair warning to execute the EDA code in R you will first need to execute the https://github.com/mmd52/Pima_R/blob/master/Libraries.R and https://github.com/mmd52/Pima_R/blob/master/Data.R)

Python Code-> https://github.com/mmd52/Pima_Python/blob/master/EDA.ipynb (Its a Jupyter Notebook)


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s