The idea behind using this data set from the UCI repository is not just to run models, but to derive inferences that match the real world.
Our predictions become more sensible and robust when we have understood the data set and derived correct inferences from it that agree with those predictions.
Our approach to this data set will be to perform the following:
- Exploratory data analysis while deriving inferences from it
- Using techniques like PCA and checking the correlation between variables
- Running various models and making inferences from the predictions
We will do all of this in both R and Python.
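As a rough illustration of the correlation check mentioned in the list above, here is a minimal Python sketch of the Pearson correlation coefficient using only the standard library. The function name `pearson` and the toy values are assumptions made for illustration; the toy values are not rows from the data set, and in practice a library routine would normally be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values for illustration only, NOT taken from the data set.
toy_a = [1.0, 2.0, 3.0, 4.0]
toy_b = [2.0, 4.0, 6.0, 8.0]   # rises with toy_a
toy_c = [8.0, 6.0, 4.0, 2.0]   # falls as toy_a rises

r_pos = pearson(toy_a, toy_b)
r_neg = pearson(toy_a, toy_c)
print(r_pos, r_neg)
```

A value near 1 or -1 means two variables move together strongly, while a value near 0 suggests little linear relationship.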
The data is now provided by UCI:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Sc
Let us begin by understanding the problem, and what better way to explain it than a short video, which you can view here: https://youtu.be/pN4HqWRybwk
From the video we understand that the Pima Indian tribe has a gene that is aggravated by eating food high in sugar. The UCI Pima Indians data set is a collection of data on females from the Pima tribe. Of the 768 rows in the data set, 268 have diabetes.
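The row counts quoted above already say something about class balance. The short Python sketch below works through the arithmetic using only those two numbers; the variable names are chosen here for illustration, and the "majority-class baseline" is simply the accuracy a model would get by always predicting the more common class.

```python
# Counts as stated in the text: 768 rows, 268 of them with diabetes.
total_rows = 768
diabetic_rows = 268

non_diabetic_rows = total_rows - diabetic_rows
diabetic_share = diabetic_rows / total_rows

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline_accuracy = max(diabetic_rows, non_diabetic_rows) / total_rows

print(f"non-diabetic rows: {non_diabetic_rows}")
print(f"diabetic share: {diabetic_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Roughly a third of the rows are positive, so any model we fit later should be compared against that baseline rather than against 50%.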
You can find the data set description here: https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names
The problem statement is to correctly classify and predict whether a female has diabetes or not. Thus it is a classification problem.
The good news for us is that the data set has no null or missing values and, as the cherry on top, it is completely numeric. Only the target variable outcome and pregnancies are factor variables; the remaining variables are continuous numeric variables.
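Claims like these are worth verifying rather than taking on trust. The following is a minimal sketch, in standard-library Python, of one way to count empty cells and confirm that every column parses as a number. The helper `summarize_columns`, the column names, and the sample rows are all assumptions made up for illustration (the sample deliberately contains one empty cell to show what the check reports); they are not taken from the actual file.

```python
import csv
import io


def summarize_columns(csv_text):
    """Per column: number of empty cells, and whether all non-empty cells are numeric."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {name: {"missing": 0, "numeric": True} for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            cell = (row[name] or "").strip()
            if cell == "":
                summary[name]["missing"] += 1
                continue
            try:
                float(cell)
            except ValueError:
                summary[name]["numeric"] = False
    return summary


# Made-up sample with hypothetical column names, NOT real rows from the data set.
sample = """Pregnancies,Glucose,BMI,Outcome
2,120,31.5,1
0,95,24.0,0
4,,28.2,0
"""

report = summarize_columns(sample)
for column, stats in report.items():
    print(column, stats)
```

On the real file, the expectation from the text above is that every column would show zero missing cells and be fully numeric.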