Although the Telecom data provided by https://www.sgi.com/tech/mlc/db/ has no missing values, it suffers from severe class imbalance.
That is why the only thing we will concentrate on in our feature engineering is correcting the class imbalance.
    > summary(train$Customer_Left)
    False  True
     2850   483
The training set contains 2850 retained customers and 483 customers who left. Because of this, I will oversample the customers who left to balance the data set.
Suppose I do not oversample. Then, without building any model at all, I can simply predict "customer retained" every time and still be right 85.5% of the time (2850 out of 3333). To break this bias I use a technique known as SMOTE (Synthetic Minority Over-sampling Technique). You can read the research paper, published in the Journal of Artificial Intelligence Research 16 (2002), here -> https://www.jair.org/media/953/live-953-2037-jair.pdf
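As a quick check of that baseline, the share of the majority class can be computed directly from the class counts. This is a minimal sketch: the vector `counts` is typed in by hand from the summary output above rather than read from `train`.

```r
# Class counts copied from summary(train$Customer_Left)
counts <- c(retained = 2850, left = 483)

# Accuracy of always predicting the majority class ("retained")
baseline_accuracy <- max(counts) / sum(counts)

# Express as a percentage with one decimal place
round(100 * baseline_accuracy, 1)   # 85.5
```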
    > train$Customer_Left<-as.numeric(train$Customer_Left)
    > summary(as.factor(train$Customer_Left))
       1    2
     483 2850
    > train$Customer_Left[train$Customer_Left==2]<-0
    > summary(as.factor(train$Customer_Left))
       0    1
    2850  483
    # here 0 -> retained (2850 customers), 1 -> left (483 customers)
    > train$Customer_Left<-as.factor(train$Customer_Left)
    > ntrain<-SMOTE(Customer_Left~.,train,perc.over=200,k = 3)
    > ntrain$Customer_Left<-as.factor(ntrain$Customer_Left)
    > summary(ntrain$Customer_Left)
       0    1
    1932 1449
This has undersampled the retained customers from 2850 to 1932 and oversampled the customers who left the operator from 483 to 1449.
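Both counts follow from the oversampling and undersampling percentages. The sketch below reproduces the arithmetic only. It assumes the SMOTE call comes from the DMwR package, where `perc.under` defaults to 200 and is applied to the number of synthetic cases. That default is not stated in the code above, so treat it as an assumption, although it does reproduce the reported 1932 and 1449.

```r
# Arithmetic behind the class counts after SMOTE (no packages needed)
minority   <- 483   # customers who left, before SMOTE
perc_over  <- 200   # perc.over used in the SMOTE call
perc_under <- 200   # ASSUMPTION: DMwR's default value of perc.under

# perc.over = 200 creates 2 synthetic cases per original minority case
synthetic <- minority * perc_over / 100

# Minority class after SMOTE = original cases + synthetic cases
minority_after <- minority + synthetic

# Majority cases kept = perc.under percent of the synthetic cases
majority_after <- synthetic * perc_under / 100

c(retained = majority_after, left = minority_after)
```

Under these assumptions a larger `perc.under` would keep more retained customers, and a larger `perc.over` would create more synthetic leavers.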
Because train has been modified, I also had to apply the corresponding changes to test.
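The exact changes made to test are not shown here. One plausible reading is that only the label was recoded to the same 0/1 scheme, since resampling the test set would distort the evaluation. The sketch below illustrates that reading on a small made-up data frame; `toy_test`, `Day_Mins` and their values are hypothetical. It maps labels by name rather than with `as.numeric`, so the result does not depend on factor level order.

```r
# Hypothetical stand-in for the real test set
toy_test <- data.frame(
  Customer_Left = factor(c("False", "True", "False", "False")),
  Day_Mins      = c(120.5, 310.2, 98.0, 201.7)   # made-up feature values
)

# Map labels explicitly: "True" (left) -> 1, anything else (retained) -> 0
toy_test$Customer_Left <- factor(
  ifelse(toy_test$Customer_Left == "True", 1, 0),
  levels = c(0, 1)
)

# No SMOTE here: the test set keeps its natural class balance
summary(toy_test$Customer_Left)
```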
You can see the complete code here -> https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/FE.R
NOTE -> The libararies.R file contains code that loads the required packages. Any package that is not installed on your machine is downloaded and installed first.
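The load-or-install pattern described in this note can be sketched as follows. This is not the content of the author's file: the helper name `load_or_install` and the package names passed to it are illustrative assumptions.

```r
# Load each package, installing it first if it is not available
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)   # downloads from the configured CRAN mirror
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Example with packages that ship with R, so nothing is downloaded
load_or_install(c("stats", "utils"))
```

Checking with `requireNamespace` before calling `install.packages` avoids reinstalling packages on every run.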