One of the major problems that telecom operators face is customer retention. Because of this, most telecom operators want to know which customers are most likely to leave, so that they can immediately take action, such as offering a discount or a customised plan, to retain them.
However, the accuracy required of a churn analysis model is very high. Imagine our model has an accuracy of just 75% and, to keep the arithmetic simple, flags 25% of customers as likely to leave, while only 5% of customers actually want to leave. That leaves a margin of at least 20% of customers who were wrongly classified as customers who will leave the operator. If an operator has 10,000 customers and 2,500 are predicted to leave, the operator may have to release, let's assume, a $1 credit to each of them, which is a cost of $2,500. Credits were actually needed for only 5% of the customers, which is a cost of $500, so the operator spent $2,000 for no reason. If the operator has a large number of customers, this would lead to a huge loss.
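The arithmetic in this example can be reproduced in a few lines. This is only a sketch of the cost comparison described above; the variable names are made up for illustration, and the $1 credit is the same assumption used in the text.

```python
# Figures taken from the example in the text; variable names are illustrative.
total_customers = 10_000
flagged_customers = 2_500      # customers the model predicts will leave
actual_churn_rate = 0.05       # share of customers who really want to leave
credit_per_customer = 1        # assumed credit in dollars per flagged customer

actual_churners = round(total_customers * actual_churn_rate)

cost_spent = flagged_customers * credit_per_customer    # credits actually released
cost_needed = actual_churners * credit_per_customer     # credits that were required
wasted = cost_spent - cost_needed

print(f"Spent: ${cost_spent}, needed: ${cost_needed}, wasted: ${wasted}")
# Spent: $2500, needed: $500, wasted: $2000
```

Changing `total_customers` to a larger number shows how quickly the wasted amount grows for a big operator.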
Coming to the data, there is no freely available real telecom data as far as I know. However, the website https://www.sgi.com/tech/mlc/db/ provides data for churn analysis. This data is not real, but it represents real-world scenarios and is good for understanding and learning.
The data on the website is split into train and test sets and has no NA's, which means no data cleaning as such needs to be done before running models on it.
Now comes the question of which models to run on it. Some would say that since we need very high accuracy, we should run XGBoost or random forest. The downside is that we cannot easily explain to the operator on what basis XGBoost or random forest decides that a customer will leave. Even if we manage to explain it, the explanation is very complicated and is unlikely to be accepted.
Because of this, we have to rely on models that can be easily explained to the operator. This leaves us with two models for the classification task, i.e. customer leaves -> 0 or customer is retained -> 1. The models are logistic regression and decision tree.
Why logistic regression? Because the logit equation lets us explain to the operator why a customer is leaving: each coefficient shows how a variable pushes the odds up or down.
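To make this concrete, here is a minimal sketch of how a logit equation can be read. The coefficients and the variable names (`customer_service_calls`, `has_international_plan`) are hypothetical and not fitted on any data; they only illustrate the kind of explanation one could give. Following the coding above (0 = leaves, 1 = retained), the equation models the probability that a customer is retained.

```python
import math

# Hypothetical intercept and coefficients, NOT fitted on any real data.
intercept = 2.0
coefficients = {
    "customer_service_calls": -0.5,
    "has_international_plan": -1.0,
}

def retention_probability(customer):
    """Apply the logit equation: log(p / (1 - p)) = intercept + sum(coef * x)."""
    log_odds = intercept + sum(
        coef * customer[name] for name, coef in coefficients.items()
    )
    # Invert the logit to get a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))

# A made-up customer used only to exercise the function.
customer = {"customer_service_calls": 4, "has_international_plan": 1}
p = retention_probability(customer)
print(f"Probability the customer is retained: {p:.2f}")

# exp(coefficient) is the factor by which the odds of retention change
# for a one-unit increase in that variable, holding the others fixed.
for name, coef in coefficients.items():
    print(f"{name}: odds of retention multiply by {math.exp(coef):.2f}")
```

With these made-up numbers, a negative coefficient on service calls would be explained to the operator as "every extra call to customer service lowers the odds that this customer stays".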
Why decision tree? Because there is a neat flow showing how the tree makes its decisions: it splits on variables and branches yes or no, choosing the splits based on impurity measures such as entropy.
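As a rough illustration of how a tree can score a candidate split, the sketch below computes entropy, Gini impurity and the information gain of one split. The labels are made up (0 = leaves, 1 = retained), and this is not the exact procedure of any particular library, just the textbook definitions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) / len(parent)) * entropy(left) \
             + (len(right) / len(parent)) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: 0 = customer leaves, 1 = customer is retained.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]    # customers answering "yes" to some hypothetical question
right = [0, 1, 1, 1]   # customers answering "no"

print(f"Parent entropy:   {entropy(parent):.3f}")
print(f"Parent Gini:      {gini(parent):.3f}")
print(f"Information gain: {information_gain(parent, left, right):.3f}")
```

The tree prefers the question that gives the largest gain, which is why each branch can be read back to the operator as a simple yes/no rule.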
In further posts in this category I will cover everything from feature engineering to running models to interpretation.
The data available from the website is a bit awkward to save to a CSV file, so if you need it you can download the train and test data from below.
Also, an explanation of the variables is not provided, as they are fairly self-explanatory.