Logistic Regression And Interpretation On Telecom Data

If you have read my previous posts, you may have understood how feature engineering was done and why we are running a logistic regression n this data.

It is essential to understand we have two train sets

  1. The original train set
  2. The over sampled train set

Running Logistic regression on the normal data set yielded the following results

#Accuracy 87.53%
#Kappa 0.275
#Precision 22.807%
#Recall 0.59091 %
#AUC 0.602

capture101

Now running logistic regression on the over sampled data yielded the following results

#Accuracy 84.71 %
#Kappa 0.3265
#Precision 0.40351
#Recall 0.42593
#AUC 0.660

capture102

From both the models we can see when we use auc as our metric the over sampled data is clearly the winner. Also we will rely on the second model more because the kappa value is higher and precision recall values are closer.

One massive problem thanks to null deviance we face is that our accuracy after running our best model is 84.71% ; And our accuracy by running no model and stating customer retained is 85.8%. Means our model is not as effective as we would think. This means we either should try feature engineering or a different model.

As this data is falsified could be that our accuracy will always be bad, but lets assume logistic yielded a good result, let us try to understand the equation then,

Coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) 8.318e+00 1.078e+00 7.713 1.23e-14 ***
StateAL 1.767e-02 5.858e-01 0.030 0.975929 
StateAR -6.454e-01 6.095e-01 -1.059 0.289611 
StateAZ 6.812e-01 6.998e-01 0.973 0.330355 
StateCA -2.015e+00 5.959e-01 -3.382 0.000719 ***
StateCO -5.204e-01 5.958e-01 -0.873 0.382438 
StateCT -9.923e-01 5.768e-01 -1.720 0.085364 . 
StateDC -7.558e-01 6.655e-01 -1.136 0.256046 
StateDE -8.036e-01 5.780e-01 -1.390 0.164413 
StateFL -5.576e-01 5.832e-01 -0.956 0.339026 
StateGA -8.233e-01 5.512e-01 -1.494 0.135263 
StateHI 3.252e-01 7.279e-01 0.447 0.655050 
StateIA 2.778e-02 7.161e-01 0.039 0.969051 
StateID -4.276e-01 5.647e-01 -0.757 0.448895 
StateIL -1.270e+00 5.720e-01 -2.220 0.026441 * 
StateIN -7.517e-01 5.884e-01 -1.278 0.201372 
StateKS -1.164e+00 5.425e-01 -2.145 0.031918 * 
StateKY -7.379e-01 6.068e-01 -1.216 0.223949 
StateLA -1.080e+00 5.963e-01 -1.811 0.070173 . 
StateMA -1.541e+00 5.610e-01 -2.746 0.006032 ** 
StateMD -1.164e+00 5.565e-01 -2.092 0.036455 * 
StateME -1.915e+00 5.471e-01 -3.500 0.000465 ***
StateMI -1.501e+00 5.746e-01 -2.612 0.009011 ** 
StateMN -8.528e-01 5.486e-01 -1.555 0.120064 
StateMO 2.519e-01 6.291e-01 0.400 0.688826 
StateMS -1.467e+00 5.614e-01 -2.613 0.008987 ** 
StateMT -1.447e+00 5.473e-01 -2.644 0.008181 ** 
StateNC -8.929e-01 5.817e-01 -1.535 0.124820 
StateND -6.750e-01 6.037e-01 -1.118 0.263526 
StateNE -6.011e-01 5.911e-01 -1.017 0.309221 
StateNH -8.939e-01 6.064e-01 -1.474 0.140435 
StateNJ -1.738e+00 5.556e-01 -3.128 0.001761 ** 
StateNM -1.151e+00 5.471e-01 -2.104 0.035366 * 
StateNV -1.757e+00 5.525e-01 -3.180 0.001473 ** 
StateNY -1.080e+00 5.650e-01 -1.912 0.055908 . 
StateOH -5.434e-01 5.577e-01 -0.974 0.329891 
StateOK -1.484e+00 5.837e-01 -2.543 0.011001 * 
StateOR -4.159e-01 5.561e-01 -0.748 0.454572 
StatePA -8.248e-01 6.262e-01 -1.317 0.187836 
StateRI 4.828e-01 6.553e-01 0.737 0.461271 
StateSC -1.327e+00 5.734e-01 -2.313 0.020695 * 
StateSD -1.419e+00 5.936e-01 -2.390 0.016838 * 
StateTN -2.747e-01 5.931e-01 -0.463 0.643201 
StateTX -2.148e+00 5.466e-01 -3.929 8.53e-05 ***
StateUT -7.398e-01 5.785e-01 -1.279 0.200914 
StateVA 7.518e-01 6.311e-01 1.191 0.233547 
StateVT -4.988e-01 5.869e-01 -0.850 0.395327 
StateWA -1.369e+00 5.698e-01 -2.402 0.016308 * 
StateWI -2.333e-01 5.906e-01 -0.395 0.692830 
StateWV -4.497e-01 5.560e-01 -0.809 0.418600 
StateWY -1.921e-01 5.780e-01 -0.332 0.739637 
Account_Length -1.719e-03 1.189e-03 -1.446 0.148198 
Area_Code 1.860e-03 1.085e-03 1.714 0.086489 . 
Phone_No -1.627e-07 1.687e-07 -0.964 0.334881 
International_Plan yes -2.516e+00 1.206e-01 -20.858 < 2e-16 ***
Voice_Mail_Plan yes -1.028e-01 1.447e-01 -0.710 0.477407 
No_Vmail_Messages -2.941e-03 5.303e-03 -0.555 0.579144 
Total_Day_minutes -4.437e+00 2.775e+00 -1.599 0.109815 
Total_Day_Calls 3.982e-05 2.389e-03 0.017 0.986701 
Total_Day_charge 2.603e+01 1.632e+01 1.595 0.110808 
Total_Eve_Minutes -1.862e+00 1.418e+00 -1.313 0.189311 
Total_Eve_Calls -4.211e-03 2.379e-03 -1.770 0.076674 . 
Total_Eve_Charge 2.182e+01 1.668e+01 1.308 0.190938 
Total_Night_Minutes 9.630e-01 7.453e-01 1.292 0.196293 
Total_Night_Calls -6.086e-04 2.392e-03 -0.254 0.799175 
Total_Night_Charge -2.143e+01 1.656e+01 -1.294 0.195715 
Total_Intl_Minutes 2.219e+00 4.482e+00 0.495 0.620579 
Total_Intl_Calls 1.075e-01 2.053e-02 5.233 1.67e-07 ***
Total_Intl_Charge -8.763e+00 1.660e+01 -0.528 0.597585 
No_CS_Calls -5.475e-01 3.540e-02 -15.466 < 2e-16 ***
---

Cant read it ? well think you just made this model and your boss calls up and asks you, there is a customer his state his NV his total calls, charges and duration is xyz , Will he leave the telecom operator? if yes please explain?

What will you say , well its easy you look at the above table and start. Every factor that your boss gave fits in the equation and you could quantitatively justify your answer. All of this thanks to the historical data.

For simplicity lets consider equation

y = 45    +  60*(age)
where y=salary
45=intercept

How would you interpret this equation, it obvious you would say as age increases , so does salary increase. right?

How ever think again and think hard this time, what if I told you age is 0? Now explain it to me? Im sure you understood here that a newborn cannot have a salary of 45 $ without doing anything. This is where business understanding or domain knowledge comes into play.

We should usually avoid explaining the intercept unless the business understanding , helps you to explain it. But this is a Gray area, so its better to avoid explaining it , then to make a mess out of it.

However imagine if this same equation was for a packet of wafers

y = 45 + 0.1(weight)

Here we could simply say that mean weight that should be in a packet of wafers is 45 gms, however that is not always true so a variance factor in the form of coefficients is added.

That is why intercept at some places could be explained and some places cannot be.

You can find the code for logistic regression Here ->

https://github.com/mmd52/Telecom_Churn_Analysis/blob/master/Logistic_Regression.R

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s