If you have read my previous posts, you will know how the feature engineering was done and why we are running a logistic regression on this data.
It is essential to understand that we have two train sets:
- The original train set
- The over sampled train set
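As a reminder of what an over sampled train set looks like, here is a minimal sketch of simple random oversampling. It assumes the minority class is balanced by duplicating rows at random, which may differ from the technique used in the earlier post. The function name `random_oversample` and the toy rows are hypothetical.

```python
import random

def random_oversample(rows, label_of, seed=0):
    """Duplicate rows of the smaller classes at random until every class has the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Draw extra copies (with replacement) only for classes below the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy train set: 6 retained customers and 2 churned ones.
train = [{"id": i, "churn": i >= 6} for i in range(8)]
oversampled = random_oversample(train, label_of=lambda r: r["churn"])
churned = sum(r["churn"] for r in oversampled)
print(len(oversampled), churned)
```

Only the train set should be oversampled; the test set keeps its original class balance so the reported metrics stay honest.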
Running logistic regression on the original train set yielded the following results:
#Recall 0.59091 %
Running logistic regression on the over sampled train set yielded the following results:
#Accuracy 84.71 %
Comparing the two models with AUC as the metric, the model trained on the over sampled data is clearly the winner. We will also rely on the second model more because its kappa value is higher and its precision and recall values are closer to each other.
One massive problem, which the null deviance hints at, is that the accuracy of our best model is 84.71%, while the accuracy of running no model at all and simply stating that every customer is retained is 85.8%. This means our model is not as effective as we would think, so we should either try more feature engineering or a different model.
As this data is falsified, it could be that our accuracy will always be bad. But let's assume logistic regression yielded a good result and try to understand the equation:
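The comparison against "no model" can be sketched as follows. The label counts are hypothetical, chosen only so that 85.8% of customers are retained as stated above; the 84.71% figure is the one reported in this post.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Hypothetical test labels: 858 retained and 142 churned customers (85.8% retained).
labels = ["retained"] * 858 + ["churned"] * 142
baseline = majority_baseline_accuracy(labels)

model_accuracy = 0.8471  # accuracy of the over sampled model, from the post

print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {model_accuracy:.4f}")
print("model beats baseline" if model_accuracy > baseline else "model does not beat baseline")
```

On imbalanced data like this, accuracy alone hides the difference between the two, which is why recall, kappa and AUC are worth checking alongside it.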
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.318e+00 1.078e+00 7.713 1.23e-14 ***
StateAL 1.767e-02 5.858e-01 0.030 0.975929
StateAR -6.454e-01 6.095e-01 -1.059 0.289611
StateAZ 6.812e-01 6.998e-01 0.973 0.330355
StateCA -2.015e+00 5.959e-01 -3.382 0.000719 ***
StateCO -5.204e-01 5.958e-01 -0.873 0.382438
StateCT -9.923e-01 5.768e-01 -1.720 0.085364 .
StateDC -7.558e-01 6.655e-01 -1.136 0.256046
StateDE -8.036e-01 5.780e-01 -1.390 0.164413
StateFL -5.576e-01 5.832e-01 -0.956 0.339026
StateGA -8.233e-01 5.512e-01 -1.494 0.135263
StateHI 3.252e-01 7.279e-01 0.447 0.655050
StateIA 2.778e-02 7.161e-01 0.039 0.969051
StateID -4.276e-01 5.647e-01 -0.757 0.448895
StateIL -1.270e+00 5.720e-01 -2.220 0.026441 *
StateIN -7.517e-01 5.884e-01 -1.278 0.201372
StateKS -1.164e+00 5.425e-01 -2.145 0.031918 *
StateKY -7.379e-01 6.068e-01 -1.216 0.223949
StateLA -1.080e+00 5.963e-01 -1.811 0.070173 .
StateMA -1.541e+00 5.610e-01 -2.746 0.006032 **
StateMD -1.164e+00 5.565e-01 -2.092 0.036455 *
StateME -1.915e+00 5.471e-01 -3.500 0.000465 ***
StateMI -1.501e+00 5.746e-01 -2.612 0.009011 **
StateMN -8.528e-01 5.486e-01 -1.555 0.120064
StateMO 2.519e-01 6.291e-01 0.400 0.688826
StateMS -1.467e+00 5.614e-01 -2.613 0.008987 **
StateMT -1.447e+00 5.473e-01 -2.644 0.008181 **
StateNC -8.929e-01 5.817e-01 -1.535 0.124820
StateND -6.750e-01 6.037e-01 -1.118 0.263526
StateNE -6.011e-01 5.911e-01 -1.017 0.309221
StateNH -8.939e-01 6.064e-01 -1.474 0.140435
StateNJ -1.738e+00 5.556e-01 -3.128 0.001761 **
StateNM -1.151e+00 5.471e-01 -2.104 0.035366 *
StateNV -1.757e+00 5.525e-01 -3.180 0.001473 **
StateNY -1.080e+00 5.650e-01 -1.912 0.055908 .
StateOH -5.434e-01 5.577e-01 -0.974 0.329891
StateOK -1.484e+00 5.837e-01 -2.543 0.011001 *
StateOR -4.159e-01 5.561e-01 -0.748 0.454572
StatePA -8.248e-01 6.262e-01 -1.317 0.187836
StateRI 4.828e-01 6.553e-01 0.737 0.461271
StateSC -1.327e+00 5.734e-01 -2.313 0.020695 *
StateSD -1.419e+00 5.936e-01 -2.390 0.016838 *
StateTN -2.747e-01 5.931e-01 -0.463 0.643201
StateTX -2.148e+00 5.466e-01 -3.929 8.53e-05 ***
StateUT -7.398e-01 5.785e-01 -1.279 0.200914
StateVA 7.518e-01 6.311e-01 1.191 0.233547
StateVT -4.988e-01 5.869e-01 -0.850 0.395327
StateWA -1.369e+00 5.698e-01 -2.402 0.016308 *
StateWI -2.333e-01 5.906e-01 -0.395 0.692830
StateWV -4.497e-01 5.560e-01 -0.809 0.418600
StateWY -1.921e-01 5.780e-01 -0.332 0.739637
Account_Length -1.719e-03 1.189e-03 -1.446 0.148198
Area_Code 1.860e-03 1.085e-03 1.714 0.086489 .
Phone_No -1.627e-07 1.687e-07 -0.964 0.334881
International_Plan yes -2.516e+00 1.206e-01 -20.858 < 2e-16 ***
Voice_Mail_Plan yes -1.028e-01 1.447e-01 -0.710 0.477407
No_Vmail_Messages -2.941e-03 5.303e-03 -0.555 0.579144
Total_Day_minutes -4.437e+00 2.775e+00 -1.599 0.109815
Total_Day_Calls 3.982e-05 2.389e-03 0.017 0.986701
Total_Day_charge 2.603e+01 1.632e+01 1.595 0.110808
Total_Eve_Minutes -1.862e+00 1.418e+00 -1.313 0.189311
Total_Eve_Calls -4.211e-03 2.379e-03 -1.770 0.076674 .
Total_Eve_Charge 2.182e+01 1.668e+01 1.308 0.190938
Total_Night_Minutes 9.630e-01 7.453e-01 1.292 0.196293
Total_Night_Calls -6.086e-04 2.392e-03 -0.254 0.799175
Total_Night_Charge -2.143e+01 1.656e+01 -1.294 0.195715
Total_Intl_Minutes 2.219e+00 4.482e+00 0.495 0.620579
Total_Intl_Calls 1.075e-01 2.053e-02 5.233 1.67e-07 ***
Total_Intl_Charge -8.763e+00 1.660e+01 -0.528 0.597585
No_CS_Calls -5.475e-01 3.540e-02 -15.466 < 2e-16 ***
Can't read it? Well, imagine you have just built this model and your boss calls and asks: there is a customer, his state is NV and his total calls, charges and duration are xyz. Will he leave the telecom operator? If yes, please explain.
What will you say? It's easy: you look at the table above and start. Every factor that your boss gave fits into the equation, so you can justify your answer quantitatively. All of this is thanks to the historical data.
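To make "fits in the equation" concrete, here is a minimal sketch of how individual coefficients can be read. It uses only a handful of estimates copied from the table, so it shows the mechanics and is not a full prediction. Which outcome the model treats as positive depends on how the response was coded, which is not shown here, and the starting log-odds of 1.0 is a hypothetical value.

```python
import math

# A subset of the estimates from the summary table above (illustration only).
coefficients = {
    "StateNV": -1.757,
    "International_Plan yes": -2.516,
    "Total_Intl_Calls": 0.1075,
    "No_CS_Calls": -0.5475,
}

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase, other inputs held fixed."""
    return math.exp(coefficient)

def sigmoid(log_odds):
    """Convert a log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

for name, beta in coefficients.items():
    print(f"{name}: odds ratio {odds_ratio(beta):.3f}")

# Hypothetical customer whose full log-odds score is 1.0,
# before and after one additional customer-service call.
before = sigmoid(1.0)
after = sigmoid(1.0 + coefficients["No_CS_Calls"])
print(f"probability before: {before:.3f}, after one more call: {after:.3f}")
```

An odds ratio below 1 means the factor lowers the odds of the modelled outcome, and above 1 means it raises them. That is the kind of quantitative statement you can give your boss for each factor.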
For simplicity, let's consider the equation
y = 45 + 60*(age)
where y is salary. How would you interpret this equation? Obviously you would say that as age increases, so does salary, right?
However, think again, and think hard this time: what if I told you age is 0? Now explain it to me. I'm sure you see that a newborn cannot have a salary of $45 without doing anything. This is where business understanding or domain knowledge comes into play.
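The same point in a few lines of code: the slope is the change in y per extra year of age, and the intercept is whatever the equation returns at age 0. The function name `predicted_salary` is just an illustrative label for the toy equation.

```python
def predicted_salary(age):
    """The toy equation from the text: y = 45 + 60 * age."""
    return 45 + 60 * age

# Slope: how much the prediction changes for one extra year of age.
slope = predicted_salary(31) - predicted_salary(30)

# Intercept: the prediction at age 0, which has no sensible business meaning here.
at_zero = predicted_salary(0)

print(slope, at_zero)
```

The slope is meaningful anywhere inside the range of ages the data covers. The intercept only describes a point (age 0) that the data never contained.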
We should usually avoid explaining the intercept unless business understanding helps us explain it. This is a gray area, so it is better to avoid explaining it than to make a mess of it.
However, imagine the same kind of equation for a packet of wafers:
y = 45 + 0.1(weight)
Here we could simply say that the mean weight of a packet of wafers is 45 g. That is not always exactly true, so a variation term in the form of the coefficient is added.
That is why the intercept can be explained in some cases and not in others.
You can find the code for logistic regression Here ->