• Cleaned 1994 US Census Bureau demographic data covering 263.4 million residents and treated its outliers
• Built supervised learning models (Naive Bayes, Logit, Decision Tree, and Random Forest) for a multinational banking enterprise to predict the likelihood that a family's annual income exceeds $50,000
• Determined from the confusion matrix and ROC curve that Random Forest had the best accuracy (~93%), precision, and sensitivity; the predictions were used in a $25 million direct marketing effort
• Applied K-means, KNN, and Neural Networks to identify individuals who are more likely to default on loans in the future, comparing the models on ROC and accuracy statistics
2. EXECUTIVE SUMMARY
• Our banking enterprise asked us to analyze 30,000 customers to determine whether we can
predict one important outcome: loan defaults
• The data comes from this research: Yeh, I. C., & Lien, C. H. (2009). Expert Systems with
Applications, 36(2), 2473-2480
• Our analysis produced segmentation models that predict defaults with accuracies of
about 80% to 82% (best: ANN, 81.72%)
4. SLICE & DICE
• 30,000 customers in total, the majority of them female
• 11,888 male customers, of whom 2,873 have defaulted (24.16%)
• 18,112 female customers, of whom 3,763 have defaulted (20.77%)
Table A: Defaults on basis of marital status
Marital Status   Defaults
Married          13659
Single           15964
Other            323

Table B: Demographic distribution of sex, Demographics (30,000)
Sex      Defaults   No Defaults   Total
Male     2873       9015          11,888
Female   3763       14349         18,112
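The per-sex default rates quoted on this slide follow directly from the default and no-default counts in the demographic breakdown. A minimal sketch of that arithmetic, which should match the quoted percentages up to rounding in the last digit (the dictionary layout and the function name are illustrative, not part of the original analysis):

```python
# Counts taken from the demographic breakdown by sex; the layout is illustrative.
counts = {
    "Male": {"defaults": 2873, "no_defaults": 9015},
    "Female": {"defaults": 3763, "no_defaults": 14349},
}


def default_rate(group):
    """Share of customers in a group who defaulted."""
    total = group["defaults"] + group["no_defaults"]
    return group["defaults"] / total


for sex, group in counts.items():
    total = group["defaults"] + group["no_defaults"]
    print(f"{sex}: {total} customers, default rate {default_rate(group):.2%}")
```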
5. VISUALIZATIONS
Chart 1: BILL_AMT1 distribution against Males | Peak - 4743
Chart 2: BILL_AMT1 distribution against Females | Peak - 7958
Chart 3: Age against Defaults
• BILL_AMT1 for both sexes is positively skewed
• Defaults peak at the mean age of 61.6
7. SIMPLE KNN (K=35)
Data is partitioned in a 70 - 30 split for model-building purposes

Confusion Matrix
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    6751             280
Yes (Actual)   1420             549

Class Statistics
Accuracy                 81.1 %
Misclassification Rate   18.9 %
True Positive Rate       0.279
False Negative Rate      0.721
Specificity              0.96
Precision                0.662
Prevalence               0.218

Graph: ROC Curve for Simple KNN Model (k=35), AUC = 0.7427
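Every class statistic on this slide can be derived from the four cells of the confusion matrix. The sketch below reproduces them up to rounding; the function name and dictionary keys are illustrative. Note that the 0.721 figure is the complement of the true positive rate (the false negative rate), while the false positive rate is the complement of specificity and comes out at roughly 0.04 for these counts.

```python
def class_statistics(tn, fp, fn, tp):
    """Derive common classification metrics from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # sensitivity / recall
        "false_negative_rate": fn / (tp + fn),  # 1 - sensitivity
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }


# Simple KNN (k=35) counts from the confusion matrix on this slide.
knn = class_statistics(tn=6751, fp=280, fn=1420, tp=549)
for name, value in knn.items():
    print(f"{name}: {value:.3f}")
```

The same function applies unchanged to the K-means and ANN confusion matrices.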
8. K-MEANS (2 CLUSTERS, K= 27)
Data is partitioned in a 70 - 30 split for model-building purposes

Age <= 37 (AUC: 0.74)
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    4279             210
Yes (Actual)   869              329
Accuracy                 81.03 %
Misclassification Rate   18.97 %
True Positive Rate       0.275
False Negative Rate      0.725
Specificity              0.953
Precision                0.61
Prevalence               0.21

Age > 37 (AUC: 0.73)
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    2454             66
Yes (Actual)   579              214
Accuracy                 80.53 %
Misclassification Rate   19.47 %
True Positive Rate       0.27
False Negative Rate      0.73
Specificity              0.974
Precision                0.764
Prevalence               0.239

The unsegmented model has a better overall performance by 0.33 %
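The overall accuracy of a segmented model can be summarized in more than one way. The sketch below uses only the counts from the two confusion matrices on this slide and contrasts the unweighted mean of the per-segment accuracies with the pooled accuracy over all test rows. The slides do not say which summary they use, although the unweighted mean is the one that matches the 80.78 % reported for K-means (2 clusters) in the final comparison. The segment names are placeholders.

```python
# (tn, fp, fn, tp) per age segment, from the two confusion matrices on this slide.
segments = {
    "segment_1": (4279, 210, 869, 329),
    "segment_2": (2454, 66, 579, 214),
}


def accuracy(tn, fp, fn, tp):
    """Fraction of rows classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


# Unweighted mean: every segment counts equally, regardless of its size.
unweighted = sum(accuracy(*cells) for cells in segments.values()) / len(segments)

# Pooled accuracy: add up the cells first, so larger segments weigh more.
pooled_cells = [sum(column) for column in zip(*segments.values())]
pooled = accuracy(*pooled_cells)

print(f"unweighted mean: {unweighted:.2%}, pooled: {pooled:.2%}")
```

When segments differ in size, as they do here, the pooled figure is the one directly comparable to the accuracy of an unsegmented model evaluated on the same test rows.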
9. K-MEANS (3 CLUSTERS, K= 18)
Data is partitioned in a 70 - 30 split for model-building purposes

Age <= 31 (AUC: 0.73)
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    2715             143
Yes (Actual)   585              227
Accuracy                 80.16 %
Misclassification Rate   19.84 %
True Positive Rate       0.28
False Negative Rate      0.72
Specificity              0.95
Precision                0.614
Prevalence               0.22

Age 32-41 (AUC: 0.735)
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    2530             95
Yes (Actual)   526              183
Accuracy                 81.37 %
Misclassification Rate   18.63 %
True Positive Rate       0.258
False Negative Rate      0.742
Specificity              0.964
Precision                0.658
Prevalence               0.21

Age >= 42 (AUC: 0.73)
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    1455             67
Yes (Actual)   340              135
Accuracy                 79.62 %
Misclassification Rate   20.38 %
True Positive Rate       0.284
False Negative Rate      0.716
Specificity              0.956
Precision                0.668
Prevalence               0.237

The unsegmented model has a better overall performance by 0.72 %
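The slides report the resulting age bands but not how the clusters were formed. As an illustration only, the sketch below runs a plain one-dimensional K-means (Lloyd's algorithm) on a handful of made-up ages. The sample data, the initialization scheme, and the choice of age as the sole clustering feature are all assumptions, not details taken from the original analysis.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # no centroid moved, so the clustering is stable
            break
        centroids = updated
    return centroids, labels


# Made-up ages, for illustration only.
ages = [23, 25, 27, 29, 30, 34, 36, 38, 40, 45, 50, 55, 60]
centroids, labels = kmeans_1d(ages, k=3)
for c, center in enumerate(centroids):
    members = [a for a, lab in zip(ages, labels) if lab == c]
    print(f"cluster {c}: center {center:.1f}, ages {min(members)}-{max(members)}")
```

Each resulting cluster can then be given its own classifier, which is the structure the per-band confusion matrices on this slide suggest.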
10. ANN (EPOCH 1000, LEARNING RATE 0.3, MOMENTUM 0.2)
Data is partitioned in a 70 - 30 split for model-building purposes

Confusion Matrix
# of rows:     No (Predicted)   Yes (Predicted)
No (Actual)    6691             332
Yes (Actual)   1313             664

Class Statistics
Accuracy                 81.72 %
Misclassification Rate   18.28 %
True Positive Rate       0.336
False Negative Rate      0.664
Specificity              0.953
Precision                0.667
Prevalence               0.219

Graph: ROC Curve for ANN Model, AUC = 0.7434
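The slide title lists a learning rate of 0.3 and a momentum of 0.2 but does not name the training framework. As a rough illustration of what those two hyperparameters control, the sketch below applies the classical momentum update to a one-parameter quadratic loss. The loss function, the starting point, and the function name are hypothetical stand-ins for the real network.

```python
def train_with_momentum(grad, w0, learning_rate=0.3, momentum=0.2, epochs=1000):
    """Gradient descent with classical momentum on a single scalar weight."""
    w, velocity = w0, 0.0
    for _ in range(epochs):
        # The velocity blends the previous step (momentum) with the new gradient step.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = train_with_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"weight after training: {w_final:.6f}")
```

A larger learning rate takes bigger steps along the gradient, while momentum carries part of the previous step forward, which can smooth out oscillations between epochs.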
11. CONCLUDING POINTS
• The ANN (Neural Network) model achieved the highest accuracy,
81.72%, based on its Confusion Matrix
• Its ROC curve (AUC = 0.7434) also supports its use
for predicting loan defaulters
• The true positive rate (aka sensitivity) is the highest for the ANN model
Model                  Accuracy
Simple KNN             81.11 %
K-Means (2 Clusters)   80.78 %
K-Means (3 Clusters)   80.38 %
ANN                    81.72 %
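The AUC figures quoted for each model summarize a ROC curve in a single number: the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. A minimal sketch of that rank-based computation follows; the labels and scores are invented for illustration and do not come from the dataset.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = default, 0 = no default; scores are predicted default probabilities.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1, 0.35]
print(f"AUC = {auc(labels, scores):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values near 0.74 indicate moderate separation between defaulters and non-defaulters.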