New Generation in Credit Scoring: Machine Learning (Part 3)
Updated: Jun 16, 2020
All the models so far gave good to very good results in classifying the probability of non-default (class 0). Consequently, sensitivity results for all of the trained models were relatively high. Unfortunately, the probability of default (class 1), which would be the main concern in a credit scoring model, did not work out so well: specificity results range from a mere 35% to only 40%. Only the QDA model, with approximately 68%, and the polynomial SVM model, with around 64%, were significantly above average.
But in general, the results are far from satisfying. So it is time to take a deeper look at some of the reasons why the results have not been convincing so far.
As shown in Part 1, the data consists of 30,000 observations of a credit card portfolio. Initially there were 23 input features. Based on a correlation check, only 17 of those input features were ultimately used in the analysis. Here is an excerpt of the input parameters plus the dependent variable 'default_payment':
  LIMIT_BAL SEX EDUCATION MARRIAGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
1      2000   2         2        1     2     2    -1    -1    -2
2    120000   2         2        2    -1     2     0     0     0
3     90000   2         2        2     0     0     0     0     0
4     50000   2         2        1     0     0     0     0     0
5     50000   1         2        1    -1     0    -1     0     0
6     50000   1         1        2     0     0     0     0     0
  PAY_6 BILL_AMT1 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5
1    -2      3913        0      689        0        0        0
2     2      2682        0     1000     1000     1000        0
3     0     29239     1518     1500     1000     1000     1000
4     0     46990     2000     2019     1200     1100     1069
5     0      8617     2500    36681    10000     9000      689
6     0     64400     2500     1815      657     1000     1000
  PAY_AMT6 default_payment
1        0               1
2     2000               1
3     5000               0
4     1000               0
5      679               0
6      800               0
The correlation matrix below, consisting of the 17 remaining input variables plus the variable 'default_payment', offers an idea of why the machine learning models did not perform better.
The strongest correlation between 'default_payment' and the input features is with the monthly payment status data PAY_0, PAY_2 to PAY_6 (these features indicate whether a payment was made in due time or, respectively, how many months a payment is overdue). The top value here is the correlation between 'default_payment' and PAY_0, amounting to 0.32. But even this top value is quite weak. On the other hand, correlation among the input variables themselves can be quite strong. This is especially true with respect to the feature group PAY_0, PAY_2 to PAY_6. As an example, the correlation between PAY_5 and the other predictors in this feature group lies between 0.50 and 0.81. Strong correlation among the input parameters combined with weak correlation between the dependent variable and the input variables is not a good starting point.
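This pattern, strong correlation within a feature group but only weak correlation with the target, is easy to reproduce with a small simulation. The sketch below is in Python/NumPy (this series otherwise uses R) and the data is purely synthetic, a toy stand-in for PAY_0/PAY_5 and 'default_payment':

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated "payment status" features and a binary target
# that depends only weakly on them (all numbers are made up).
pay_0 = rng.normal(size=5000)
pay_5 = 0.8 * pay_0 + 0.6 * rng.normal(size=5000)  # strong inter-feature link
default = (0.3 * pay_0 + rng.normal(size=5000) > 1.2).astype(float)

corr_features = np.corrcoef(pay_0, pay_5)[0, 1]    # high, around 0.8
corr_target = np.corrcoef(pay_0, default)[0, 1]    # low
print(round(corr_features, 2), round(corr_target, 2))
```

The features predict each other far better than they predict the target, which is exactly the situation the correlation matrix reveals for the portfolio.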
Another pathway to look at the problem is by means of Principal Component Analysis (PCA). In principle, techniques like PCA reduce the feature space by creating new attributes via (weighted) linear combinations of the existing input parameters. Those newly created parameters (called principal components) are uncorrelated with each other. The big advantage is that in most cases the variance of the original feature space can then be explained by just a few principal components, which is very useful when facing complex feature spaces.
In the case of the credit card portfolio, the first two principal components cover around 40% of the input parameters' variance, and up to 10 principal components are needed to account for 80% of the variation.
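How a few principal components can absorb most of the variance of a correlated feature set can be sketched numerically. The following Python/NumPy example uses simulated data (not the actual portfolio): 17 standardised features driven by four latent factors, with PCA computed via SVD:

```python
import numpy as np

rng = np.random.default_rng(1)

# 17 features generated from only 4 latent factors plus noise, mimicking
# a feature space with strongly correlated groups such as PAY_0..PAY_6.
latent = rng.normal(size=(1000, 4))
mix = rng.normal(size=(4, 17))
X = latent @ mix + 0.5 * rng.normal(size=(1000, 17))
X = (X - X.mean(0)) / X.std(0)          # standardise before PCA

# PCA via SVD of the centred data matrix
U, S, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
explained = S**2 / np.sum(S**2)         # variance ratio per component
cumulative = np.cumsum(explained)
print(np.round(cumulative[:5], 2))
```

Because only four latent factors generate the data, the first few components capture most of the variance; for the real portfolio the decay is slower, which is why up to 10 components are needed.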
And as the breakdown of principal component 1 (Dim1) and principal component 2 (Dim2) below reveals, it is the feature group PAY_0, PAY_2 to PAY_6 that makes the heaviest contribution.
Handling the Data Set
Let's have a look at some of the input parameters of the data pool. Here is a breakdown of the payment situation of the previous month (PAY_0) and the credit card limit (LIMIT_BAL), split into default (class 1) and non-default cases (class 0):
Default and non-default areas overlap considerably. This is true for any credit limit, but the overlap is significantly stronger for smaller credit amounts. There are non-default cases even with payments overdue by five months (and beyond), while there are default events in the early payment region. Watching this development over the previous six months, one can see that this pattern prevails during the whole period.
From these results we can draw two conclusions. On the basis of the available data, the two groups default/non-default are not clustered enough. More importantly, the constant mixing up, which happens especially with small credit limits, means that either the existing credit scoring system is not being handled properly or the classification based on the current scheme is not working effectively (as said before, a lot of default cases sit in the 'early payment' area while a lot of non-default cases have payment delays of 5+ months). This could be a sign of operational risk.
Imbalanced Data Structure
The whole data set consists of 78% non-default clients and 22% default clients. In other words, the data is imbalanced. This, though, should not be unusual for a credit portfolio: hopefully, there are far fewer default cases than non-default cases in a credit exposure. The issue is that imbalanced data can cause trouble for classifying machine learning algorithms. Imbalanced data is rather typical for binary data sets. The main problem is that classification algorithms tend to be biased toward the majority group as they try to improve the overall accuracy.
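Why overall accuracy alone is misleading on such data can be seen with a toy calculation: a "model" that always predicts the majority class already reaches 78% accuracy while catching not a single default. A minimal Python sketch, following this article's convention that sensitivity refers to class 0 and specificity to class 1:

```python
import numpy as np

# 78% non-default (0) vs 22% default (1), as in the credit card data set
y = np.array([0] * 7800 + [1] * 2200)

# A degenerate "classifier" that always predicts the majority class
pred = np.zeros_like(y)

accuracy = np.mean(pred == y)             # 0.78, looks decent
sensitivity = np.mean(pred[y == 0] == 0)  # class 0 recall: 1.0
specificity = np.mean(pred[y == 1] == 1)  # class 1 recall: 0.0
print(accuracy, sensitivity, specificity)
```

An algorithm optimising accuracy is therefore pulled toward exactly this degenerate behaviour, which is the bias described above.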
Without further handling, the majority class (i.e. class 0) will prevail over the minority class (i.e. class 1). Here is a summary of the results so far:
Method                 Accuracy   Sensitivity   Specificity
-----------------------------------------------------------
logistic regression      0.793       0.879         0.491
knn                      0.789       0.914         0.355
qda                      0.696       0.701         0.678
lda                      0.810       0.971         0.245
decision tree            0.817       0.950         0.352
random forest            0.814       0.948         0.346
svm                      0.743       0.773         0.637
Possibilities to improve the results are:
Rebalancing the data set
Changing the classification threshold, as done with the logistic regression model in Part 1
Implementing a cost function: this penalises wrong classifications of different groups in different ways. This approach was used in the support vector machine model shown in Part 2, where the wrong classification of default cases was penalised much more heavily than the wrong classification of non-default cases.
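To illustrate the threshold option: moving the classification cutoff trades sensitivity for specificity. The sketch below uses Python with simulated default probabilities (the beta distributions are an assumption for illustration, not actual model output) and lowers the cutoff from 0.5 to 0.3:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated predicted default probabilities: defaults tend to score higher
p_default = np.concatenate([rng.beta(2, 6, 780),    # non-defaults (class 0)
                            rng.beta(4, 4, 220)])   # defaults (class 1)
y = np.array([0] * 780 + [1] * 220)

results = {}
for cutoff in (0.5, 0.3):
    pred = (p_default > cutoff).astype(int)
    sens = np.mean(pred[y == 0] == 0)   # class 0 recall (sensitivity)
    spec = np.mean(pred[y == 1] == 1)   # class 1 recall (specificity)
    results[cutoff] = (sens, spec)
    print(cutoff, round(sens, 2), round(spec, 2))
```

Lowering the cutoff flags more observations as defaults, so specificity rises at the expense of sensitivity, which is exactly the trade-off exploited in Part 1.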
Rebalancing the data set means making the two groups equal in size. This happens either by upsampling the minority group or by downsizing the majority group. Another option is to do both, i.e. upsampling and downsizing, at the same time, or to create a synthetically balanced data set. So, based on the initial allocation in the training data set:
table(train$default_payment)

class:       0      1
number:  16361   4639
the minority group can be upsampled (called oversampling):
over.train_set <- ovun.sample(default_payment~., data=train, method="over", N=32722)$data
table(over.train_set$default_payment)

class:       0      1
number:  16361  16361
or the majority group can be downsized (called undersampling):
under.train_set <- ovun.sample(default_payment~., data=train, method="under", N=9278)$data
table(under.train_set$default_payment)

class:      0     1
number:  4639  4639
or both approaches can be done at the same time:
both.train_set <- ovun.sample(default_payment~., data=train, method="both", p=0.5)$data
table(both.train_set$default_payment)

class:       0      1
number:  10564  10564
or additional data points of the minority class can be created synthetically:
rose.train_set <- ROSE(default_payment~., data=train, seed=1)$data
table(rose.train_set$default_payment)

class:       0      1
number:  10557  10443
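The mechanics behind the random resampling calls above can also be written out by hand. Here is a minimal Python/NumPy sketch of over- and undersampling on the same class counts (ROSE's synthetic smoothed-bootstrap generation is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Imbalanced toy training labels with the article's class counts
y = np.array([0] * 16361 + [1] * 4639)
idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]

# Oversampling: draw minority indices WITH replacement up to majority size
over = np.concatenate([idx0, rng.choice(idx1, size=len(idx0), replace=True)])

# Undersampling: draw a majority subset WITHOUT replacement down to minority size
under = np.concatenate([rng.choice(idx0, size=len(idx1), replace=False), idx1])

print(np.bincount(y[over]))   # both classes: 16361
print(np.bincount(y[under]))  # both classes: 4639
```

The index arrays would then be used to select the corresponding rows of the feature matrix, which is essentially what ovun.sample returns as a ready-made data frame.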
K-Nearest Neighbour Model
As an example, the k-nearest neighbour model is trained on each of the balanced data sets:
1. Oversampled balanced data:

over.knn_fit <- train(factor(default_payment)~., method="knn", data=over.train_set,
                      tuneGrid=data.frame(k=seq(1,35,1)), trControl=control)
cm_over.knn <- confusionMatrix(predict(over.knn_fit, test), factor(test$default_payment))

Accuracy: 0.727   Sensitivity: 0.818   Specificity: 0.408

2. Undersampled balanced data:

under.knn_fit <- train(factor(default_payment)~., method="knn", data=under.train_set,
                       tuneGrid=data.frame(k=seq(1,35,1)), trControl=control)
cm_under.knn <- confusionMatrix(predict(under.knn_fit, test), factor(test$default_payment))

Accuracy: 0.746   Sensitivity: 0.788   Specificity: 0.597

3. Majority and minority class adjusted at the same time:

both.knn_fit <- train(factor(default_payment)~., method="knn", data=both.train_set,
                      tuneGrid=data.frame(k=seq(1,35,1)), trControl=control)
cm_both.knn <- confusionMatrix(predict(both.knn_fit, test), factor(test$default_payment))

Accuracy: 0.676   Sensitivity: 0.720   Specificity: 0.518

4. Synthetically balanced data:

rose.knn_fit <- train(factor(default_payment)~., method="knn", data=rose.train_set,
                      tuneGrid=data.frame(k=seq(1,35,1)), trControl=control)
cm_rose.knn <- confusionMatrix(predict(rose.knn_fit, test), factor(test$default_payment))

Accuracy: 0.749   Sensitivity: 0.796   Specificity: 0.587
With the balanced data in place, the results for classifying the probability of default improved significantly in most cases. It should be noted, though, that each of those balancing approaches brings certain disadvantages. For instance, in the case of undersampling, a lot of information can be lost in the course of shrinking the majority class. Another issue is the so-called sample selection bias. In a binary classification the machine learning model is trained on the training data, and the results are used to generate predictions on the test data. It is assumed that both data sets (training and test data) come from the same distribution. This assumption may not hold when the training data is balanced while the test data has to stay in its original state (i.e. imbalanced). This bias is amplified when the data clusters are not separated well enough (as is the case with the credit card portfolio).
A full discussion of this topic would exceed the scope of this article, and as far as I can tell, papers on this issue are rather rare. The papers by Andrea Dal Pozzolo et al. and by Maarit Widmann and Alfredo Roccato are certainly interesting starting points; see the references below.
I decided to synthetically balance the training data and run logistic regression as well as all of the machine learning models previously discussed. The overall results are as follows:
Method                 Accuracy   Sensitivity   Specificity
-----------------------------------------------------------
logistic regression      0.694       0.716         0.617
knn                      0.749       0.796         0.587
qda                      0.540       0.462         0.814
lda                      0.700       0.726         0.610
decision tree            0.790       0.869         0.516
random forest            0.652       0.631         0.726
svm                      0.750       0.793         0.601
It turns out that, in general, the results for all of the models with respect to classifying the probability of default clearly improved while keeping the classification of the probability of non-default approximately in line. The only exception is quadratic discriminant analysis. Here we see a reversal of the initial problem, i.e. too much focus on specificity while sensitivity deteriorates too quickly.
To be sure, data quality and data structure need to be considered before diving into the modelling of classification algorithms. The available data for this credit card portfolio revealed only a weak, in many cases outright poor, correlation between the input parameters and the variable of interest, 'default/non-default'. At the same time, the correlation among input features, especially within certain groups of features, is quite strong. This combination is not an ideal prerequisite for a successful classification task. Additionally, the imbalanced nature of the data pool, typical as it is for this kind of question and business field, turns out to be another obstacle to classifying the observations correctly. The intermingled topography of the two classes in question only aggravates the poor data situation.
In general, there are two main stages in which to overcome those hurdles. First of all, the pre-processing of the data is very important. Rebalancing the minority and majority groups is one example in this stage; the implementation of additional data streams might be another option. It is also worth noting that while analysing the data, certain signs of operational risk in the existing credit scoring model could already be identified. The second stage is to use the fine-tuning possibilities a machine learning model might offer. Adjusting the classification threshold in the logistic regression model, or penalising false negatives more harshly in the support vector machine model, are just two examples in this stage which were used earlier in this series.
Taking all these constraints into account, the results of most of the machine learning tools, and also the results produced by the logistic regression model, are rather impressive.
So much for now. In the next article we will have a look at cost functions as a means of tuning machine learning algorithms. And slowly but surely, we'll move on to neural networks.
References

On the Relationship Between Feature Selection and Classification Accuracy by Janecek et al., 2008
Practical Guide to Principal Component Methods in R by Alboukadel Kassambara, 2017
Should one remove highly correlated variables before doing PCA? (discussion on Cross Validated), 2013
Making Predictions on Test Data after Principal Component Analysis in R, Analytics Vidhya, July 28, 2016
Dealing with unbalanced data in machine learning by Shirin Glander, April 2, 2017
Practical Guide to deal with Imbalanced Classification Problems in R, Analytics Vidhya, March 28, 2016
Practical Guide to Principal Component Analysis (PCA) in R and Python, Analytics Vidhya, March 21, 2016
From Modeling to Scoring: Correcting Predicted Class Probabilities in Imbalanced Datasets by Maarit Widmann and Alfredo Roccato, July 10, 2019
Imbalanced - Cost function - Best Cutoff in R (Credit Card Fraud Detection), Kaggle, 2018
Machine Learning from Imbalanced Data Sets 101 by Foster Provost, 2000
Calibrating Probability with Undersampling for Unbalanced Classification by Andrea Dal Pozzolo et al., n.d.