Christian Schitton

Dec 17, 2019 · 5 min

New Generation in Credit Scoring: Machine Learning (Part 2)

Updated: Jun 16, 2020

Intro

Welcome back to this short series on credit scoring modelling! In Part 1 of this series (which you can find here) a credit card portfolio of a Taiwanese financial institution was examined. The aim was to evaluate the possibilities of transforming a risk-class credit scoring approach into a 'probability of default' prediction approach. As a first trial, logistic regression was tested. The main problem with the logistic regression model was that, even after some fine tuning, the failure rate in predicting the default class remained quite high, namely slightly above 50%.

Now we discuss the question whether different groups of machine learning models give better results than logistic regression. In this context k-nearest neighbour, quadratic discriminant analysis, decision tree, random forest and support vector machine models are tested. The data pool is the same as in Part 1.

Please note that the articles of this series are meant as a gentle introduction to the topic of machine learning, taking credit scoring as an example. A more sophisticated analysis of the available data structure at the beginning of the whole process, in-depth reasoning about the feature space to be incorporated in a model, or a more thorough exploration of the tuning possibilities within each model would be necessary in order to achieve more robust results.

K-Nearest Neighbour

A first k-nearest neighbour model set-up

library(caret)   # provides knn3() and confusionMatrix()

set.seed(1)

knn_fit <- knn3(factor(default_payment) ~ ., data = train)
knn_y_hat <- predict(knn_fit, test, type = "class")

confusionMatrix(knn_y_hat, factor(test$default_payment))

gives the following results:


 
Accuracy: 0.7898
95% CI: (0.7812, 0.7982)

Sensitivity: 0.9138
Specificity: 0.3550
Prevalence: 0.7781
Balanced Accuracy: 0.6344

'Positive' Class: 0

At 35.5%, specificity is much lower than in the logistic regression model, while sensitivity is again above 90%. This means the model predicts non-default quite accurately but fails heavily in predicting the default class. Yet the probability of default is exactly what we are interested in.

Even after trying different numbers of similar (nearest) observations used to predict a data point (the 'number of k' shown on the x-axis in the graph below), there is no improvement in the overall performance of the algorithm. The plot clearly indicates that, given the test set, the model predicts the default class correctly in just 35 to 40% of the cases (see the green line).
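A sketch of how such a scan over k could be run (one possible set-up, reusing the train and test objects from the snippet above; the choice of the k-grid is an assumption):

ks <- seq(5, 100, by = 5)

# fit a kNN model for each k and record the specificity on the test set
spec_by_k <- sapply(ks, function(k) {
  fit <- knn3(factor(default_payment) ~ ., data = train, k = k)
  y_hat <- predict(fit, test, type = "class")
  confusionMatrix(y_hat, factor(test$default_payment))$byClass["Specificity"]
})

plot(ks, spec_by_k, type = "b",
     xlab = "number of k", ylab = "specificity (default class)")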

According to the model, with non-default ('class 0') as the positive class and default ('class 1') as the negative class, the number of true negatives (observed 'class 1' cases predicted as 'class 1') is too small for the model to be useful for this purpose. As displayed in the table below, only 709 observed 'class 1' cases in the test set were also predicted as 'class 1' (true negatives), while 1,288 observed 'class 1' cases were wrongly predicted as 'class 0' (false positives).


 
confusionMatrix(knn_y_hat, factor(test$default_payment))$table

          Reference
Prediction    0    1
         0 6399 1288
         1  604  709

Quadratic Discriminant Analysis

It turns out that the quadratic discriminant analysis (QDA) model shows much better results:


 
                     true value
                        0      1
predicted value  0   4909    644
                 1   2094   1353

There is a much higher number of true negatives, i.e. many more default cases are predicted correctly. Hence, specificity goes up to 67.8%, clearly higher than in the other models seen so far. This does not come without costs, though: overall accuracy drops to 69.6%. Some metrics of the model:


 
Accuracy: 0.6958
95% CI: (0.6862, 0.7053)

Sensitivity: 0.7010
Specificity: 0.6775

Given the test set, the QDA model classifies roughly two thirds of the non-default clients correctly. To put this result into perspective, though: using this tool, around 30% of the non-default clients would be flagged as 'probability of default' and would of course be lost as credit clients. Still, it is probably a model to work with, subject to further analysis. That said, QDA is generally not the first choice for larger data sets.
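The QDA code itself is not shown in this post; a minimal sketch of how such a fit could be produced with MASS::qda (one possible set-up, reusing the train/test objects from above) looks like this:

library(MASS)   # provides qda()

qda_fit <- qda(factor(default_payment) ~ ., data = train)

# predict() on a qda object returns a list; $class holds the predicted labels
qda_y_hat <- predict(qda_fit, test)$class

confusionMatrix(qda_y_hat, factor(test$default_payment))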

Short side-note: Linear discriminant analysis (LDA) barely matches the results of the logistic regression model, which should not be a surprise. The LDA approach is simply too inflexible for the current data structure to be useful.

Decision Tree and Random Forest

To make a long story short, one of the model set-ups resulted in this decision tree:

With an accuracy of 81.7%, only around one third of the observed default cases were predicted as defaults. So with decision trees, too, the number of true negatives is too small to leave room for a useful model. That is definitely not good news.
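The tree code is not listed in the post either; a typical set-up with rpart (a sketch under the same train/test assumptions, not necessarily the exact configuration used here) could look like this:

library(rpart)
library(rpart.plot)

# grow a classification tree; cp controls how aggressively the tree is pruned
tree_fit <- rpart(factor(default_payment) ~ ., data = train,
                  method = "class", cp = 0.01)

rpart.plot(tree_fit)

tree_y_hat <- predict(tree_fit, test, type = "class")
confusionMatrix(tree_y_hat, factor(test$default_payment))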

So, the next logical step is to spice up the overall approach and try random forest models. In this case, several parameters were fine tuned (the minimum number of nodes, the number of randomly selected features to be considered, and similar). The metric to be optimised was also changed from 'Accuracy' to 'Kappa' (this will be explained further when discussing the data structure). Finally, the minority class, i.e. the default cases, was upsampled to give it more weight in the overall model. The set-up was as follows:


 
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        repeats = 5,
                        sampling = "up")   # up-sample the minority class

grid <- expand.grid(minNode = c(1, 3, 5),
                    predFixed = seq(1, 17, by = 1))

train_rf <- train(x = train[, 1:17], y = factor(train[, 18]),
                  method = "Rborist",
                  metric = "Kappa",
                  nTree = 50,
                  trControl = control,
                  tuneGrid = grid,
                  nSample = 5000)
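Once trained, the tuned forest can be evaluated on the test set in the usual way (a short sketch; column 18 is assumed to hold default_payment, as in the training call above):

rf_y_hat <- predict(train_rf, test[, 1:17])

confusionMatrix(rf_y_hat, factor(test[, 18]))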
 

But here too, the specificity rates do not reach the benchmark set by the logistic regression model. In the best case scenarios, the true negative rate floats at around 35%. The scenarios are plotted below to illustrate the results:

Those results are not a surprise given the outcome of the decision tree model. Random forest, in principle, averages the results of many simulated decision trees. Hence, the average of generally weak results stays a weak result.

Support Vector Machines

Support vector machines (SVM) with linear, polynomial and radial kernels were tested. It turns out that the polynomial SVM gives the strongest signals. Another useful feature of SVM models is that misclassifications in the two classes can be penalised differently. In this example, we took it for granted that failing to recognise a probable default hurts much more than merely losing a client (i.e. classifying a non-default client as a probable default). Hence, class weights to the disadvantage of false positives were implemented. Here are the results:


 
Accuracy: 0.7427
95% CI: (0.7335, 0.7517)

Sensitivity: 0.7730
Specificity: 0.6365

                     true value
                        0      1
predicted value  0   5413    726
                 1   1590   1271

Specificity is not as good as with the QDA model but significantly better than in all the other models seen so far. Given this outcome, this could also be a model to work with. One disadvantage is that this model, together with the random forest simulation, took the longest time to compute.
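The exact SVM call is not shown in the post. With the e1071 package, a class-weighted polynomial SVM along the following lines would be one way to set this up (the kernel degree and the class weights are illustrative assumptions, not the values behind the results above):

library(e1071)

# penalise missing a default ("1") more heavily than losing a client ("0");
# the weights below are illustrative only
svm_fit <- svm(factor(default_payment) ~ ., data = train,
               kernel = "polynomial", degree = 3,
               class.weights = c("0" = 1, "1" = 3))

svm_y_hat <- predict(svm_fit, test)
confusionMatrix(svm_y_hat, factor(test$default_payment))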

Conclusion

So far, only quadratic discriminant analysis and support vector machines (the polynomial version) gave a significant improvement over the logistic regression model. One of the basic problems is that the data is quite imbalanced: only around one fifth of the observations represent default cases. I think it is time to have a look at the data itself. This will be done in Part 3 of this series, before we check on neural networks and gradient boosting.
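A quick check of the class balance in the training data illustrates the point (assuming the same train object as above):

# share of non-default ("0") vs. default ("1") observations;
# roughly 4/5 vs. 1/5, in line with the prevalence of 0.7781 reported above
prop.table(table(train$default_payment))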
