Christian Schitton

New Generation in Credit Scoring: Machine Learning (Part 2)

Updated: Jun 16, 2020


Intro


Welcome back to this short series on credit scoring models! In Part 1 of this series (which you can find here), a credit card portfolio of a Taiwanese financial institution was examined. The aim was to evaluate the possibilities of transforming a risk-class credit scoring approach into a 'probability of default' prediction approach. In a first trial, logistic regression was tested. The main problem with the logistic regression model was that, even after some fine-tuning, the failure rate in predicting the default class remained quite high at slightly above 50%.


Now we discuss whether other families of machine learning models give better results than logistic regression. In this context, k-nearest neighbour, quadratic discriminant analysis, decision tree, random forest and support vector machine models are tested. The data pool is the same as in Part 1.


Please note that the articles in this series are meant as an easy entry into the topic of machine learning, taking credit scoring as an example. A more sophisticated analysis of the available data structure at the beginning of the whole process, more in-depth reasoning about the feature space to be incorporated in a model, or a much more exhaustive use of the tuning possibilities within each model would be necessary to achieve robust results.


K-Nearest Neighbour


A first k-nearest neighbour model set-up



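library(caret)  # provides knn3() and confusionMatrix()

set.seed(1)

# fit k-nearest neighbours on all features (knn3 defaults to k = 5)
knn_fit <- knn3(factor(default_payment) ~ ., data = train)
# predict class labels for the test set
knn_y_hat <- predict(knn_fit, test, type = "class")

confusionMatrix(knn_y_hat, factor(test$default_payment))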


gives the following results:



          Accuracy: 0.7898
            95% CI: (0.7812, 0.7982)
          
       Sensitivity: 0.9138
       Specificity: 0.3550
        Prevalence: 0.7781
 Balanced Accuracy: 0.6344
        
  'Positive' Class: 0
        

At 35.5%, the specificity is much lower than with the logistic regression model, while sensitivity is again above 90%. This means that the model identifies non-default cases quite accurately but fails badly at identifying the default class. But the probability of default is exactly what we are interested in.


Even after trying different numbers of nearest neighbours for predicting a data point (the 'number of k' shown on the x-axis in the graph below), there is no improvement in the overall performance of the algorithm. The plot below clearly indicates that, given the test set, the model correctly identifies only 35 to 40% of the default cases (see the green line).
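The idea of scanning over k can be sketched in a few lines. This is only an illustration under loud assumptions: it uses `knn()` from the `class` package instead of the `knn3()` call shown above, and it runs on synthetic, imbalanced toy data (`train_toy`, `test_toy` and `specificity_for_k` are hypothetical names), not on the credit card portfolio.

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)

# ASSUMPTION: synthetic two-class data, NOT the credit card data.
# Roughly 80% 'class 0' and 20% 'class 1' with overlapping features.
make_data <- function(n) {
  y <- rbinom(n, 1, 0.2)
  x <- matrix(rnorm(2 * n, mean = rep(y, 2)), ncol = 2)
  list(x = x, y = factor(y, levels = c(0, 1)))
}

train_toy <- make_data(2000)
test_toy  <- make_data(1000)

# Specificity = share of observed 'class 1' cases predicted as 'class 1'
# ('class 0' is treated as the positive class, as in this article).
specificity_for_k <- function(k) {
  pred <- knn(train_toy$x, test_toy$x, cl = train_toy$y, k = k)
  mean(pred[test_toy$y == "1"] == "1")
}

k_values <- seq(3, 51, by = 4)
spec <- sapply(k_values, specificity_for_k)
data.frame(k = k_values, specificity = round(spec, 3))
```

Plotting `spec` against `k_values` would give a curve of the same kind as the green line described above, though the numbers of this toy example say nothing about the real portfolio.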



According to the model, with non-default (or 'class 0') being the positive class and default (or 'class 1') being the negative class, the number of true negatives (i.e. observed 'class 1' cases predicted as 'class 1' cases) is too small for the model to be successful for this purpose. As displayed in the table below, just 709 observed 'class 1' cases in the test set were also predicted as 'class 1' cases (= true negatives), but 1288 observed 'class 1' cases were wrongly predicted as 'class 0' cases (= false positives).



confusionMatrix(knn_y_hat,factor(test$default_payment))$table

                    Reference
      Prediction       0     1
                0   6399  1288
                1    604   709
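To make the link between this table and the metrics reported earlier explicit, the following base R sketch recomputes them by hand. The counts are copied from the table; the variable names are only illustrative.

```r
# Confusion matrix of the k-nearest neighbour model
# (rows = prediction, columns = reference; positive class = 0).
cm <- matrix(c(6399, 604, 1288, 709), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["0", "0"]  # non-default predicted as non-default
fn <- cm["1", "0"]  # non-default predicted as default
fp <- cm["0", "1"]  # default predicted as non-default
tn <- cm["1", "1"]  # default predicted as default

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / sum(cm)
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, `Balanced Accuracy` = balanced), 4)
```

The rounded values agree with the `confusionMatrix()` output shown above (0.7898, 0.9138, 0.3550 and 0.6344).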


Quadratic Discriminant Analysis


It turns out that the quadratic discriminant analysis (qda) model shows much better results:



                    true value
                       0     1
predicted value 0   4909   644
                1   2094  1353


There is a much higher number of true negatives, i.e. many more default cases are predicted correctly. Hence, the specificity rises to 67.8%, which is clearly higher than for the other models seen so far. However, this does not come without cost: overall accuracy is down to 69.6%. Some metrics of the model:



         Accuracy: 0.6958
           95% CI: (0.6862, 0.7053)
         
      Sensitivity: 0.7010
      Specificity: 0.6775


Given the test set, the qda model classifies roughly 2/3 of the observed default cases correctly. But to put this result into perspective, when using this tool around 30% of non-default clients would be classified as 'probability of default', which of course would mean lost credit clients. Still, this is probably a model to work with, subject to further analysis, although qda is generally not the first choice for larger data sets.
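The trade-off can be put side by side with the k-nearest neighbour model. The sketch below uses only the counts from the two confusion matrices in this article; `error_profile` is a hypothetical helper name chosen for illustration.

```r
# Confusion matrices from this article (rows = predicted, columns = true value).
labels <- list(predicted = c("0", "1"), true = c("0", "1"))
knn_cm <- matrix(c(6399, 604, 1288, 709), nrow = 2, dimnames = labels)
qda_cm <- matrix(c(4909, 2094, 644, 1353), nrow = 2, dimnames = labels)

# The two kinds of error, as counts and as shares of the respective true class.
error_profile <- function(cm) {
  c(missed_defaults     = cm["0", "1"],                     # default seen as non-default
    missed_default_rate = cm["0", "1"] / sum(cm[, "1"]),
    lost_clients        = cm["1", "0"],                     # non-default seen as default
    lost_client_rate    = cm["1", "0"] / sum(cm[, "0"]))
}

profiles <- rbind(knn = error_profile(knn_cm), qda = error_profile(qda_cm))
round(profiles, 3)
```

Read this way, qda halves the number of missed defaults (644 instead of 1288) at the price of more than three times as many wrongly rejected clients (2094 instead of 604).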


Short side note: linear discriminant analysis (LDA) barely matches the results of the logistic regression model, which should not be a surprise. The LDA approach is simply too inflexible for the current data structure to be useful.


Decision Tree and Random Forest


To make a long story short, one of the model-setups resulted in this decision tree:


With an accuracy of 81.7%, only around 1/3 of the observed default cases were predicted as default. So with decision trees, too, the number of true negatives is too small to make for a useful model. That is definitely not good news.


So, the next logical step is to spice up the overall approach and try random forest models. In this case, several parameters were fine-tuned (minimum node size, number of randomly selected features to be taken into account and similar). The metric to be optimised in the model was also changed (from 'Accuracy' to 'Kappa': this will be further explained when discussing the data structure). Finally, the default cases (the minority class) were given more weight in the overall model by up-sampling. The set-up is as follows:



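library(caret)  # train() and trainControl(); method "Rborist" needs the Rborist package

# 5-fold cross-validation, repeated 5 times, with up-sampling of the minority class
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        repeats = 5,
                        sampling = "up")

# tuning grid: minimum node size and number of predictors tried per split
grid <- expand.grid(minNode = c(1,3,5),
                    predFixed = seq(1,17,by=1))

# features in columns 1 to 17, outcome in column 18
train_rf <- train(x = train[,1:17], y = factor(train[,18]),
                  method = "Rborist",
                  metric = "Kappa",
                  nTree = 50,
                  trControl = control,
                  tuneGrid = grid,
                  nSample = 5000)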
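As a brief aside on the switch from 'Accuracy' to 'Kappa': the base R sketch below computes Cohen's kappa by hand for two confusion matrices. The first is the k-nearest neighbour table from this article; the second is a hypothetical trivial model that labels every test client as non-default. `cohen_kappa` is an illustrative helper, not a function taken from caret.

```r
# Cohen's kappa for a 2x2 confusion matrix (rows = predicted, columns = observed).
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (= accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# k-nearest neighbour table from this article.
knn_cm <- matrix(c(6399, 604, 1288, 709), nrow = 2)

# HYPOTHETICAL model: every one of the 9000 test clients is labelled non-default.
all_zero_cm <- matrix(c(7003, 0, 1997, 0), nrow = 2)

round(c(knn_accuracy     = accuracy(knn_cm),
        knn_kappa        = cohen_kappa(knn_cm),
        trivial_accuracy = accuracy(all_zero_cm),
        trivial_kappa    = cohen_kappa(all_zero_cm)), 3)
```

The trivial model reaches almost the same accuracy as the k-nearest neighbour model but a kappa of exactly zero, which is one reason why kappa can be the more informative tuning target on imbalanced data.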


But here too, the specificity rates do not reach the benchmark given by the logistic regression model. In the best-case scenario, the share of correctly identified default cases (true negatives) hovers around 35%. The scenarios are plotted below to illustrate the results:



Those results are not a surprise given the outcome of the decision tree model. A random forest, in principle, averages the results of many randomly varied decision trees. Hence, the average of generally weak results stays a weak result.


Support Vector Machines


Support vector machines (svm) with linear, polynomial and radial kernels were tested. It turns out that the polynomial svm gives the strongest results. Another useful feature of svm models is that errors in the separate classes can be penalised differently. So in this example, we took it for granted that not recognising a possible default hurts much more than just losing a client (i.e. classifying a non-default client as 'probability of default'). Hence, class weights that penalise false positives (default cases predicted as non-default) more heavily were implemented. Here are the results:



           Accuracy: 0.7427
             95% CI: (0.7335, 0.7517)
           
        Sensitivity: 0.7730
        Specificity: 0.6365
        
        
                      true value
                         0     1
   predicted value 0  5413   726
                   1  1590  1271  


Specificity is not as good as with the qda model but significantly better than for all the other models seen so far. Given the outcome, this could also be a model to work with. One disadvantage is that this model, together with the random forest simulation, took the longest time to compute.
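The idea that a missed default hurts more than a lost client can also be applied when comparing the finished models. The sketch below weights the error counts reported in this article with a cost ratio; the ratio of 5 to 1 is purely an assumption for illustration and is not the weighting used when training the svm.

```r
# Error counts taken from the confusion matrices in this article.
errors <- data.frame(
  model           = c("knn", "qda", "svm_poly"),
  missed_defaults = c(1288, 644, 726),   # default predicted as non-default
  lost_clients    = c(604, 2094, 1590)   # non-default predicted as default
)

# ASSUMPTION: illustrative cost ratio, not taken from the article.
cost_missed_default <- 5
cost_lost_client    <- 1

errors$weighted_cost <- cost_missed_default * errors$missed_defaults +
                        cost_lost_client    * errors$lost_clients

# Cheapest model first.
errors[order(errors$weighted_cost), ]
```

With this particular ratio the polynomial svm and qda end up close together and well ahead of k-nearest neighbour; a different ratio can change the ranking, so the choice of weights deserves as much care as the choice of model.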


Conclusion


So far, only quadratic discriminant analysis and support vector machines (the polynomial version) gave a significant improvement in results compared to the logistic regression model. One of the basic problems is that the data is quite imbalanced: only around 1/5 of the observations represent default cases. I think it is time to have a look at the data itself. This will be done in Part 3 of this series before we check on neural networks and gradient boosting.
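To give a feeling for the imbalance, the sketch below rebuilds a label vector with the class counts of the test set used in this article (7003 non-default, 1997 default) and applies a naive up-sampling step in base R. This is only meant to be conceptually similar to the `sampling = "up"` option in the random forest set-up; it is an assumption-laden illustration, not caret's implementation.

```r
set.seed(1)

# Labels with the class counts of the test set in this article.
y <- factor(c(rep(0, 7003), rep(1, 1997)))
round(prop.table(table(y)), 3)

# Naive up-sampling: draw minority-class rows with replacement
# until both classes have the same number of observations.
idx_major <- which(y == "0")
idx_minor <- which(y == "1")
idx_up    <- c(idx_major,
               sample(idx_minor, length(idx_major), replace = TRUE))

table(y[idx_up])
```

After up-sampling both classes contain 7003 observations, so a model trained on `idx_up` can no longer score well simply by favouring the majority class.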


References


Hands-On Machine Learning with R by Bradley Boehmke and Brandon Greenwell, 2019


Tune Machine Learning Algorithms in R (random forest case study) by Jason Brownlee, published in Machine Learning Mastery, August 22, 2019


Dealing with unbalanced data in machine learning by Shirin's playgRound, published in R-bloggers, April 1, 2017