• Christian Schitton

New Generation in Credit Scoring: Machine Learning (Part 4)

Updated: May 8, 2020


Intro


Here we are again. This is a miniseries focusing on the possibilities of machine learning tools in the credit scoring of financial institutions. Part One of this series talked about the composition of the data set and how to pre-process this data for the purpose of classifying clients into defaulting and non-defaulting borrowers. Logistic Regression was applied as a basic classification tool. The regression results gave a benchmark for the machine learning instruments to be discussed later on.


Part Two of this series handled a bunch of machine learning toolkits and revealed that, on the one hand, it is not enough to address the topic of 'Accuracy' when comparing the success of different machine learning approaches. On the other hand, the results achieved were mediocre, to put it mildly, and in some cases they were no better than the logistic regression outcome. Only Quadratic Discriminant Analysis and Support Vector Machines turned out to be positive exceptions. Part Three finally uncovered the reasons for those lukewarm findings and tried different paths to circumvent those issues. Much better results could then be achieved with most of the machine learning algorithms.


Now, equipped with the outcomes of the last three parts, this section takes a dive into the area of neural networks and gradient boosting machines. And, as we come to an end with this miniseries, we will briefly discuss how the basic findings of those credit scoring models can be used in the business field of real estate.


Neural Networks


In short, a neural network has an input layer, one or more hidden layers and an output layer. Every layer consists of neurons (called nodes in the case of an artificial neural network). So, for every input feature there is a node on the input layer, and there is one node on the output layer for binary classification, or several nodes in the case of a multi-class problem. In a supervised learning approach the neural network is fed with the input factors as well as the observed values of the dependent variable and tries to extract linear, but also non-linear, relationships by setting different weights and biases for each respective node. In other words, observed data is used to train the neural network, and the network learns an approximation of the relationship by iteratively adapting its parameters.
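To make the weight/bias mechanics concrete, here is a minimal base-R sketch of what a single node computes; all numbers are made up for illustration, not taken from the models in this series:

```r
# What a single node computes: activation(sum(weights * inputs) + bias).
# All numbers below are hypothetical.
sigmoid <- function(z) 1 / (1 + exp(-z))

node_forward <- function(inputs, weights, bias) {
  sigmoid(sum(weights * inputs) + bias)
}

x <- c(0.5, -1.2, 3.0)   # one observation's input features
w <- c(0.8, 0.1, -0.4)   # weights learned during training
b <- 0.2                 # bias learned during training
node_forward(x, w, b)    # a value between 0 and 1
```

Training then consists of nudging w and b across all nodes so that the network's outputs move closer to the observed labels.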


There are different ways in which the nodes are able to communicate with each other. Hence there are different kinds of neural networks (e.g. feed-forward neural networks, recurrent neural networks, deep convolutional networks or long short-term memory networks, to name some of the pathways). However, the mechanics of a network are, in principle, always the same: each node uses a weight and a bias to transform received input into an output, while an activation function determines if a specific node should be activated and therefore release information. Here is a graphical summary of a neural network (graphic source: see References below):



Depending on the task of the neural network, there are different kinds of activation functions which can be incorporated, like the Sigmoid function, the Tanh function or the ReLU (rectified linear unit) function. And as the model is trained to best fit the real data, the whole thing is adjusted with the aim of minimising a cost function. The cost function can be, e.g., the mean square error, the mean absolute error or a multi-class cross-entropy loss function.
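For reference, the activation and cost functions just mentioned are one-liners in base R (tanh() is already built in); a small sketch with made-up inputs:

```r
# Activation functions
sigmoid <- function(z) 1 / (1 + exp(-z))   # squashes into (0, 1)
relu    <- function(z) pmax(0, z)          # rectified linear unit
# tanh() is built into base R

# Two of the cost functions mentioned above, for a binary target y
# and predicted probabilities p:
mse           <- function(y, p) mean((y - p)^2)
cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

sigmoid(0)                       # 0.5
relu(c(-2, 0, 3))                # 0 0 3
mse(c(0, 1), c(0.2, 0.7))        # 0.065
```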


Credit scoring with the neuralnet-package


The statistics software R offers different paths for working with neural networks. One of those possibilities is the neuralnet-package. It implements a feed-forward neural network with a resilient back-propagation algorithm. The package is quite flexible in modelling the network, but it also takes some computational time to train the model.


With the synthetically balanced data set in place (training on the imbalanced data set didn't achieve any suitable results; see Part Three for more details), a first setup was roughly established with one hidden layer consisting of 9 nodes and the mean square error as cost function. Given the available data quality, and especially by contrast with other machine learning approaches, the results are promising:



Accuracy:    0.668
Sensitivity: 0.655
Specificity: 0.715


Incorporating cross entropy as cost function, the results are as follows:



Accuracy:    0.727
Sensitivity: 0.756
Specificity: 0.629


The results are not bad, but given the fact that a potential client's default is more costly than losing a client due to wrong classification, using the mean square error as cost function seems more appropriate in this case. In a second setup, the network was modelled with two hidden layers (still with the mean square error as cost function):



Accuracy:    0.704
Sensitivity: 0.716
Specificity: 0.662


The ratio of true negatives, amounting to 66.2%, is high, while the rate of true positives could be kept above 70%. This is a suitable outcome, but the price is a lot of computational time spent on training the model. Here is an impression of this artificial neural network model:



Credit scoring with the nnet-package


This is the high-speed version of an artificial neural network. The price for this speed is that the model has far fewer tuning options; e.g. there is only one hidden layer available. Still, by fixing the number of nodes in the hidden layer, limiting the number of possible iterations and setting a threshold for the weights on the input data to be further considered, there should be enough flexibility to achieve reasonable results.
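As a sketch of how such a constrained nnet setup looks, here is a minimal, self-contained example on synthetic data. The article's own data set and hyperparameter values are not reproduced; all names and numbers are illustrative, and decay is used here as a stand-in for the weight-related tuning mentioned above:

```r
library(nnet)   # ships with R as a recommended package

# Synthetic two-feature binary classification problem
set.seed(42)
n  <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, 1, 0))
d  <- data.frame(x1, x2, y)

fit <- nnet(y ~ x1 + x2, data = d,
            size  = 5,      # fixed number of nodes in the single hidden layer
            maxit = 200,    # limit on the number of iterations
            decay = 1e-3,   # weight decay, shrinking small input weights
            trace = FALSE)

p_hat <- predict(fit, d, type = "raw")          # predicted probabilities
mean(ifelse(p_hat > 0.5, 1, 0) == (d$y == 1))   # training accuracy
```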


And indeed, the findings are quite convincing. An overview of the results collected with this artificial neural network model tells the story.



decision benchmark p > 0.5

               Accuracy      Sensitivity      Specificity
------------I-------------I----------------I--------------- 
size=5:          0.745          0.784            0.607
size=9:          0.751          0.794            0.601
size=12:         0.720          0.743            0.642
size=14:         0.710          0.727            0.656


decision benchmark p > 0.42                                 

               Accuracy      Sensitivity      Specificity
------------I-------------I----------------I---------------
size=5:          0.673          0.661            0.713
size=9:          0.728          0.753            0.643
size=12:         0.674          0.669            0.695
size=14:         0.684          0.684            0.683


The best and most even result was given by the nnet-model with a decision benchmark of p > 0.42 and with 14 nodes in the hidden layer. More than two thirds of the cases in both classes (default/non-default) are predicted properly (the structure with the best findings is shown in the plot below).
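The decision benchmark simply shifts the probability cut-off before the class counts are computed. A base-R sketch with hypothetical predictions (here, for simplicity, 1 is taken as the positive class):

```r
# Hypothetical predicted probabilities and true classes:
p_hat  <- c(0.81, 0.44, 0.35, 0.60, 0.48, 0.20)
y_true <- c(1,    1,    0,    1,    0,    0)

classify <- function(p, benchmark) ifelse(p > benchmark, 1, 0)

metrics <- function(y_hat, y) {
  c(accuracy    = mean(y_hat == y),
    sensitivity = mean(y_hat[y == 1] == 1),   # true positive rate
    specificity = mean(y_hat[y == 0] == 0))   # true negative rate
}

metrics(classify(p_hat, 0.50), y_true)
metrics(classify(p_hat, 0.42), y_true)  # lower cut-off: more 1-predictions
```

Lowering the benchmark from 0.50 to 0.42 turns the borderline cases (0.44, 0.48) into 1-predictions, trading specificity for sensitivity, which is exactly the shift visible in the two tables above.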



Gradient Boosting Machines


Simply said, boosting is a general algorithm for building an ensemble out of simpler machine learning models. It is most effective when applied to models with high bias and low variance. Boosting can be applied to any type of model, although it is most effectively applied to decision trees. Gradient boosting machines establish an ensemble of shallow trees (which are rather weak predictive models) in sequence, with each tree learning from and improving on the previous one. Appropriately tuned, a powerful ensemble can be produced.


Boosting essentially attacks the bias side of the bias-variance trade-off. It starts with a weak model, in this case a decision tree with only a few splits, and sequentially builds up new trees. Each tree in the sequence tries to fix those areas where the previous tree made the biggest mistakes (i.e. the largest prediction errors). The cost function to be minimised can be anything from the mean square error to the mean absolute error and so on. In order to minimise the cost function, a gradient descent algorithm is at work, which can be performed on any function that is differentiable.
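The sequence of residual-fitting trees can be illustrated in a few lines of base R: a toy gradient booster on synthetic 1-D data, using one-split stumps as the weak learners. This is illustrative only, not the gbm implementation:

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Weak learner: a one-split "stump" fitted to the current residuals r
fit_stump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (cut in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- x <= cut
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

shrinkage <- 0.1                      # the learning rate
f <- rep(mean(y), length(y))          # start from a constant model
for (i in 1:100) {                    # n.trees = 100
  r <- y - f                          # residuals = negative gradient of MSE
  f <- f + shrinkage * predict_stump(fit_stump(x, r), x)
}
mean((y - f)^2)                       # training MSE, far below var(y)
```

Each pass fits a stump to what is still unexplained, and the shrinkage factor keeps any single stump from dominating, which is exactly the role of the shrinkage hyperparameter in gbm below.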


Again, the statistics software R offers a range of gradient boosting packages. Here, two of them are discussed, and we are focusing solely on the synthetically balanced data (for the reasons see Part Three).


Credit scoring with the gbm-package


Here is the implementation of the model:



# Gradient Boosting Machine (gbm)
#
# hyperparameters:
#   n.trees           = number of trees in the model
#   shrinkage         = the learning rate lambda
#   interaction.depth = number of splits per tree

library(gbm)
library(caret)     # for confusionMatrix()
library(magrittr)  # for the pipe operator

boost.fit.b <-
        gbm(default_payment ~ ., data = rose.train_set,
            distribution = "gaussian", n.trees = 100000,
            shrinkage = 0.005, interaction.depth = 8)

boost.predict.b <-
        predict(boost.fit.b, test, n.trees = 100000)

boost.y_hat.b <-
        ifelse(boost.predict.b > 0.5, 1, 0) %>% factor(levels = c(0, 1))

confusionMatrix(boost.y_hat.b, factor(test$default_payment))


Some of the hyperparameters in the model, i.e. the overall number of trees and the learning rate, were tuned and provided the following results:



 n.trees/shrinkage     Accuracy     Sensitivity     Specificity
--------------------I------------I---------------I---------------
    10000/0.005          0.550         0.470           0.829
    10000/0.01           0.628         0.592           0.751
    20000/0.005          0.611         0.564           0.774
    20000/0.01           0.635         0.604           0.746
    30000/0.005          0.629         0.593           0.753
    30000/0.01           0.650         0.628           0.728
    40000/0.005          0.634         0.600           0.751
    40000/0.01           0.651         0.628           0.732
    50000/0.005          0.632         0.607           0.746
    50000/0.01           0.655         0.634           0.726
   100000/0.005          0.648         0.625           0.731 
   100000/0.01           0.658         0.641           0.715 


With an increasing number of trees, the model finally settles at a Sensitivity of 64.1%, a Specificity of 71.5% and an overall Accuracy of 65.8%. It is worthwhile to take a glimpse at those features contributing most to the model results.



     var          rel.inf
-------------I---------------
      PAY_0      11.814792
   PAY_AMT2      11.514542
   PAY_AMT1       7.115962
      PAY_2       7.045782
      PAY_5       6.226774   
      PAY_3       6.059737
   PAY_AMT4       5.641118
      PAY_4       5.484497
   PAY_AMT3       5.111061
      PAY_6       5.021741
   PAY_AMT6       4.992688
   PAY_AMT5       4.409916
  BILL_AMT1       4.213171
  LIMIT_BAL       4.086520
   MARRIAGE       3.778397
  EDUCATION       3.760763
        SEX       3.722538



Given the discussion in Part Three (see the chapter Feature Variation/Principal Component Analysis), this should not come as too much of a surprise. It is also to be noted that, with an increasing number of trees in the model, the relative importance of the stronger input features decreases while the relative importance of the weaker input features moves up.


Credit Scoring with Extreme Gradient Boosting, the xgboost-package


If nnet is the ICE in the artificial neural network area, then xgboost is the Shinkansen in the gradient boosting arena. In principle, it is the same gradient descent algorithm framework, but the model takes care of a number of optimisations in the hardware and software environment (system optimisations: parallelisation, tree pruning, hardware optimisation like cache awareness by allocating internal buffers to store gradient statistics; algorithmic improvements: regularisation, sparsity awareness, weighted quantile sketch, cross-validation). The graph below makes clear where we stand with the xgboost machine learning tool (graph source: see References below).



The results come fast and are more than persuasive given the poor quality of the data set. In fact, after some tuning, the extreme gradient boosting model achieved balanced outcomes for both classes on an overall high level. Again, more than two thirds of the predictions are true positives (as shown below: Reference 'class 0'/Prediction 'class 0') and true negatives (as shown below: Reference 'class 1'/Prediction 'class 1').



           Accuracy: 0.6752
             95% CI: (0.6654, 0.6849)
        
        Sensitivity: 0.6749
        Specificity: 0.6765
         Prevalence: 0.7774
  Balanced Accuracy: 0.6757
         
   'Positive' Class: 0
         
   
                                          Reference
                            Prediction      0     1
                                     0   4722   648 
                                     1   2275  1355
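As a sanity check, the headline metrics can be recomputed directly from the four cells of the confusion matrix above ('class 0' being the positive class):

```r
# Cells of the confusion matrix above (rows = Prediction, cols = Reference):
TP <- 4722   # predicted 0, actually 0
FP <- 648    # predicted 0, actually 1
FN <- 2275   # predicted 1, actually 0
TN <- 1355   # predicted 1, actually 1

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

round(c(accuracy, sensitivity, specificity), 4)   # 0.6752 0.6749 0.6765
```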
                                     

Final remarks and where does Real Estate come in...


Well, I think we squeezed the lemon.


The table below gives an overview of all models handled in this series and applied to a synthetically balanced training data set of credit card clients.



Method                              Accuracy Sensitivity Specificity
-----------------------------------I--------I-----------I-----------
logistic regression                   0.694      0.716       0.617

k-nearest neighbour                   0.749      0.796       0.587
quadratic discriminant analysis       0.540      0.462       0.814
linear discriminant analysis          0.700      0.726       0.610
decision tree                         0.790      0.869       0.516
random forest                         0.652      0.631       0.726
support vector machines               0.750      0.793       0.601

neural network - neuralnet            0.704      0.716       0.662
neural network - nnet                 0.684      0.684       0.683

gradient boosting - gbm               0.658      0.641       0.715
extreme gradient boosting - xgboost   0.675      0.675       0.677


Logistic regression, as a relatively simple benchmark model, performed quite well. The best results, from my point of view, were provided by the neural networks (both models) and the extreme gradient boosting machine. One argument which speaks for the nnet-model as well as the xgboost-model is their computational speed. When implementing artificial neural networks, it has to be taken into account that they are much more data-hungry than e.g. gradient boosting models, which can also be applied to small- to mid-range data sizes.


The results have to be seen in the context of the available data quality. First of all, the data was imbalanced. This imbalance lies in the nature of loan exposures and could be remedied quite easily. Another, more serious, issue was the fact that the input features had a low correlation with the dependent variable 'default_payment'. At the same time, the correlation among the input parameters was partly very high. This mixture puts limits on the possible performance of the machine learning methods. And still, the more sophisticated models could predict more than two thirds of the observations in the test data properly, which is rather impressive.


Of course, this miniseries could only scratch the surface of machine learning topics. A lot more would have to be said with respect to e.g. the pre-processing and preparation of data, techniques of feature selection, the tuning of models or the creation of more sophisticated ensembles. But I think you got the point.


...real estate


Operating Cash Flow and short-term liquidity are crucial issues in the daily operations of a company. Real estate is no exception to this. Therefore, anything which could disrupt or negatively impact those areas is to be closely monitored. Tenant default is one of those disruptions.


Hence, tenant scoring is an easy way to cover this. In this context, tenant portfolios run on the same principles as credit portfolios. Default is the variable of interest. Tenant portfolios, as is the case with credit portfolios, (hopefully) have non-default as the clear majority class, therefore representing an imbalanced data set. With the right framework in place, internal company data can be generated in order to properly 'feed' those machine learning tools.


In other words, a tenant scoring model in combination with a liquidity risk module creates a powerful management information tool with low maintenance needs but highly effective and timely warning procedures.


References


Hands-On Machine Learning with R by Bradley Boehmke and Brandon M. Greenwell, 2020


neuralnet: Training of Neural Networks by Frauke Günther and Stefan Fritsch published in Contributed Research Articles/ June 1, 2010


Using neural networks for credit scoring: a simple example by Bart published in Rbloggers/ July 4, 2013


Neural Network Models in R by Avinash Navlani published in DataCamp/ December 9, 2019


Classifications in R: Response Modeling/ Credit Scoring/ Credit Rating using Machine Learning Techniques by Ariful Mondal/ September 20, 2016


Credit Risk Prediction Using Artificial Neural Network Algorithm by Shruti Goyal published in Data Science Central/ March 14, 2018


Tales of a Traveller - Journey through the woods of Artificial Intelligence and Machine Learning by Jai Sharma/ n.a.


Convolutional Neural Networks for Visual Recognition by Fei-Fei Li, Justin Johnson and Serena Yeung/ Stanford University (CS231n), Spring 2019


Statistics is Freaking Hard: WTF is Activation Function by Prateek published in towards data science/ August 16, 2017


Fitting a neural network in R; neuralnet package by Michy Alice published in Rbloggers/ September 23, 2015


Visualizing neural networks from the nnet package by beckmw published in Rbloggers/ March 4, 2013


ANN Classification with 'nnet' Package in R by Rizka Yolanda published in Medium/ June 26, 2019


Why is Stochastic Gradient Descent...? by Abhishek Mehta published in Medium/ September 11, 2019


Gradient boosting in R by Anish Singh Walia published in datascience+/ February 16, 2018


XGBoost Documentation/ n.a.


XGBoost Algorithm: Long May She Reign! by Vishal Morde published in towards data science/ April 8, 2019


Ridge and Lasso Regression: L1 and L2 Regularization by Saptashwa Bhattacharyya published in towards data science/ September 26, 2018