New Generation in Credit Scoring: Machine Learning (Part 4)
Updated: May 8, 2020
Here we are again. This miniseries explores what machine learning tools can do for credit scoring in financial institutions. Part One described the composition of the data set and how to pre-process the data for classifying clients into defaulting and non-defaulting borrowers. Logistic regression was applied as a basic classification tool, and its results served as a benchmark for the machine learning instruments discussed later on.
Part Two worked through a range of machine learning tools and showed two things. First, looking at 'Accuracy' alone is not enough when comparing different machine learning approaches. Second, the results achieved were mediocre, to put it mildly, and in some cases no better than the logistic regression outcome. Only Quadratic Discriminant Analysis and Support Vector Machines turned out to be positive exceptions. Part Three finally uncovered the reasons for those lukewarm findings and tried different paths to circumvent the issues. Much better results could then be achieved with most of the machine learning algorithms.
Now, equipped with the outcomes of the last three parts, this part takes a dive into neural networks and gradient boosting machines. And, as this miniseries comes to an end, we will have a short talk on how the basic findings of those credit scoring models can be used in the business field of real estate.
In short, a neural network has an input layer, one or more hidden layers and an output layer. Every layer consists of neurons (called nodes in the case of an artificial neural network). For every input feature there is a node on the input layer. The output layer has one node for binary classification, or several nodes for a multi-class problem. In a supervised learning approach the neural network is fed with the input factors as well as the observed values of the dependent variable. It tries to extract linear as well as non-linear relationships by setting a different weight and bias for each node. In other words, observed data is used to train the neural network, and the network learns an approximation of the relationship by iteratively adapting its parameters.
There are different ways in which the nodes can communicate with each other. Hence there are different kinds of neural networks (e.g. feed forward neural networks, recurrent neural networks, deep convolutional networks or long short term memory networks, to name some of the pathways). The principle of a network, however, is always the same: each node uses weight and bias to transform received input into an output, while an activation function determines if a specific node should be activated and therefore release information. Here is a graphical summary of a neural network (graphic source: see References below):
Depending on the task of the neural network, different kinds of activation functions can be incorporated, like the Sigmoid function, the Tanh function or the ReLU (rectified linear unit) function. As the model is trained to fit the real data as well as possible, its weights and biases are adjusted with the aim of minimising a cost function. The cost function can be defined as e.g. the Mean Square Error, the Mean Absolute Error or a Multi Class Cross Entropy loss.
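To make the mechanics above a bit more tangible, here is a minimal sketch in base R of a single forward pass through a network with one hidden layer. It is an illustration only, not the setup used in this series: the layer sizes, the random weights and the five made-up observations are all assumptions, and no training takes place.

```r
# Activation functions mentioned in the text (tanh() is already built into base R)
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(0, z)

# Two of the cost functions mentioned in the text
mse           <- function(y, p) mean((y - p)^2)
cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# Hypothetical toy network: 3 input features, 4 hidden nodes, 1 output node
set.seed(1)
X  <- matrix(rnorm(5 * 3), nrow = 5)   # 5 made-up observations
y  <- c(0, 1, 0, 1, 1)                 # made-up binary labels
W1 <- matrix(rnorm(3 * 4), nrow = 3)   # weights input -> hidden
b1 <- rnorm(4)                         # one bias per hidden node
W2 <- matrix(rnorm(4), nrow = 4)       # weights hidden -> output
b2 <- rnorm(1)                         # bias of the output node

# Each node: weighted sum of its inputs plus bias, passed through the activation
hidden <- sigmoid(sweep(X %*% W1, 2, b1, "+"))
p_hat  <- as.vector(sigmoid(hidden %*% W2 + b2))   # value between 0 and 1 per observation

# Training would now adjust W1, b1, W2, b2 to push one of these values down
c(mse = mse(y, p_hat), cross_entropy = cross_entropy(y, p_hat))
```

Swapping `sigmoid` for `tanh` or `relu` in the hidden layer changes only the line that computes `hidden`.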
Credit scoring with the neuralnet-package
The statistics software R offers different paths of working with neural networks. One of those possibilities is the neuralnet-package. This is a feedforward neural network with a resilient back-propagation algorithm. The package is quite flexible in modelling the network, but it also takes some computational time to train the model.
With the synthetically balanced data set in place (training on the imbalanced data set didn't achieve any suitable results; see also Part Three for more details), a first setup was roughly established with one hidden layer consisting of 9 nodes and mean square error as cost function. Given the available data quality and especially by contrast with other machine learning approaches, the results are promising:
Accuracy: 0.668 Sensitivity: 0.655 Specificity: 0.715
Incorporating cross entropy as cost function, the results are as follows:
Accuracy: 0.727 Sensitivity: 0.756 Specificity: 0.629
The results are not bad, but a defaulting client who is wrongly accepted is more costly than a good client lost through misclassification. With non-default as the positive class, specificity is the share of actual defaults that the model recognises, so the mean square error setup with its higher specificity (0.715 versus 0.629) seems more adequate in this case. In a second setup, the network was modelled with two hidden layers (mean square error still as cost function):
Accuracy: 0.704 Sensitivity: 0.716 Specificity: 0.662
The true negative rate of 66.2% is high, while the true positive rate could be kept above 70%. This is a suitable outcome, but the price is a lot of computational time spent on training the model. Here is an impression of this artificial neural network model:
Credit scoring with the nnet-package
This is the high-speed version of an artificial neural network. The price for this speed is that the model offers far fewer tuning options, e.g. only one hidden layer is available. Still, by fixing the number of nodes in the hidden layer, limiting the number of possible iterations and setting the benchmark for weights on input data to be further considered, there should be enough flexibility to achieve reasonable results.
And indeed, the findings are quite convincing. An overview of results collected with this artificial neural network model tells the story.
decision benchmark p > 0.5
             Accuracy      Sensitivity      Specificity
------------I-------------I----------------I---------------
size=5:      0.745         0.784            0.607
size=9:      0.751         0.794            0.601
size=12:     0.720         0.743            0.642
size=14:     0.710         0.727            0.656

decision benchmark p > 0.42
             Accuracy      Sensitivity      Specificity
------------I-------------I----------------I---------------
size=5:      0.673         0.661            0.713
size=9:      0.728         0.753            0.643
size=12:     0.674         0.669            0.695
size=14:     0.684         0.684            0.683
The best and most balanced result was given by the nnet-model with a decision benchmark of p > 0.42 and 14 nodes in the hidden layer. More than two thirds of the cases in both classes (default/non-default) are predicted properly (the structure with the best findings in the plot below).
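The effect of moving the decision benchmark can be sketched in a few lines of base R. The probabilities below are made up, and the convention is an assumption taken from the caret output later in this part: p is read as the predicted probability of default (class 1), and non-default (class 0) is the positive class. The helper `metrics_at` is hypothetical and not part of any package.

```r
# Accuracy, sensitivity and specificity at a given decision benchmark.
# Assumption: class 0 (non-default) is the positive class.
metrics_at <- function(p_default, actual, cutoff) {
  predicted <- ifelse(p_default > cutoff, 1, 0)
  c(accuracy    = mean(predicted == actual),
    sensitivity = mean(predicted[actual == 0] == 0),   # non-defaults recognised
    specificity = mean(predicted[actual == 1] == 1))   # defaults recognised
}

# Made-up data for illustration only: roughly 22% defaults, weakly separated scores
set.seed(42)
actual    <- rbinom(1000, 1, 0.22)
p_default <- plogis(rnorm(1000, mean = ifelse(actual == 1, 0.3, -0.6)))

rbind("p > 0.50" = metrics_at(p_default, actual, 0.50),
      "p > 0.42" = metrics_at(p_default, actual, 0.42))
```

Lowering the benchmark classifies more clients as defaulting, so specificity can only rise and sensitivity can only fall. Which trade-off is acceptable depends on the relative cost of the two kinds of error.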
Gradient Boosting Machines
Simply said, boosting is a general algorithm for building an ensemble out of simpler machine learning models. It is most effective when applied to models with high bias and low variance. Boosting can be applied to any type of model, although it is most effectively applied to decision trees. Gradient boosting machines build an ensemble of shallow trees (which are rather weak predictive models) in sequence, with each tree learning from and improving on the previous one. Appropriately tuned, a powerful ensemble can be produced.
Boosting is essentially a way of handling the bias-variance trade-off. It starts with a weak model, in this case a decision tree with only a few splits, and sequentially builds up new trees. Each tree in the sequence tries to fix those areas where the previous trees made the biggest mistakes (i.e. the largest prediction errors). The cost function to be minimised can be anything from mean square error to mean absolute error and so on. To minimise the cost function, a gradient descent algorithm is at work, which can be performed on any function that is differentiable.
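The sequential idea can be sketched by hand in base R. This is a deliberately stripped-down illustration under stated assumptions, not what the gbm- or xgboost-package does internally: the data is made up and one-dimensional, each "tree" is a stump with a single split, the cost function is squared error, and the helpers `fit_stump` and `predict_stump` are hypothetical.

```r
# Fit a stump: the single split on x that minimises the squared error of the targets r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbouring x values
  sse <- sapply(candidates, function(s) {
    left  <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Made-up data for illustration only
set.seed(7)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

shrinkage <- 0.1                        # learning rate lambda
n_trees   <- 200
pred      <- rep(mean(y), length(y))    # start from a constant model
mse_path  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  residuals <- y - pred                               # negative gradient of 1/2 * squared error
  stump     <- fit_stump(x, residuals)                # next weak learner targets the current errors
  pred      <- pred + shrinkage * predict_stump(stump, x)
  mse_path[m] <- mean((y - pred)^2)
}

# Training error of the constant model versus the boosted ensemble
c(start = mean((y - mean(y))^2), end = mse_path[n_trees])
```

With squared error the negative gradient is simply the residual, which is why each new stump is fitted to what the ensemble still gets wrong. A smaller `shrinkage` needs more trees to reach the same training error.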
Again, the statistics software R offers a range of gradient boosting packages. Here, two packages are discussed, and we focus solely on the synthetically balanced data (for the reasons see Part Three).
Credit scoring with the gbm-package
Here is the implementation of the model:
Gradient Boosting Machine

library(gbm)       # gbm()
library(caret)     # confusionMatrix()
library(magrittr)  # %>% pipe

boost.fit.b <- gbm(default_payment ~ ., data = rose.train_set,
                   distribution = "gaussian", n.trees = 100000,
                   shrinkage = 0.005, interaction.depth = 8)

with the following hyperparameters:
n.trees = number of trees in the model
shrinkage = the learning rate lambda
interaction.depth = number of splits per tree

# predict on the test set and classify as default (1) above the 0.5 benchmark
boost.predict.b <- predict(boost.fit.b, test, n.trees = 100000)
boost.y_hat.b <- ifelse(boost.predict.b > 0.5, 1, 0) %>% factor(levels = c(0, 1))
confusionMatrix(boost.y_hat.b, factor(test$default_payment))
Some of the hyperparameters in the model, i.e. the overall number of trees and the learning rate, were tuned and provided the following results:
n.trees/shrinkage    Accuracy     Sensitivity     Specificity
--------------------I------------I---------------I---------------
10000/0.005          0.550        0.470           0.829
10000/0.01           0.628        0.592           0.751
20000/0.005          0.611        0.564           0.774
20000/0.01           0.635        0.604           0.746
30000/0.005          0.629        0.593           0.753
30000/0.01           0.650        0.628           0.728
40000/0.005          0.634        0.600           0.751
40000/0.01           0.651        0.628           0.732
50000/0.005          0.632        0.607           0.746
50000/0.01           0.655        0.634           0.726
100000/0.005         0.648        0.625           0.731
100000/0.01          0.658        0.641           0.715
As the number of trees increases, the model finally settles at a sensitivity of 64.1%, a specificity of 71.5% and an overall accuracy of 65.8%. It is worthwhile to take a glimpse at the features contributing most to the model results.
var           rel.inf
-------------I---------------
PAY_0         11.814792
PAY_AMT2      11.514542
PAY_AMT1       7.115962
PAY_2          7.045782
PAY_5          6.226774
PAY_3          6.059737
PAY_AMT4       5.641118
PAY_4          5.484497
PAY_AMT3       5.111061
PAY_6          5.021741
PAY_AMT6       4.992688
PAY_AMT5       4.409916
BILL_AMT1      4.213171
LIMIT_BAL      4.086520
MARRIAGE       3.778397
EDUCATION      3.760763
SEX            3.722538
Given the discussion in Part Three (see chapter Feature Variation/ Principal Component Analysis), this should not come as too much of a surprise. It is also worth noting that as the number of trees in the model increases, the relative importance of the stronger input features decreases while the relative importance of the weaker input features moves up.
Credit Scoring with Extreme Gradient Boosting, the xgboost-package
If nnet is the ICE of the artificial neural network area, then xgboost is the Shinkansen of the gradient boosting machine arena. In principle, it is the same gradient descent algorithm framework, but the package adds a number of optimisations in the hardware and the software environment (system optimisations: parallelisation, tree pruning, hardware optimisation like cache awareness by allocating internal buffers to store gradient statistics; algorithmic improvements: regularisation, sparsity awareness, weighted quantile sketch, cross validation). The graph below makes clear where we stand with the xgboost machine learning tool (graph source: see References below).
The results come fast and are more than persuasive given the poor quality of the data set. In fact, after some tuning the extreme gradient boosting model achieved balanced outcomes for both classes at an overall high level. Again, more than two thirds of the observations in each class are predicted correctly: true positives (as shown below: Reference 'class 0'/ Prediction 'class 0') and true negatives (as shown below: Reference 'class 1'/ Prediction 'class 1').
Accuracy: 0.6752
95% CI: (0.6654, 0.6849)
Sensitivity: 0.6749
Specificity: 0.6765
Prevalence: 0.7774
Balanced Accuracy: 0.6757
'Positive' Class: 0

          Reference
Prediction    0    1
         0 4722  648
         1 2275 1355
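For readers who want to retrace how these figures follow from the four cells of the confusion matrix, here is a small base R sketch. It only re-derives the reported numbers from the counts above, with class 0 as the positive class as stated in the output. The variable names are chosen for illustration.

```r
# Confusion matrix as reported above (rows = prediction, columns = reference)
cm <- matrix(c(4722, 2275, 648, 1355), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions over all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # actual class 0 predicted as class 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # actual class 1 predicted as class 1
prevalence  <- sum(cm[, "0"]) / sum(cm)         # share of the positive class (0) in the test data
balanced    <- (sensitivity + specificity) / 2  # mean of the two class-wise rates

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        prevalence = prevalence, balanced_accuracy = balanced), 4)
```

Because class 0 makes up more than three quarters of the test data, overall accuracy sits very close to sensitivity. Balanced accuracy weights both classes equally.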
Final remarks and where does Real Estate come in...
Well, I think we squeezed the lemon.
The table below gives an overview of all models handled in this series and applied to a synthetically balanced training data set of credit card clients.
Method                                Accuracy   Sensitivity   Specificity
-------------------------------------I----------I-------------I-------------
logistic regression                   0.694      0.716         0.617
k-nearest neighbour                   0.749      0.796         0.587
quadratic discriminant analysis       0.540      0.462         0.814
linear discriminant analysis          0.700      0.726         0.610
decision tree                         0.790      0.869         0.516
random forest                         0.652      0.631         0.726
support vector machines               0.750      0.793         0.601
neural network - neuralnet            0.704      0.716         0.662
neural network - nnet                 0.684      0.684         0.683
gradient boosting - gbm               0.658      0.641         0.715
extreme gradient boosting - xgboost   0.675      0.675         0.677
Logistic regression, as a relatively simple benchmark model, performed quite well. The best results, in my view, were provided by the neural networks (both models) and the extreme gradient boosting machine. One argument in favour of the nnet-model as well as the xgboost-model is their computational speed. When implementing artificial neural networks, it has to be taken into account that they are much more data hungry than e.g. gradient boosting models, which can also be applied to small and mid-sized data sets.
The results have to be seen in the context of the available data quality. First of all, the data was imbalanced. This imbalance lies in the nature of loan exposures and could be remedied quite easily. Another, more serious issue was that the input features had a low correlation with the dependent variable 'default_payment'. At the same time, the correlation among the input features was partly very high. This mixture puts limits on the possible performance of the machine learning tools. And still, the more sophisticated models could properly predict more than two thirds of the observations in the test data, which is rather impressive.
Of course, this miniseries could only scratch the surface of machine learning. A lot more would have to be said with respect to e.g. the pre-processing and preparation of data, techniques of feature selection, the tuning of models or the creation of more sophisticated ensembles. But I think you get the point.
Operating Cash Flow and short-term liquidity are crucial issues in the daily operations of a company. Real estate is no exception to this. Therefore, anything which could disrupt or negatively impact those areas is to be closely monitored. Tenant default is one of those disruptions.
Hence, tenant scoring can be an easy way to cover this. In this context, tenant portfolios run on the same principles as credit portfolios. Default is the variable of interest. Tenant portfolios, as is the case with credit portfolios, (hopefully) have non-default as the clear majority class and therefore represent an imbalanced data set. With the right framework in place, internal company data can be generated in order to properly 'feed' those machine learning tools.
In other words, a tenant scoring model in combination with a liquidity risk module creates a powerful management information tool with low maintenance needs but highly effective and timely warning procedures.
References
Hands-On Machine Learning with R by Bradley Boehmke and Brandon M. Greenwell, 2020
neuralnet: Training of Neural Networks by Frauke Günther and Stefan Fritsch published in Contributed Research Articles/ June 1, 2010
Using neural networks for credit scoring: a simple example by Bart published in Rbloggers/ July 4, 2013
Neural Network Models in R by Avinash Navlani published in DataCamp/ December 9, 2019
Classifications in R: Response Modeling/ Credit Scoring/ Credit Rating using Machine Learning Techniques by Ariful Mondal/ September 20, 2016
Credit Risk Prediction Using Artificial Neural Network Algorithm by Shruti Goyal published in Data Science Central/ March 14, 2018
Tales of a Traveller - Journey through the woods of Artificial Intelligence and Machine Learning by Jai Sharma/ n.a.
Convolutional Neural Networks for Visual Recognition by Fei-Fei Li, Justin Johnson and Serena Yeung/ Stanford University (CS231n), Spring 2019
Statistics is Freaking Hard: WTF is Activation Function by Prateek published in towards data science/ August 16, 2017
Fitting a neural network in R; neuralnet package by Michy Alice published in Rbloggers/ September 23, 2015
Visualizing neural networks from the nnet package by beckmw published in Rbloggers/ March 4, 2013
ANN Classification with 'nnet' Package in R by Rizka Yolanda published in Medium/ June 26, 2019
Why is Stochastic Gradient Descent...? by Abhishek Mehta published in Medium/ September 11, 2019
Gradient boosting in R by Anish Singh Walia published in datascience+/ February 16, 2018
XGBoost Documentation/ n.a.
XGBoost Algorithm: Long May She Reign! by Vishal Morde published in towards data science/ April 8, 2019
Ridge and Lasso Regression: L1 and L2 Regularization by Saptashwa Bhattacharyya published in towards data science/ September 26, 2018