- Christian Schitton

# New Generation in Credit Scoring: Machine Learning (Part 4)

Updated: May 8, 2020

**Intro**

Here we are again. This is a miniseries focusing on the possibilities of **machine learning** tools in the **credit scoring** of financial institutions. Part One of this series talked about the composition of the data set and how to pre-process this data for the purpose of classifying clients into defaulting and non-defaulting borrowers. Logistic Regression was applied as a basic classification tool. The regression results gave a benchmark for the machine learning instruments to be discussed later on.

Part Two of this series handled a range of machine learning tools and revealed two things. On the one hand, it is not enough to look at 'Accuracy' alone when comparing the success of different machine learning approaches. On the other hand, the results achieved were mediocre, to put it mildly, and in some cases they were no better than the logistic regression outcome. Only the Quadratic Discriminant Analysis and the Support Vector Machines turned out to be positive exceptions. Part Three finally uncovered the reasons for those lukewarm findings and tried different paths to circumvent those issues. Much better results could then be achieved with most of the machine learning algorithms.

Now, equipped with the outcomes of the last three parts, this section takes a dive into the area of **neural networks** and **gradient boosting machines**. And, as we come to the end of this miniseries, we will have a short talk on how the basic **findings** in those credit scoring models **can be used in the business field of real estate**.

**Neural Networks**

In short, a neural network has an input **layer**, one or more hidden layers and an output layer. Every layer consists of neurons (called nodes in the case of an artificial neural network). So, for every input feature there is a **node** on the input layer, and there is one node on the output layer for binary classification or several nodes in the case of a multi-class problem. In a supervised learning approach the neural network is fed with the input factors as well as the results of the dependent variable and tries to extract linear, but also non-linear, relationships by setting a different **weight** and **bias** for each respective node. In other words, observed data is used to train the neural network, and the neural network learns an approximation of the relationship by iteratively adapting its parameters.
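To make the role of weight and bias more tangible, here is a minimal sketch of a single forward pass through a network with one hidden layer, written in base R. All numbers, the layer sizes and the helper names `sigmoid` and `forward_pass` are assumptions for illustration only; none of them are taken from the credit scoring models in this series.

```r
# Sigmoid squashes any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# One forward pass: input -> hidden layer -> single output node
forward_pass <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)       # each hidden node: weighted sum plus bias
  output <- sigmoid(W2 %*% hidden + b2)  # output node for binary classification
  as.numeric(output)
}

set.seed(1)
n_input  <- 3   # hypothetical number of input features
n_hidden <- 4   # hypothetical number of hidden nodes

# Randomly initialised parameters; training would adapt these iteratively
W1 <- matrix(rnorm(n_hidden * n_input), nrow = n_hidden)
b1 <- rnorm(n_hidden)
W2 <- matrix(rnorm(n_hidden), nrow = 1)
b2 <- rnorm(1)

x <- c(0.2, -1.0, 0.5)                   # one made-up observation
p <- forward_pass(x, W1, b1, W2, b2)     # a value between 0 and 1
```

Training then consists of repeating such passes and adjusting `W1`, `b1`, `W2` and `b2` so that the outputs move closer to the observed classes.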

There are different ways in which the nodes are able to communicate with each other. Hence there are different kinds of neural networks (e.g. feed forward neural networks, recurrent neural networks, deep convolutional networks or long short term memory networks, to name some of the variants). However, the principle of a network is always the same: each node uses weight and bias to transform received input into an output, while an **activation function** determines if a specific node should be activated and therefore release information. Here is a graphical summary of a neural network (graphic source: see References below):

Depending on the task of the neural network, there are different kinds of activation functions which can be incorporated, like the Sigmoid function, the Tanh function or the ReLU (rectified linear unit) function. And as the model is trained to best fit the real data, its parameters are adjusted with the aim to minimise a **cost function**. The cost function can be e.g. the Mean Square Error, the Mean Absolute Error or a Multi-Class Cross Entropy loss.
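The functions named above are short enough to write down directly. The following base R sketch defines them under a few assumptions: the function names are my own, the cross entropy is shown in its binary form (the two-class special case of the multi-class loss), and the small `eps` clipping is only there to avoid taking the logarithm of zero.

```r
# Activation functions (tanh is already built into R)
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(0, z)

# Cost functions: y = observed classes (0/1), p = predicted probabilities
mse <- function(y, p) mean((y - p)^2)
mae <- function(y, p) mean(abs(y - p))
cross_entropy <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)          # keep p away from exactly 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Made-up example values
y <- c(0, 1, 1, 0)
p <- c(0.1, 0.8, 0.6, 0.3)

c(mse = mse(y, p), mae = mae(y, p), cross_entropy = cross_entropy(y, p))
```

Cross entropy punishes confident but wrong probabilities much harder than the mean square error does, which is one reason why the two cost functions can lead to differently balanced models.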

**Credit scoring with the neuralnet-package**

The statistics software R offers different paths of working with neural networks. One of those possibilities is the neuralnet-package. This is a feedforward neural network with a resilient back-propagation algorithm. The package is **quite flexible in modelling the network**, but it also **takes some computational time** to train the model.

With the synthetically balanced data set in place (training on the imbalanced data set didn't achieve any suitable results; see also Part Three for more details), a first setup was roughly established with one hidden layer consisting of 9 nodes and mean square error as cost function. Given the available data quality and especially by contrast with other machine learning approaches, the results are promising:

```
Accuracy: 0.668
Sensitivity: 0.655
Specificity: 0.715
```

Incorporating cross entropy as cost function, the results are as follows:

```
Accuracy: 0.727
Sensitivity: 0.756
Specificity: 0.629
```

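The results are not bad. But a client's default is more costly than losing a potential client due to wrong classification, so the share of correctly identified defaulters matters most. With non-default (class 0) as the 'positive' class, as in the confusion matrix output further below, this share is the Specificity. At 0.715 against 0.629, using mean square error as cost function seems more adequate in this case. In a second setup, the network was modelled with two hidden layers (mean square error still as cost function):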

```
Accuracy: 0.704
Sensitivity: 0.716
Specificity: 0.662
```

The true negative rate of 66.2% is high, while the true positive rate could be kept above 70%. This is a suitable outcome, but the price is a lot of computational time spent on training the model. Here is an impression of this artificial neural network model:

**Credit scoring with the nnet-package**

This is the **high-speed version** of an artificial neural network. The price for this speed is that the model has far **fewer tuning options**, e.g. there is only one hidden layer available. Still, by fixing the number of nodes in the hidden layer, limiting the number of possible iterations and setting the benchmark for weights on input data to be further considered, there should be enough flexibility to achieve reasonable results.
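As a rough sketch of how such a model can be set up, the following example fits an nnet-model on a small simulated data set. The toy data, the chosen values and the object names are assumptions for illustration and have nothing to do with the credit card data. `size` and `maxit` correspond to the number of hidden nodes and the iteration limit mentioned above; `decay` (weight decay) is shown as one further tuning option and is not necessarily the weight setting used for the results in this article.

```r
library(nnet)  # recommended package, ships with standard R installations

set.seed(42)

# Hypothetical toy data standing in for the real data set
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$default_payment <- factor(
  as.integer(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0.8)
)

train_idx <- sample(n, 300)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- nnet(default_payment ~ ., data = train,
            size  = 5,     # number of nodes in the single hidden layer
            maxit = 200,   # upper limit for training iterations
            decay = 0.01,  # weight decay, an assumed regularisation setting
            trace = FALSE)

# For a two-level factor, type = "raw" gives the probability of the second level ("1")
p_default <- as.numeric(predict(fit, newdata = test, type = "raw"))

# Decision benchmark: classify as default when p exceeds the threshold
threshold <- 0.5
y_hat <- factor(ifelse(p_default > threshold, 1, 0), levels = c(0, 1))

table(Prediction = y_hat, Reference = test$default_payment)
```

Lowering `threshold` classifies more clients as defaulting, which trades Sensitivity for Specificity when class 0 is treated as the positive class.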

And indeed, the findings are quite convincing. An overview of results collected with this artificial neural network model tells the story.

```
decision benchmark p > 0.5
            Accuracy      Sensitivity      Specificity
------------I-------------I----------------I---------------
size=5:      0.745         0.784            0.607
size=9:      0.751         0.794            0.601
size=12:     0.720         0.743            0.642
size=14:     0.710         0.727            0.656

decision benchmark p > 0.42
            Accuracy      Sensitivity      Specificity
------------I-------------I----------------I---------------
size=5:      0.673         0.661            0.713
size=9:      0.728         0.753            0.643
size=12:     0.674         0.669            0.695
size=14:     0.684         0.684            0.683
```

The best and most even result was given by the nnet-model with a decision benchmark of p > 0.42 and with 14 nodes in the hidden layer. **More than two thirds of the cases in both classes** (default/ non-default) **are predicted properly** (the structure with the best findings in the plot below).

**Gradient Boosting Machines**

Simply said, boosting is a general **algorithm for building an ensemble out of simpler machine learning models**. It is most effective when applied to models with high bias and low variance. Boosting can be applied to any type of model, although it is **most effectively applied to decision trees**. Gradient boosting machines establish an ensemble of shallow trees (which are rather weak predictive models) in sequence, with each tree learning from and improving on the previous one. Appropriately tuned, a powerful ensemble can be produced.

Boosting essentially means the **handling of the bias-variance-tradeoff**. So it starts with a weak model, i.e. in this case a decision tree with only a few splits, and sequentially builds up new trees. Each tree in the sequence tries to fix those areas where the previous tree made the biggest mistakes (i.e. the largest prediction errors). The cost function to be minimised can be anything from mean square error to mean absolute error and so on. In order to minimise the cost function, a gradient descent algorithm is at work which can be performed on any function that is differentiable.
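The sequential idea can be sketched in a few lines of base R. This is a deliberately simplified illustration and not the algorithm as implemented in any R package: it uses a single input feature, trees with exactly one split (stumps), mean square error as cost function and simulated data. The helper names `fit_stump`, `predict_stump` and `boost` are made up for this example.

```r
# Fit a stump: find the single split on x that minimises the squared error
fit_stump <- function(x, r) {
  xs     <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  best   <- list(sse = Inf)
  for (s in splits) {
    left  <- r[x <= s]
    right <- r[x > s]
    sse   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Boosting: each new stump is fitted to the residuals of the current ensemble
boost <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  f0     <- mean(y)                       # start with a constant prediction
  pred   <- rep(f0, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    residual    <- y - pred               # negative gradient of the squared error
    stumps[[m]] <- fit_stump(x, residual)
    pred        <- pred + shrinkage * predict_stump(stumps[[m]], x)
  }
  list(f0 = f0, stumps = stumps, shrinkage = shrinkage, train_pred = pred)
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

model     <- boost(x, y)
mse_start <- mean((y - model$f0)^2)          # error of the constant model
mse_end   <- mean((y - model$train_pred)^2)  # error after 100 stumps
```

Each stump on its own is a weak model, but the training error shrinks step by step as stumps are added. A smaller `shrinkage` slows this down, which is why a low learning rate usually goes hand in hand with a large number of trees.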

Again, the **statistics software R** offers a range of gradient boosting packages. Here, two packages are discussed and we are focussing solely on the synthetically balanced data (for reasons see Part Three).

**Credit scoring with the gbm-package**

Here is the implementation of the model:

```
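# Gradient Boosting Machine
library(gbm)       # gbm()
library(caret)     # confusionMatrix()
library(magrittr)  # %>%

# Hyperparameters:
#   n.trees           = number of trees in the model
#   shrinkage         = the learning rate lambda
#   interaction.depth = number of splits per tree
# distribution = "gaussian" means squared-error loss on the numeric 0/1 target
boost.fit.b <- gbm(default_payment ~ ., data = rose.train_set,
                   distribution = "gaussian", n.trees = 100000,
                   shrinkage = 0.005, interaction.depth = 8)

# Predict on the test set using all trees of the ensemble
boost.predict.b <- predict(boost.fit.b, test, n.trees = 100000)

# Classify as default (1) when the predicted value exceeds 0.5
boost.y_hat.b <- ifelse(boost.predict.b > 0.5, 1, 0) %>%
  factor(levels = c(0, 1))

confusionMatrix(boost.y_hat.b, factor(test$default_payment))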
```

Some of the hyperparameters in the model, i.e. the overall number of trees and the learning rate, were tuned and provided the following results:

```
n.trees/shrinkage Accuracy Sensitivity Specificity
--------------------I------------I---------------I---------------
10000/0.005 0.550 0.470 0.829
10000/0.01 0.628 0.592 0.751
20000/0.005 0.611 0.564 0.774
20000/0.01 0.635 0.604 0.746
30000/0.005 0.629 0.593 0.753
30000/0.01 0.650 0.628 0.728
40000/0.005 0.634 0.600 0.751
40000/0.01 0.651 0.628 0.732
50000/0.005 0.632 0.607 0.746
50000/0.01 0.655 0.634 0.726
100000/0.005 0.648 0.625 0.731
100000/0.01 0.658 0.641 0.715
```

As the number of trees increases, the model finally settles at a Sensitivity of 64.1%, a Specificity of 71.5% and an overall Accuracy of 65.8%. It is worthwhile to take a glimpse at those **features contributing most to the model results**.

```
var rel.inf
-------------I---------------
PAY_0 11.814792
PAY_AMT2 11.514542
PAY_AMT1 7.115962
PAY_2 7.045782
PAY_5 6.226774
PAY_3 6.059737
PAY_AMT4 5.641118
PAY_4 5.484497
PAY_AMT3 5.111061
PAY_6 5.021741
PAY_AMT6 4.992688
PAY_AMT5 4.409916
BILL_AMT1 4.213171
LIMIT_BAL 4.086520
MARRIAGE 3.778397
EDUCATION 3.760763
SEX 3.722538
```

Given the discussion in Part Three (see chapter Feature Variation/ Principal Component Analysis), this should not come as too much of a surprise. It is also to be noted that as the number of trees in the model increases, the relative importance of stronger input features decreases while the relative importance of weaker input features moves up.

**Credit Scoring with Extreme Gradient Boosting, the xgboost-package**

If nnet is the ICE in the artificial neural network area, then xgboost is the Shinkansen in the gradient boosting machine arena. In principle, it is the **same gradient descent algorithm framework, but** the model **takes care of some optimisations in the hardware and the software environment** (system optimisations: parallelisation, tree pruning, hardware optimisation like cache awareness by allocating internal buffers to store gradient statistics; algorithmic improvements: regularisation, sparsity awareness, weighted quantile sketch, cross validation). The graph below makes clear where we stand with the xgboost machine learning tool (graph source: see References below).

The results come fast and are more than persuasive given the poor quality of the data set. In fact, after some tuning the **extreme gradient boosting model achieved balanced outcomes for both classes on an overall high level**. Again, more than 2/3 of the cases in each class are predicted correctly: as true positives (as shown below: Reference 'class 0'/ Prediction 'class 0') and as true negatives (as shown below: Reference 'class 1'/ Prediction 'class 1'), respectively.

```
Accuracy: 0.6752
95% CI: (0.6654, 0.6849)
Sensitivity: 0.6749
Specificity: 0.6765
Prevalence: 0.7774
Balanced Accuracy: 0.6757
'Positive' Class: 0

          Reference
Prediction    0    1
         0 4722  648
         1 2275 1355
```
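The summary figures above follow directly from the four cells of the confusion matrix. As a short worked example, they can be recomputed in base R, with class 0 (non-default) as the positive class as stated in the output. The variable names are my own.

```r
# Confusion matrix from the output above (rows: prediction, columns: reference)
cm <- matrix(c(4722, 2275, 648, 1355), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["0", "0"]  # non-defaulters predicted as non-default
fn <- cm["1", "0"]  # non-defaulters predicted as default
fp <- cm["0", "1"]  # defaulters predicted as non-default
tn <- cm["1", "1"]  # defaulters predicted as default

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)   # share of non-defaulters found
specificity       <- tn / (tn + fp)   # share of defaulters found
prevalence        <- (tp + fn) / sum(cm)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Prevalence = prevalence,
        BalancedAccuracy = balanced_accuracy), 4)
```

The 648 defaulters classified as non-default are the costly cases for a lender, which is why Specificity deserves at least as much attention as Accuracy here.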

**Final remarks and where does Real Estate come in...**

Well, I think we squeezed the lemon.

The table below gives an **overview** of all models handled in this series and applied to a synthetically balanced training data set of credit card clients.

```
Method                               Accuracy  Sensitivity  Specificity
------------------------------------I---------I------------I-----------
logistic regression                  0.694     0.716        0.617
k-nearest neighbour                  0.749     0.796        0.587
quadratic discriminant analysis      0.540     0.462        0.814
linear discriminant analysis         0.700     0.726        0.610
decision tree                        0.790     0.869        0.516
random forest                        0.652     0.631        0.726
support vector machines              0.750     0.793        0.601
neural network - neuralnet           0.704     0.716        0.662
neural network - nnet                0.684     0.684        0.683
gradient boosting - gbm              0.658     0.641        0.715
extreme gradient boosting - xgboost  0.675     0.675        0.677
```

Logistic regression as a relatively simple benchmark model did perform quite well. The **best results**, in my point of view, were provided by neural networks (both models) and the extreme gradient boosting machine. One argument which speaks for the nnet-model as well as the xgboost-model is their **computational speed**. When implementing artificial **neural networks**, it is to be taken into account that they are much more **data hungry** than e.g. **gradient boosting models**, which can **also** be **applied to small to mid-range data sizes**.

The results have to be seen in the context of the **available data quality**. First of all, the data was imbalanced. This imbalance is in the nature of loan exposures and could be remedied quite easily. Another, more serious issue was the fact that the input features had a low correlation to the dependent variable 'default_payment'. At the same time, the correlation among the input parameters was partly very high. This mixture **puts limits on the possible performance** of the machine learning tools. And still, the more sophisticated models could correctly predict more than two thirds of the observations in the test data, which is rather impressive.

Of course, this miniseries could only scratch the surface of machine learning topics. A lot more would have to be said with respect to e.g. the pre-processing and preparing of data, techniques of feature selection, the tuning of models or the creation of more sophisticated ensembles. But I think you got the point.

**...real estate**

**Operating Cash Flow** and **short-term liquidity** are crucial issues in the daily operations of a company. Real estate is no exception to this. Therefore, anything which could disrupt or negatively impact those areas is to be closely monitored. Tenant default is one of those disruptions.

Hence, **tenant scoring** can be a straightforward way to cover this. In this context, tenant portfolios run on the same principles as credit portfolios. Default is the variable of interest. Tenant portfolios, as is the case with credit portfolios, (hopefully) have non-default as the clear majority class and therefore represent an imbalanced data set. With the right framework in place, internal company data can be generated in order to properly 'feed' those machine learning tools.

In other words, a **tenant scoring model in combination with** a **liquidity risk module** creates a **powerful management information tool** with low maintenance needs but highly effective and timely warning procedures.

**References**

Hands-On Machine Learning with R by Bradley Boehmke and Brandon M. Greenwell, 2020

neuralnet: Training of Neural Networks by Frauke Günther and Stefan Fritsch published in Contributed Research Articles/ June 1, 2010

Using neural networks for credit scoring: a simple example by Bart published in Rbloggers/ July 4, 2013

Neural Network Models in R by Avinash Navlani published in DataCamp/ December 9, 2019

Classifications in R: Response Modeling/ Credit Scoring/ Credit Rating using Machine Learning Techniques by Ariful Mondal/ September 20, 2016

Credit Risk Prediction Using Artificial Neural Network Algorithm by Shruti Goyal published in Data Science Central/ March 14, 2018

Tales of a Traveller - Journey through the woods of Artificial Intelligence and Machine Learning by Jai Sharma/ n.a.

Convolutional Neural Networks for Visual Recognition by Fei-Fei Li, Justin Johnson and Serena Yeung/ Stanford University (CS231n), Spring 2019

Statistics is Freaking Hard: WTF is Activation Function by Prateek published in towards data science/ August 16, 2017

Fitting a neural network in R; neuralnet package by Michy Alice published in Rbloggers/ September 23, 2015

Visualizing neural networks from the nnet package by beckmw published in Rbloggers/ March 4, 2013

ANN Classification with 'nnet' Package in R by Rizka Yolanda published in Medium/ June 26, 2019

Why is Stochastic Gradient Descent...? by Abhishek Mehta published in Medium/ September 11, 2019

Gradient boosting in R by Anish Singh Walia published in datascience+/ February 16, 2018

XGBoost Documentation/ n.a.

XGBoost Algorithm: Long May She Reign! by Vishal Morde published in towards data science/ April 8, 2019

Ridge and Lasso Regression: L1 and L2 Regularization by Saptashwa Bhattacharyya published in towards data science/ September 26, 2018