- Christian Schitton

# New Generation in Credit Scoring: Machine Learning (Part 1)

Updated: Jun 16, 2020

**Intro**

In this article we look at possibilities to move from risk class-based credit scoring systems to __default-based scoring systems__. The overall goal is to minimise the default risk in a credit portfolio. To date, financial institutions have scored their credit clients according to quantitative and qualitative risk parameters and, based on the scoring result, classified them into different risk categories.

Here we will examine whether internally available data can be used to directly __predict__ the __probability of default__ of individual clients, and we will monitor how accurate those predictions might be. This article is partly based on the works of Rahim Rasool, and of I-Cheng Yeh and Che-hui Lien, so much of the credit is theirs; references are listed below.

**The Data Pool**

As an example, a portfolio of 30,000 __credit card clients__ of a Taiwanese financial institution is examined. The material is from the year 2005. The raw data consists of the following information:

- credit card limit (LIMIT_BAL): amounts in New Taiwan dollars
- gender (SEX): 1...male, 2...female
- education (EDUCATION): 1...graduate school, 2...university, 3...high school, 4...others
- marriage status (MARRIAGE): 1...married, 2...single, 3...others
- age (AGE): in years
- history of payments in the previous months (PAY_0, PAY_2 to PAY_6): -1...pay duly, 1...payment delay for 1 month, 2...payment delay for 2 months, ..., 9...payment delay for 9 months and above (the values -2 and 0 also appear in the raw data without being covered by this coding scheme, as can be seen in the extract further below)
- amount of bill statements (BILL_AMT1 to BILL_AMT6): bill amounts from the previous months
- amount of previous payments (PAY_AMT1 to PAY_AMT6): amounts paid in the respective previous months
- default classification (default_payment): 1...default, 0...no default

Summarised, we have 30,000 observations with 23 input features as well as the dependent variable 'default_payment', which is the object of interest. After preprocessing and cleaning up the data, it can be seen that some of the input features have a very low correlation with the outcome variable 'default_payment' (i.e. the light areas on the default_payment line in the correlation graph below).
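
Such a correlation screening might be sketched as follows. This is a minimal illustration on simulated stand-in data, not the article's actual code; the feature names follow the article, but the data, the helper names and the 0.1 threshold are assumptions:

```r
# simulated stand-in for a slice of the cleaned credit card frame
set.seed(1)
n  <- 1000
df <- data.frame(
  LIMIT_BAL = rnorm(n),
  AGE       = rnorm(n),
  PAY_0     = rnorm(n)
)
df$default_payment <- as.numeric(df$PAY_0 + rnorm(n) > 0)

# correlation of every input feature with the outcome variable
cors <- cor(df[, setdiff(names(df), "default_payment")], df$default_payment)

# drop features whose absolute correlation falls below a chosen threshold
low_cor <- rownames(cors)[abs(cors) < 0.1]   # the 0.1 cut-off is illustrative
df_red  <- df[, !(names(df) %in% low_cor), drop = FALSE]

names(df_red)   # PAY_0 and default_payment survive on this toy data
```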

Hence, the feature space is reduced accordingly and the following six input features are eliminated and no longer taken into account: AGE and BILL_AMT2 to BILL_AMT6. Here is a short extract of the credit card portfolio (first six observations) finally ready for further examination:

```
  LIMIT_BAL SEX EDUCATION MARRIAGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
1     20000   2         2        1     2     2    -1    -1    -2
2    120000   2         2        2    -1     2     0     0     0
3     90000   2         2        2     0     0     0     0     0
4     50000   2         2        1     0     0     0     0     0
5     50000   1         2        1    -1     0    -1     0     0
6     50000   1         1        2     0     0     0     0     0
  PAY_6 BILL_AMT1 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5
1    -2      3913        0      689        0        0        0
2     2      2682        0     1000     1000     1000        0
3     0     29239     1518     1500     1000     1000     1000
4     0     46990     2000     2019     1200     1100     1069
5     0      8617     2500    36681    10000     9000      689
6     0     64400     2500     1815      657     1000     1000
  PAY_AMT6 default_payment
1        0               1
2     2000               1
3     5000               0
4     1000               0
5      679               0
6      800               0
```

So, in the final framework 17 input features are taken into account. Before the whole data set is split into a training set of 21,000 observations and a test set of 9,000 observations, it is standardised in order to prevent input features with higher variability from dominating the scene.
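
This standardise-then-split step might be sketched as follows. The data here is simulated as a stand-in for the real portfolio, and the object names (`credit`, `train`, `test`) as well as the seed are our own assumptions; only the dimensions follow the article, with default_payment in column 18:

```r
set.seed(42)   # illustrative seed for reproducibility

# simulated stand-in for the 30,000 x 18 credit card frame
credit <- data.frame(matrix(rnorm(30000 * 17), ncol = 17))
credit$default_payment <- rbinom(30000, 1, 0.22)

# standardise the 17 input features only, not the outcome in column 18
X          <- scale(credit[, -18])
credit_std <- data.frame(X, default_payment = credit$default_payment)

# random 21,000 / 9,000 split into training and test set
idx   <- sample(seq_len(nrow(credit_std)), size = 21000)
train <- credit_std[idx, ]
test  <- credit_std[-idx, ]
```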

**Predicting with Logistic Regression**

Technically, logistic regression is not a machine learning tool. But it is made for questions asking for binary outcomes (1...in default, 0...not in default), and with its easy-to-apply approach it serves as a __good foundation__ for any further, more complicated machine learning tool kit. A switch to an upgraded machine learning technique (often accompanied by higher requirements regarding data engineering and data quantity) only makes sense if the prediction accuracy can be improved.

The __logistic regression model__ is set up as follows:

```
library(dplyr)   # for the pipe operator %>%
library(caret)   # for confusionMatrix()

# fit the model on the training set, then predict default probabilities on the test set
glm_fit   <- glm(default_payment ~ ., data = train, family = "binomial")
p_hat_glm <- predict(glm_fit, test, type = "response")

# classify as default where the predicted probability exceeds 0.5
y_hat_glm <- ifelse(p_hat_glm > 0.5, 1, 0) %>% factor()

confusionMatrix(data = y_hat_glm,
                reference = factor(test$default_payment))$overall["Accuracy"]
```

In this case, the logistic regression model has an overall accuracy of around 80.8%. At first glance, this seems suitable. Breaking down the test results of the logistic regression gives the following picture:

```
table(y_hat_glm, test[, 18])

                   true value
                       0     1
predicted value 0   6817  1544
                1    186   453
```

As can be seen, the data set is dominated by non-default cases (value = 0). So, although the model has an overall __accuracy__ of 80.8%, the logistic regression predicts only 22.7% of the default cases as default cases, which is not really convincing. In other words, due to the high prevalence of non-default cases in the data pool there is a very high __sensitivity__ (i.e. predicting non-default correctly) of around 97.3% but a low __specificity__ (i.e. predicting default correctly) of just 22.7%.
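
Following the article's convention (sensitivity = non-defaults classified correctly, specificity = defaults classified correctly), these rates can be recomputed directly from the confusion table above:

```r
# confusion table from above: rows = predicted value, columns = true value
cm <- matrix(c(6817, 186, 1544, 453), nrow = 2,
             dimnames = list(predicted = c("0", "1"), true = c("0", "1")))

sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 6817 / 7003
specificity <- cm["1", "1"] / sum(cm[, "1"])   #  453 / 1997
accuracy    <- sum(diag(cm)) / sum(cm)         # 7270 / 9000

round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 3)
#> sensitivity specificity    accuracy
#>       0.973       0.227       0.808
```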

**Improving the Logistic Regression Model**

In order to improve results, a __metric__ called __F1__ is implemented. It relates sensitivity and specificity by means of a weighted harmonic average as follows:

```
F = 1 / ( beta^2/(1+beta^2) * 1/sensitivity  +  1/(1+beta^2) * 1/specificity )
```

where the parameter beta tells how important sensitivity is compared to specificity (beta = 1 means neutral, beta > 1 means sensitivity is favoured over specificity, and vice versa). The point is to find a better mix between sensitivity and specificity and hence to improve the overall result.
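
As an R helper, the weighted harmonic average above might be written like this (the function name and the example inputs are our own, purely for illustration):

```r
# weighted harmonic average of sensitivity and specificity, as defined above
f_beta <- function(sensitivity, specificity, beta = 1) {
  1 / (beta^2 / (1 + beta^2) / sensitivity +
       1      / (1 + beta^2) / specificity)
}

f_beta(0.8, 0.8, beta = 1)     # equal rates are reproduced: 0.8
f_beta(0.9, 0.5, beta = 1.45)  # weighting shifted towards sensitivity, about 0.716
```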

In the current logistic regression frame, with a decision benchmark of p_hat_glm > 0.5 and beta = 1, the F1 metric equals 0.887. Simulating different ranges of the decision benchmark and of the beta value with the aim of optimising the F1 metric, the logistic regression model performs best with the following combination:

decision benchmark: p_hat_glm > 0.28

beta = 1.45
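
The simulation over benchmark and beta ranges can be sketched as a simple grid search. The code below runs on simulated toy predictions, not the article's portfolio, and the grid ranges, variable names and data-generating assumptions are all ours:

```r
# F metric as defined in the text
f_beta <- function(sens, spec, beta) {
  1 / (beta^2 / (1 + beta^2) / sens + 1 / (1 + beta^2) / spec)
}

# toy stand-in for predicted default probabilities and true test labels
set.seed(7)
y_true <- rbinom(9000, 1, 0.22)
p_hat  <- plogis(-1.3 + 1.5 * y_true + rnorm(9000))

# evaluate every cutoff/beta combination on the grid
grid <- expand.grid(cutoff = seq(0.05, 0.50, by = 0.01),
                    beta   = seq(1.00, 1.50, by = 0.05))
grid$F_metric <- mapply(function(cutoff, beta) {
  y_hat <- as.numeric(p_hat > cutoff)
  sens  <- mean(y_hat[y_true == 0] == 0)   # non-defaults classified correctly
  spec  <- mean(y_hat[y_true == 1] == 1)   # defaults classified correctly
  f_beta(sens, spec, beta)
}, grid$cutoff, grid$beta)

grid[which.max(grid$F_metric), ]   # best combination on this toy data
```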

Based on the test data and this combination, the model predicts non-default correctly in 87.9% of cases (= sensitivity) and default correctly in 49.1% of cases (= specificity), moving the overall accuracy to 79.3% (F1 equals 0.919). The model accuracy is slightly lower than in the initial approach, but the proper prediction of the probability of default more than doubled.

It is to be noted that the decision benchmark/beta mix chosen above __improves the specificity of the model significantly without too many non-default cases being classified as 'probability of default'__. See the table below:

```
                   true value
                       0     1
predicted value 0   6157  1016
                1    846   981
```

To see what I mean, here are the results for a decision benchmark of p_hat_glm > 0.11 and beta = 1.40. Here too the F1 metric is comparably high at 0.916, and specificity jumps through the roof at 88.6%. However, sensitivity is down to 24.9%, leaving an overall accuracy of just 39%. Keeping the regression model this way would mean that, on the one hand, most of the default candidates could be covered but, on the other hand, a tremendous number of non-default clients would be classified with 'probability of default'. A __lot of credit business__ for the financial institution __would be lost__, which is totally unacceptable. The table below shows the respective dimensions:

```
                   true value
                       0     1
predicted value 0   1744   228
                1   5259  1769
```

**Conclusion**

Although the logistic regression model could be improved, the 'prediction quality' for default events is still quite low. Any increase in this direction comes at the cost of losing business, as a lot of non-default clients would drift into the probability-of-default area.

Reasons can be found in the data itself (this topic will be handled in more detail in another article) and in the limitations of logistic regression, especially its inability to capture non-linear relationships in the data. In the next article we will have a look at machine learning techniques such as k-nearest neighbours, quadratic discriminant analysis (QDA), decision trees and random forests, and find out whether the credit scoring results can be improved.

**References**

- Rahim Rasool, "Logistic Regression in R: A Classification Technique to Predict Credit Card Default", R-bloggers, November 12, 2019
- I-Cheng Yeh and Che-hui Lien, "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients", ScienceDirect, 2009
- Tilman M. Davies, "The Book of R", 2016
- Prof. Rafael Irizarry, "Data Science: Machine Learning", Harvard University via edx.org (HarvardX: PH125.8x)