• Christian Schitton

New Generation in Credit Scoring: Machine Learning (Part 1)

Updated: Jun 16, 2020


Intro


In this article we look at possibilities to move from risk class-based credit scoring systems to default-based scoring systems. The overall goal is to minimise the default risk in a credit portfolio. So far, financial institutions have scored their credit clients according to quantitative and qualitative risk parameters and, based on the scoring result, classified them into different risk categories.

Here we will examine whether internally available data can be used to directly predict the probability of default of individual clients, and we will assess how accurate those predictions might be. This article is partly based on the works of Rahim Rasool, I-Cheng Yeh and Che-hui Lien, so much of the credit is theirs; references are given below.

The Data Pool

As an example, a portfolio of 30,000 credit card clients of a Taiwanese financial institution is examined. The material is from the year 2005. The raw data consists of the following information:

  • credit card limit (LIMIT_BAL): amounts are in New Taiwan dollars

  • gender (SEX): 1...male, 2...female

  • education (EDUCATION): 1...graduate school, 2...university, 3...high school, 4...others

  • marriage status (MARRIAGE): 1...married, 2...single, 3...others

  • age (AGE): in years

  • history of payments in the previous months (PAY_0, PAY_2 to PAY_6): -1...pay duly, 1...payment delay for 1 month, 2...payment delay for 2 months, ..., 9...payment delay for 9 months and above (the values -2 and 0 also appear in the data, as the extract below shows, although they are not documented in the raw data description)

  • amount of bill statements (BILL_AMT1 to BILL_AMT6): bill amounts from the previous months

  • amount of previous payments (PAY_AMT1 to PAY_AMT6): amounts paid in the respective previous months

  • default classification (default_payment): 1...default, 0...no default


Summarised, we have 30,000 observations with 23 input features as well as the dependent variable 'default_payment', which is the object of interest. After preprocessing and cleaning up the data, it can be seen that some of the input features have a very low correlation with the target variable 'default_payment' (i.e. the light areas on the default_payment-line in the graph below).
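Such a correlation screen can be sketched in a few lines of R. The data frame below is only a toy stand-in for the cleaned credit card data set (the column names follow the article, the values are made up), and the cutoff of 0.1 is an illustrative assumption, not the threshold actually used:

```r
# Illustrative sketch of the correlation screen described above; `df`
# stands in for the cleaned credit card data set (values are made up).
df <- data.frame(AGE             = c(24, 26, 34, 37, 57, 29),
                 PAY_0           = c(2, -1, 0, 0, -1, 0),
                 BILL_AMT2       = c(3102, 1725, 14027, 48233, 5670, 57069),
                 default_payment = c(1, 1, 0, 0, 0, 0))

# correlation of every input feature with the target variable
inputs <- setdiff(names(df), "default_payment")
cors   <- cor(df[inputs], df$default_payment)

# drop features whose absolute correlation falls below a chosen cutoff
cutoff <- 0.1
keep   <- rownames(cors)[abs(cors[, 1]) >= cutoff]
keep
```

With the real data, `keep` would then hold the 17 features that survive the screen.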


Hence, the feature space is reduced accordingly and the following six input features are eliminated and no longer taken into account: AGE and BILL_AMT2 to BILL_AMT6. Here is a short extract of the credit card portfolio (first six observations), now ready for further examination:


  LIMIT_BAL SEX EDUCATION MARRIAGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
1      2000   2         2        1     2     2    -1    -1    -2     
2    120000   2         2        2    -1     2     0     0     0
3     90000   2         2        2     0     0     0     0     0
4     50000   2         2        1     0     0     0     0     0
5     50000   1         2        1    -1     0    -1     0     0
6     50000   1         1        2     0     0     0     0     0

  PAY_6 BILL_AMT1 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 
1    -2      3913        0      689        0        0        0
2     2      2682        0     1000     1000     1000        0
3     0     29239     1518     1500     1000     1000     1000
4     0     46990     2000     2019     1200     1100     1069
5     0      8617     2500    36681    10000     9000      689
6     0     64400     2500     1815      657     1000     1000

  PAY_AMT6    default_payment
1        0                  1       
2     2000                  1
3     5000                  0
4     1000                  0
5      679                  0
6      800                  0

So, in the final framework 17 input features are taken into account. Before the whole data set is split up into a training set of 21,000 observations and a test set of 9,000 observations, it is standardised in order to prevent input features with higher variability from dominating the scene.
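A minimal sketch of this standardisation and 70/30 split, using a small simulated data frame in place of the real portfolio (column names are illustrative), might look like this:

```r
# Sketch of the standardisation and train/test split described above;
# `df` is simulated here and only stands in for the 17 selected input
# features plus `default_payment`.
set.seed(1)
df <- data.frame(LIMIT_BAL       = runif(30, 10000, 500000),
                 PAY_0           = sample(-2:8, 30, replace = TRUE),
                 default_payment = rbinom(30, 1, 0.22))

# standardise the inputs only, not the 0/1 target
inputs <- setdiff(names(df), "default_payment")
df[inputs] <- scale(df[inputs])

# 70/30 split (21,000 training / 9,000 test observations in the article)
train_idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
```

Standardising before the split is a simplification here; strictly, the scaling parameters should be estimated on the training set only.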

Predicting with Logistic Regression

Technically, logistic regression is not a machine learning tool. But it is made for questions with binary outcomes (1...in default, 0...not in default), and with its easy-to-apply approach this tool serves as a good foundation for further, more complicated machine learning tool kits. A switch to an upgraded machine learning technique (partly accompanied by higher requirements regarding data engineering and data quantity) only makes sense if the prediction accuracy can be improved.

The logistic regression model is fitted as follows:


library(dplyr)   # provides the pipe operator %>%
library(caret)   # provides confusionMatrix()

# fit the logistic regression on the training set
glm_fit <- glm(default_payment ~ ., data = train, family = "binomial")

# predicted default probabilities for the test set
p_hat_glm <- predict(glm_fit, test, type = "response")

# classify as default (1) if the predicted probability exceeds 0.5
y_hat_glm <- ifelse(p_hat_glm > 0.5, 1, 0) %>% factor()

confusionMatrix(data = y_hat_glm,
                reference = factor(test$default_payment))$
                overall["Accuracy"]


In this case, the logistic regression model has an overall accuracy of around 80.8%. At first glance, this seems suitable. Breaking down the test results of the logistic regression gives the following picture:



table(y_hat_glm,test[,18])

                          true value
                             0     1
       predicted value 0  6817  1544
                       1   186   453
     

As can be seen, the data set is dominated by non-default cases (value = 0). So although the overall accuracy of the model is 80.8%, the logistic regression predicts only 22.7% of the default cases correctly, which is not really convincing. In other words, due to the high prevalence of non-default cases in the data pool there is a very high sensitivity (i.e. predicting non-default correctly) of around 97.3% but a low specificity (i.e. predicting default correctly) of just 22.7%.
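These rates follow directly from the confusion table above and can be checked with a few lines of R:

```r
# reproduce the quoted rates from the confusion table above
tab <- matrix(c(6817, 186, 1544, 453), nrow = 2,
              dimnames = list(predicted = c(0, 1), true = c(0, 1)))

accuracy    <- (tab[1, 1] + tab[2, 2]) / sum(tab)  # (6817 + 453) / 9000
sensitivity <- tab[1, 1] / sum(tab[, 1])           # 6817 / 7003
specificity <- tab[2, 2] / sum(tab[, 2])           # 453 / 1997

round(c(accuracy, sensitivity, specificity), 3)
# 0.808 0.973 0.227
```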

Improving the Logistic Regression Model

In order to improve results, a metric called F1 is implemented. It relates sensitivity and specificity by means of a weighted harmonic average as follows:



                                 1
   F1 = -----------------------------------------------------
           beta^2         1            1             1
        ----------- * ----------- + ----------- * -----------
         1 + beta^2   sensitivity    1 + beta^2   specificity

where the parameter beta tells how important sensitivity is compared to specificity (beta = 1 is neutral, beta > 1 favours sensitivity over specificity and vice versa). The point is to find a better mix between sensitivity and specificity and thus to improve the overall result.
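The weighted harmonic average above translates directly into a small R function (the function name `f_beta` and the example rates are illustrative):

```r
# the weighted harmonic average above as an R function; `beta` weighs
# sensitivity against specificity (beta = 1 treats both equally)
f_beta <- function(sensitivity, specificity, beta = 1) {
  1 / (beta^2 / (1 + beta^2) / sensitivity +
         1   / (1 + beta^2) / specificity)
}

# beta = 1 reduces to the plain harmonic mean of the two rates:
f_beta(0.9, 0.5, beta = 1)     # = 2 * 0.9 * 0.5 / (0.9 + 0.5) = 0.6429
f_beta(0.9, 0.5, beta = 1.45)  # shifts the weight towards sensitivity
```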


In the current logistic regression frame, with a decision benchmark of p_hat_glm > 0.5 and beta = 1, the F1 metric equals 0.887. Simulating different ranges of the decision benchmark and of beta values with the aim of optimising the F1 metric, the logistic regression model performs best with the following combination:


  • decision benchmark: p_hat_glm > 0.28

  • beta = 1.45
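A grid search of this kind can be sketched as follows. The predicted probabilities and true labels are simulated here so the snippet runs stand-alone; with the real model one would use p_hat_glm and the test labels instead:

```r
# sketch of the decision-benchmark search described above; p_hat and
# y_true are simulated stand-ins for the model output and test labels
set.seed(1)
y_true <- rbinom(9000, 1, 0.22)
p_hat  <- plogis(rnorm(9000, mean = ifelse(y_true == 1, 0, -1.5)))

f_beta <- function(sens, spec, beta) {
  1 / (beta^2 / (1 + beta^2) / sens + 1 / (1 + beta^2) / spec)
}

cutoffs <- seq(0.05, 0.95, by = 0.01)
f_vals  <- sapply(cutoffs, function(ct) {
  y_hat <- as.integer(p_hat > ct)
  sens  <- mean(y_hat[y_true == 0] == 0)  # non-defaults predicted right
  spec  <- mean(y_hat[y_true == 1] == 1)  # defaults predicted right
  f_beta(sens, spec, beta = 1.45)
})

# decision benchmark with the highest F value for this beta
cutoffs[which.max(f_vals)]
```

In the article the same search is run over a range of beta values as well, which would simply add an outer loop over beta.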


Based on the test data and with this combination, the model predicts non-default correctly in 87.9% of cases (= sensitivity) and default correctly in 49.1% of cases (= specificity), moving the overall accuracy to 79.3% (F1 equals 0.919). The model accuracy is slightly lower than in the initial approach, but the proper prediction of default cases more than doubled.

It is to be noted that the decision benchmark/beta mix chosen above improves the specificity of the model significantly without too many non-default cases being classified as a 'probability of default'. See the table below:

                      
                      true value                                                    
                         0     1        
   predicted value 0  6157  1016
                   1   846   981
                   

To see what I mean, here are the results for a decision benchmark of p_hat_glm > 0.11 and beta = 1.40. Here, too, the F1 metric is comparably high at 0.916, and specificity goes through the roof at 88.6%. But sensitivity is down to 24.9%, leaving an overall accuracy of just 39%. Keeping the regression model this way would mean that, on the one hand, most of the default candidates would be covered, but on the other hand a tremendous number of non-default clients would be classified with a 'probability of default'. A lot of credit business would be lost for the financial institution, which is totally unacceptable. The table below shows the respective dimensions:



                       true value
                          0     1
    predicted value 0  1744   228
                    1  5259  1769


Conclusion

Although the logistic regression model could be improved, the 'prediction quality' for default events is still quite low. Any increase in this direction comes at the cost of losing business, as a lot of non-default clients would drift into the probability-of-default area.

Reasons can be found in the data itself (this topic will be handled in more detail in another article) and in the limitations of logistic regression, especially its inability to capture non-linear relationships in the data. In the next article we will have a look at some machine learning kits like k-nearest neighbours, quadratic discriminant analysis (QDA), decision trees or random forests and find out whether the credit scoring results can be improved.

References

Logistic Regression in R: A Classification Technique to Predict Credit Card Default by Rahim Rasool, published on R-bloggers, November 12, 2019

The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients by I-Cheng Yeh and Che-hui Lien, published in Expert Systems with Applications (available via ScienceDirect), 2009

The Book of R by Tilman M. Davies, 2016

Data Science: Machine Learning by Prof. Rafael Irizarry/ Harvard University via edx.org (HarvardX: PH125.8x)