Christian Schitton

How to solve data availability problems by creating synthetic data?

Lack of Data

Digitalisation and the adoption of high-tech analytics standards more often than not run into a natural constraint: the availability of suitable data.

Limitations can be numerous, but three main categories have proven to be the most problematic:

Restricted Data: Data can be highly sensitive, confidential or simply restricted by law.

Using restricted data carries the risk of data leakage, a compromised data pool, unauthorised use of information, a breach of business secrets or health care data, or the misuse of clients’ data.

Laws and regulations, e.g. the European General Data Protection Regulation (GDPR) or national bank secrecy laws, provide for severe penalties for companies and individuals should those risks materialise.

Comparability of Data: Data is available in principle, but the data set is too small and/or too hard to compare with other data sets to be useful for high-tech but very data-hungry analytics instruments such as Deep Learning.

Just imagine a small to mid-sized non-performing loan portfolio which covers only a certain geographical area or includes a special type of collateral.

It might be hard to find similar data with which to increase the sample size for further processing.

Low Frequency Data: Sometimes the “nature” of data creation is slow, i.e. data is generated at too low a frequency.

The difference from high frequency data can be overwhelming. Compare, for example, the broad field of the Internet of Things (IoT), where sensors sometimes produce data points at intervals of milliseconds, with the availability of market data in commercial real estate. The latter receives market updates on a quarterly basis (on a monthly basis if you are lucky). In any case, those are two different worlds in terms of data analytics possibilities.

Low frequency data poses a serious challenge for any kind of state-of-the-art analytics tool.

How to solve constrained data problems?

There are several ways to address those constraints.

Shared data pools or open data initiatives are one of them. In those cases, though, a lot depends on the willingness of potential participants, and certain limitations, like legal regulations on sensitive data, can hardly be overcome by such initiatives.

Luckily, data analytics offers very interesting and effective methodologies where we put the original data aside and work with synthetic data instead.

Synthetic Data — How It Works

Synthetic data is created from the original data without affecting the original data or the identity of any of its elements.

It has an extremely high level of similarity to the original data, reacts the same way to interventions and gives similar results in analytical checks.

On the other hand, synthetic data prevents unauthorised observers from retrieving information from the original data set.

The way to satisfy those seemingly conflicting goals is, as you may have guessed, through statistics.

Statistics helps to replace some or all of the original observed values with values sampled from appropriate probability distributions. In doing so, the essential statistical features of the original data are preserved.

Statistics? Probability distributions? Synthetic? What…?

Let’s take it easy and start at the level of a single variable.

A variable (or feature) is a parameter in a system of interest which might have an influence on some other variables.

This influence may act on other influencing features (i.e. predictors) or on the variable of final interest (i.e. the dependent variable). As an example, the variable ‘income level’ as a predictor could have an impact on a borrower’s capability to repay a loan (i.e. the dependent variable ‘credit default’).

Each such variable shows a behaviour which can be described by statistical properties and can therefore be expressed by means of a probability distribution. In this case, we talk about marginal distributions.

For instance, take the (market) variable ‘investment volume’ for a certain real estate class in a local market. The graph below shows the ‘behaviour’ of this variable in terms of how often a certain investment range appears in this market within a specific period of time. Obviously, an investment range between MEUR 150 and MEUR 300 is most common in this market.

The behaviour of this feature can be replicated by a marginal distribution as follows:
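
As a rough illustration of this step, here is a minimal Python sketch that fits a marginal distribution to one variable and samples synthetic values from it. The observations are made-up toy values, and the choice of a lognormal distribution is an assumption made purely for illustration; neither is taken from the market data discussed here.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical toy observations of investment volume in MEUR (not the article's data).
observed = [random.lognormvariate(math.log(220), 0.35) for _ in range(200)]

# Fit a lognormal marginal (an assumed distribution family):
# estimate the mean and standard deviation of log(volume).
logs = [math.log(v) for v in observed]
mu, sigma = statistics.fmean(logs), statistics.stdev(logs)

# Draw synthetic values from the fitted marginal distribution.
synthetic = [random.lognormvariate(mu, sigma) for _ in range(200)]

print(f"observed  median: {statistics.median(observed):.1f} MEUR")
print(f"synthetic median: {statistics.median(synthetic):.1f} MEUR")
```

The synthetic values are drawn from the fitted distribution only; no individual observation is copied over.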

The next phase in creating synthetic data is to take the relationships among the variables involved into account. This is where conditional distributions come in.

To cut a long story short, we simply want to know how one feature behaves given that another feature shows a certain outcome.

Here is an example of the investment volume in a market conditioned on the change in the investment yield in the same market. On the left, the blue points are the original data as observed in this market environment. On the right, the red points are the simulated data based on the conditional probabilities as given by the original features.
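
A conditional distribution of this kind can be sketched in a few lines of Python. The example below assumes a simple linear relationship with normally distributed noise between yield change and investment volume; both the toy data and that model choice are assumptions for illustration, not the method behind the graphs described above.

```python
import random
import statistics

random.seed(7)

# Hypothetical toy data (not the article's): yield change in percentage points
# and investment volume in MEUR, with volume falling as yields rise.
yield_change = [random.gauss(0.0, 0.25) for _ in range(300)]
volume = [220.0 - 120.0 * x + random.gauss(0.0, 30.0) for x in yield_change]

# Estimate the conditional mean of volume given yield change by least squares.
mx, my = statistics.fmean(yield_change), statistics.fmean(volume)
sxx = sum((x - mx) ** 2 for x in yield_change)
sxy = sum((x - mx) * (y - my) for x, y in zip(yield_change, volume))
slope = sxy / sxx
intercept = my - slope * mx

# The spread of the residuals approximates the conditional standard deviation.
residuals = [y - (intercept + slope * x) for x, y in zip(yield_change, volume)]
resid_sd = statistics.stdev(residuals)

def sample_volume_given(x):
    """Draw one synthetic volume from the fitted conditional normal distribution."""
    return random.gauss(intercept + slope * x, resid_sd)

# Simulate one synthetic volume for every observed yield change.
synthetic_volume = [sample_volume_given(x) for x in yield_change]

print(f"fitted slope: {slope:.1f} MEUR per percentage point")
```

Plotting `volume` and `synthetic_volume` against `yield_change` would give a pair of scatter plots comparable in spirit to the blue and red point clouds.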

Putting all those things together in a joint distribution finally enables us to create synthetic data.
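
How the pieces fit together is not spelled out here, but one standard way to read it is the chain rule of probability: a joint distribution of k variables can always be written as one marginal distribution multiplied by a sequence of conditional distributions. The notation below (variables x_1 to x_k) is introduced only for this illustration.

```latex
p(x_1, x_2, \dots, x_k)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_k \mid x_1, \dots, x_{k-1})
```

Under this reading, a synthetic record is drawn one variable at a time: first x_1 from its marginal, then each further variable from its distribution conditional on the values already drawn.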

The synthetic data are very similar, almost identical, to the original data pool.

In addition, because the generation of the synthetic data focuses on the processes which produced the original data rather than on the original observations themselves, there is no “physical” connection between the original data set and the synthetic data set.

Revealing the original data via the synthetic data is therefore almost impossible.

Hence, two conflicting goals (similarity and non-exposure of the original data) are met with one approach.

Synthetic Data — An Example

Now, let’s see how this works in practical terms.

As an example, we take a portfolio of credit card loans. This portfolio consists of 30,000 individual loan exposures and is defined by variables like loan amount, education, sex, marital status, payment status per month, invoice status per month and similar. The feature of interest is of course the default or non-default of the respective borrower.

Following the steps described in the previous section, a synthetic data set is generated which closely resembles the original loan portfolio. Here is an excerpt of some of the features of the credit card portfolio. As you can see, both data sets (original and synthetic) are very close:

Of course it does not stop there.

With respect to the problem of comparability of data sets, the same original data can also be replicated several times with synthetic data.

As a consequence, similar but not authentic data sets can be generated in order to train more sophisticated analytical tools, e.g. machine learning applications.

Coming back to the credit card portfolio, here is a comparison of the original portfolio variable ‘credit limit’ with several synthetic offspring:

The similarity of the original and synthetic data sets is obvious.
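
Producing several synthetic offspring from one original column could look roughly like the sketch below. The ‘credit limit’ values are hypothetical stand-ins, the lognormal fit is an assumption, and the helper `synthetic_replicate` is a name invented for this example.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical stand-in for the 'credit limit' column (toy values, not the real portfolio).
credit_limit = [random.lognormvariate(math.log(140_000), 0.8) for _ in range(1_000)]

# Fit one lognormal marginal (an assumed distribution family) to the original column.
logs = [math.log(v) for v in credit_limit]
mu, sigma = statistics.fmean(logs), statistics.stdev(logs)

def synthetic_replicate(n, seed):
    """Return one synthetic copy of the column, drawn with its own random seed."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

# Several synthetic offspring of the same original column.
replicates = [synthetic_replicate(len(credit_limit), seed) for seed in range(5)]

print(f"original    median: {statistics.median(credit_limit):,.0f}")
for i, rep in enumerate(replicates, start=1):
    print(f"replicate {i} median: {statistics.median(rep):,.0f}")
```

Each replicate is a fresh draw from the same fitted distribution, so the replicates differ from one another row by row while sharing the statistical profile of the original column.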

Synthetic Data in Action

In order to see whether we can expect results from the synthetic data to be similar to those from the original data, we developed a simple credit scoring model (i.e. prediction of default or non-default of borrowers) driven by logistic regression.

Without going into detail, we focus on the performance of the credit scoring model. Here are the model performance results of the original data set:

and here are the model performance results of the synthetic data set:

It is quite clear that both sets of results show a high level of similarity.

Therefore, we see that the synthetic data is capable of replacing the original data in the credit scoring model.
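
The kind of comparison described here can be reproduced in miniature. The sketch below is a toy with a single predictor and made-up data, not the credit scoring model or portfolio discussed above. The synthesiser (a default rate plus one normal distribution of income per class) and the hand-written Newton fit are assumptions chosen to keep the example free of external libraries.

```python
import math
import random
import statistics

random.seed(3)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of the (negative) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def accuracy(model, xs, ys):
    """Share of correct default / non-default predictions at a 0.5 cut-off."""
    b0, b1 = model
    hits = sum((sigmoid(b0 + b1 * x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical "original" portfolio: one standardised predictor (income level)
# and a default flag. Toy data only, not the article's credit card portfolio.
income = [random.gauss(0.0, 1.0) for _ in range(2000)]
default = [1 if random.random() < sigmoid(-1.0 - 1.2 * x) else 0 for x in income]

# Assumed synthesiser: the default rate plus a normal distribution of income per class.
p_default = statistics.fmean(default)
params = {}
for label in (0, 1):
    values = [x for x, y in zip(income, default) if y == label]
    params[label] = (statistics.fmean(values), statistics.stdev(values))

syn_default = [1 if random.random() < p_default else 0 for _ in range(2000)]
syn_income = [random.gauss(*params[y]) for y in syn_default]

model_orig = fit_logistic(income, default)
model_syn = fit_logistic(syn_income, syn_default)

# Score both models on the original data to compare like with like.
acc_orig = accuracy(model_orig, income, default)
acc_syn = accuracy(model_syn, income, default)
print(f"trained on original : coefficients {model_orig}, accuracy {acc_orig:.3f}")
print(f"trained on synthetic: coefficients {model_syn}, accuracy {acc_syn:.3f}")
```

If the synthesiser captures the relationship between predictor and default, the two models should end up with similar coefficients and similar accuracy on the original data. In practice a library implementation of logistic regression would normally replace the hand-written fit.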


Due to the ever-growing complexity of global data-generating mechanisms and the progress of data analytics tools, businesses are becoming more and more aware of the value of their data and are trying to limit its use. At the same time, data regulations are developing at the same speed, and we can expect an increasing number of restrictive laws on using data for analytical purposes.

In this context it becomes clear that methods such as synthetic data creation will become a core element of future analytical solutions, thanks to their flexibility and adaptability to a wide range of analytical tools.

We do have efficient ways to overcome current and upcoming constraints in the different industries with respect to data availability, data comparability, confidentiality or legal limitations.

Furthermore, synthetic data are definitely a serious alternative in the effort to improve the analytical capabilities of a corporate entity, especially with respect to operational risk and financial risk management problems.

It is a fast and cost-efficient way to incorporate state-of-the-art analytics and to move from pure Business Analytics to a more sophisticated Predictive Analytics.