Christian Schitton

Artificial intelligence and healthcare industry - synthetic data generation

Updated: Apr 26

Introduction


Working with original data is often restricted. Legal, ethical or business constraints frequently prevent the use of an original dataset, which in turn prevents the use of advanced artificial intelligence tools.


Imagine the potential of deep learning architectures for medical image classification in the healthcare industry. Deep learning models could take over the classification of diseases based on medical scans, relieving medical staff of this duty and giving doctors more time to focus on the patient.


The point here is that artificial intelligence applications have to be trained for their task, and this requires a lot of data. In most cases, however, the original data are locked away for the reasons mentioned above. This severely constrains the applicability of machine learning tools and deep learning architectures.


Artificial intelligence and synthetic data generation offer a way out, for the healthcare industry as well as for any other industry.


Creation by Noise


So, what in fact are synthetic data generation and synthetic datasets?


Synthetic datasets keep the properties of the original data. Nevertheless, tracing them back to the original data is extremely improbable, because synthetic data are built up from “noise” in a probabilistic way, without the generating model ever “seeing” the original data directly.


How is that even possible?


Let us say you are in possession of precious original data in the form of an image, such as this:



The artificial intelligence algorithm starts a process in which the original data are replaced by synthetic data built up from scratch or, in statistical parlance, from “noise”:



The synthetic data keeps the properties of the original data while not being retraceable back to it.


However, one does not want to end up with a useless, out-of-range copy of the original dataset, which in the case of our Mona Lisa might look something like this :)


There are several ways to achieve this. Generative Adversarial Networks, Markov Chain Generators or Gaussian Mixture Models are some of them.


In this article, we take Generative Adversarial Networks to explain how synthetic data is created from scratch/ noise.



Generative Adversarial Networks


A Generative Adversarial Network (GAN) comprises two Deep Learning structures. The first structure is called the Generator. The second structure is called the Discriminator.


In short, the task of the Generator is to create synthetic images (i.e. fake images) which closely resemble the original images. The Discriminator is an image classification tool which tries to prevent the Generator from getting its fake images accepted.


The basic structure of a GAN:



The basic idea of the Discriminator is that it should be able to distinguish between real and fake images. It is initially trained on the original dataset. Once the Generator produces fake images, those are used together with the original dataset to train the Discriminator. The goal is to constantly improve the classification accuracy of the Discriminator and to make it harder for the Generator to get its (fake) images accepted.
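To make the Discriminator's training input concrete, here is a minimal sketch in plain Python (a standalone illustration, not the R/Keras code this article is based on). It mixes original and generated samples into one labelled batch and scores predictions with binary cross-entropy. The label convention (1 = original, 0 = fake), the function names and the toy numbers are assumptions made for illustration.

```python
import math

REAL, FAKE = 1.0, 0.0  # assumed label convention: 1 = original, 0 = fake


def build_discriminator_batch(real_samples, fake_samples):
    """Combine original and generated samples into one labelled batch."""
    samples = list(real_samples) + list(fake_samples)
    labels = [REAL] * len(real_samples) + [FAKE] * len(fake_samples)
    return samples, labels


def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 and 1
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy stand-ins: each "image" is just a name here.
samples, labels = build_discriminator_batch(["orig_1", "orig_2"], ["fake_1", "fake_2"])

confident = [0.9, 0.8, 0.1, 0.2]  # a Discriminator that separates both classes well
undecided = [0.5, 0.5, 0.5, 0.5]  # a Discriminator that is only guessing

print(labels)
print(binary_cross_entropy(labels, confident) < binary_cross_entropy(labels, undecided))
```

The lower the loss on such a mixed batch, the better the Discriminator tells originals from fakes, and the harder the Generator's job becomes.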


The Generator starts by producing fake images out of a “rubble” of colours. In statistical terms, this rubble is called White Noise. After generating an image, the Generator sends it to the Discriminator, which decides whether the image is a fake or an original. Based on this feedback, the Generator “improves” its fake images: the feedback arrives in the form of a loss function, and an optimiser function adjusts the Generator's parameter weights accordingly.


So in fact, both parts improve constantly and compete with each other.


A very important point here is that the Generator never gets anywhere near the original dataset. It improves solely through the feedback provided by the Discriminator, which trains itself on the original dataset and the generated fake images. There is no direct connection between generation and checking, and therefore no direct connection between the Generator and the original images.
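The interplay described in this section can be sketched with a deliberately tiny, one-dimensional GAN in plain Python (again a standalone illustration, not the article's R/Keras model). Everything concrete here is an assumption: the Generator is a linear function G(z) = a * z + b, the Discriminator is a logistic function D(x) = sigmoid(w * x + c), the "original" data are numbers drawn around 4.0, and the learning rate and step count are arbitrary. The gradients are written out by hand for the usual binary cross-entropy losses.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminate(x, w, c):
    """Probability that sample x is an original one."""
    return sigmoid(w * x + c)


def discriminator_step(w, c, x_real, x_fake, lr):
    """One gradient step on -[log D(x_real) + log(1 - D(x_fake))]."""
    d_real = discriminate(x_real, w, c)
    d_fake = discriminate(x_fake, w, c)
    grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
    grad_c = (d_real - 1.0) + d_fake
    return w - lr * grad_w, c - lr * grad_c


def generator_step(a, b, w, c, z, lr):
    """One gradient step on -log D(G(z)); no original data among the arguments."""
    d_fake = discriminate(a * z + b, w, c)
    grad_a = (d_fake - 1.0) * w * z
    grad_b = (d_fake - 1.0) * w
    return a - lr * grad_a, b - lr * grad_b


rng = random.Random(0)
a, b = 1.0, 0.0  # Generator parameters
w, c = 0.0, 0.0  # Discriminator parameters

for _ in range(2000):
    x_real = rng.gauss(4.0, 0.5)  # "original" data, seen only by the Discriminator
    z = rng.gauss(0.0, 1.0)       # white noise fed to the Generator
    w, c = discriminator_step(w, c, x_real, a * z + b, lr=0.05)
    a, b = generator_step(a, b, w, c, rng.gauss(0.0, 1.0), lr=0.05)

print(f"Generator offset b after training: {b:.2f}")
```

Note that `generator_step` receives only noise and the Discriminator's current parameters. The original samples enter the loop exclusively through `discriminator_step`.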



From White Noise to Images


So, let’s see how this works in practice, i.e. how fake images are generated from pure rubble/noise.


This example is executed in R with Keras (with TensorFlow as backend). Please see the references below for my own sources and inspiration and for much deeper insight.


The Dataset


As a showcase, we take a dataset consisting of handwritten digits “7”. Here is an excerpt of the dataset:



Here are the handwritten digits in data-object terms:


In other words, the original dataset we train our GAN with has 6,265 images of handwritten digits, each with a size of 28 x 28 pixels and 1 channel (as these are grey-scale images). One image in data format looks as follows:



This represents the original dataset. Let’s now assume the task is to generate fake images of the digit “7” which resemble the original ones very closely.
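The four dimensions of such a dataset (images, rows, columns, channels) can be pictured with nested lists in plain Python. This is only a sketch of the layout, not the article's R data object: the helper names are hypothetical, the images are blank stand-ins, and only 5 images are built instead of 6,265 to keep the example light.

```python
def make_blank_images(n_images, height=28, width=28, channels=1):
    """Build n_images all-black images as nested lists [image][row][column][channel]."""
    return [
        [[[0.0] * channels for _ in range(width)] for _ in range(height)]
        for _ in range(n_images)
    ]


def shape(nested):
    """Return the dimensions of a regular, non-empty nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


images = make_blank_images(5)  # 5 stand-in images instead of the article's 6,265
print(shape(images))           # (5, 28, 28, 1)
print(images[0][13][7])        # grey value of one pixel: image 0, row 13, column 7
```

With the full dataset, the same layout would have 6,265 as its first dimension.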



The Generator


The Generator is built up as follows:



And we have the following layer/ parameter structure:




The Discriminator


The Discriminator is built up in the following way:



Resulting in the following layer/ parameter structure:




The GAN Net


Now, the Generator and the Discriminator have to be put together within the Generative Adversarial Network:


In this context, it is important to freeze the weights of the Discriminator.


The reason is that, while training the GAN, the Generator weights have to be updated in such a way that the Discriminator becomes more likely to classify fake images as original ones. To obtain this feedback, the fake images are passed through the GAN labelled as originals. If the weights of the Discriminator were not frozen in this step, the Discriminator itself would be trained to classify the generated fake images as original, which is not the intended result. The Discriminator is still trained, but separately, on original and fake images as described above.
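The effect of freezing can be shown with a toy calculation in plain Python (a standalone sketch, not the article's R/Keras code). The assumptions: the Generator is reduced to a single trainable number that it outputs directly, the Discriminator is D(x) = sigmoid(w * x + c), and the function name `gan_step` and the `freeze_discriminator` flag are hypothetical.

```python
import math


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def gan_step(gen_bias, w, c, lr, freeze_discriminator=True):
    """One step of the combined model on a fake sample labelled as original.

    The Generator outputs x_fake = gen_bias (its only trainable parameter);
    the Discriminator is D(x) = sigmoid(w * x + c).
    """
    x_fake = gen_bias
    d = sigmoid(w * x_fake + c)
    # Loss is -log D(x_fake): the fake is labelled as original on purpose.
    grad_logit = d - 1.0
    new_gen_bias = gen_bias - lr * grad_logit * w
    if freeze_discriminator:
        return new_gen_bias, w, c
    # Unfrozen: the same loss also pulls the Discriminator towards "original".
    return new_gen_bias, w - lr * grad_logit * x_fake, c - lr * grad_logit


frozen = gan_step(gen_bias=1.0, w=1.0, c=-2.0, lr=0.5, freeze_discriminator=True)
unfrozen = gan_step(gen_bias=1.0, w=1.0, c=-2.0, lr=0.5, freeze_discriminator=False)

print("frozen:  ", frozen)
print("unfrozen:", unfrozen)
# Score of the unchanged fake x = 1.0 before and after the unfrozen step:
print(sigmoid(1.0 * 1.0 - 2.0), sigmoid(unfrozen[1] * 1.0 + unfrozen[2]))
```

In the frozen case only the Generator moves. In the unfrozen case the Discriminator's own parameters shift so that it rates the very same fake more favourably, which is exactly the shortcut the freezing is meant to prevent.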



Training the Model


After some preparation (e.g. setting the batch size, fixing the number of iterations, normalising the data values and similar), the GAN model is ready to be trained.
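As a rough idea of what such preparation can look like, here is a plain-Python sketch (not the article's R code). The batch size, the scaling of 8-bit grey values to the range [0, 1] and the helper names are assumptions; only the 100 iterations are taken from the article.

```python
import random

BATCH_SIZE = 4    # hypothetical value; the article does not state its batch size
ITERATIONS = 100  # the article reports training for 100 iterations


def normalise(image):
    """Scale 8-bit grey values (0..255) into the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def sample_batch(dataset, batch_size, rng):
    """Draw batch_size distinct images at random from the dataset."""
    return rng.sample(dataset, batch_size)


rng = random.Random(42)
# Tiny stand-in dataset: ten 2 x 2 "images" with random 8-bit grey values.
raw = [[[rng.randint(0, 255) for _ in range(2)] for _ in range(2)] for _ in range(10)]
dataset = [normalise(image) for image in raw]

batch = sample_batch(dataset, BATCH_SIZE, rng)
pixels = [p for image in batch for row in image for p in row]
print(len(batch), min(pixels) >= 0.0, max(pixels) <= 1.0)
```

Scaling the inputs to a small, fixed range is a common step before training neural networks; the exact range used in the article's own code is not shown in the text.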



As you can see, the whole procedure starts with randomly generated points in the latent space. This is the colour rubble (in this case, greyscale rubble) we talked about in the beginning. Statistically it is called “White Noise”, which you could see in the graph above and which is drawn from a Normal Distribution.
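Drawing such random points can be sketched in a few lines of plain Python (a standalone illustration, not the article's R code). The latent dimension, the batch size and the function name are hypothetical; the only thing taken from the text is that the values come from a Normal Distribution.

```python
import random

LATENT_DIM = 28  # hypothetical size of one latent vector
BATCH_SIZE = 50  # hypothetical number of latent vectors per training step


def sample_latent_points(batch_size, latent_dim, rng):
    """Draw batch_size latent vectors with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(batch_size)]


rng = random.Random(123)
latent = sample_latent_points(BATCH_SIZE, LATENT_DIM, rng)

values = [v for vector in latent for v in vector]
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

print(len(latent), len(latent[0]))
print(f"mean={mean:.2f}, variance={variance:.2f}")
```

Each of these vectors is one starting point from which the Generator builds one fake image. The sample mean and variance should come out close to 0 and 1, respectively.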




The Results


The task was to generate digits “7” which should be similar to those in the original dataset. The model was trained for 100 iterations. So, let’s see how this worked out:



As you can see, the training process literally starts with white noise but improves quite fast. After just 100 iterations, we already get quite usable results for the digit “7”.



Conclusion


The great thing about synthetic data is that it is possible to generate data which are quite similar to the original data by probabilistically approximating their properties. At the same time, it is highly improbable that the original data can be retraced via the synthetic data.


Why? Because we have techniques that start out with a bunch of colours without ever getting close to the original data. We showed this in the short example above.


Of course, the model shown here is a rather simple one. And generating synthetic digits which resemble handwritten digits in greyscale images is not a particularly difficult task either.


That said, a lot of progress has been made in this area. The topic of synthetic data impacts a lot of high-tech industries, e.g. synthetic data generation for autonomous driving, synthetic data helping to detect credit card fraud or synthetic data helping to train image classifiers in the healthcare industry.



In this respect, the available models have also kept pace with the development. Transformation Learning has helped to push the field forward and is an immense support when working with synthetic data.


But this is another story.



References


Deep Learning mit R und Keras by François Chollet and J.J. Allaire, 2018


Generative Adversarial Networks (GANs) with R, YouTube series by Dr. Bharatendra Rai, February 23, 2020


Synthetic data in machine learning for medicine and healthcare by Richard J. Chen et al., published in Nature Biomedical Engineering, June 15, 2021


Breast Cancer Classification from Ultrasound Images Using Probability-Based Optimal Deep Learning Feature Fusion by Kiran Jabeen et al., published in Sensors (MDPI), January 21, 2022


Photo Credits in order of appearance:


- Mona Lisa: WikiImages via pixabay.com

- Colour: Bru-nO via pixabay.com

- Fake Mona Lisa: OpenClipart-Vectors via pixabay.com

- Synthetic Medical Images: Nature Biomedical Engineering

- Rest of images and code snippets were created by the author