
Synthetic Data: A Very Brief Intro to a Very Exciting Area of Data Science

April 14th, 2022
You’ve probably heard of deepfakes in the last few years. If not, which side of the internet are you on?

Sites like thispersondoesnotexist.com generate what we call synthetic data: in the case of this site, images of people who don’t exist. The images are produced by the neural network behind the website, which was trained on real images to create fake ones.

More generally, synthetic data is “any production data applicable to a given situation that is not obtained by direct measurement”, according to the McGraw-Hill Dictionary of Scientific and Technical Terms.

A Very Brief History of Synthetic Data Generation

Research into data synthesis dates back to the 1930s, when the first work on synthesizing audio and voice appeared. The rise of digitization in the 1970s paved the way for software synthesizers.

The first application of synthetic data generation to protect the privacy of the original data dates to 1993 and Donald Rubin, an emeritus professor of statistics at Harvard. He proposed using algorithms to create a fully synthetic version of the Decennial Census, anonymizing the original data and protecting people’s privacy while keeping the statistics of the original dataset intact.

During the 90s and early 2000s, the techniques used to generate synthetic data diversified, with methods such as the Bayesian bootstrap, parametric posterior predictive distributions, and Sequential Regression Multivariate Imputation.

A significant jump in the usage and quality of synthetic data came in the 2010s, as neural networks were increasingly used for synthetic data generation and new architectures emerged, with Generative Adversarial Networks (presented in this paper in 2014) rising to popularity.

Generative Adversarial Networks (GANs) are an interesting concept. Put simply, we have two networks, a generator and a discriminator, trained together on the original data. The generator acts as an art forger: it starts by generating random pieces of data that do not resemble the original, but with time and learning it gets better. The discriminator is like the police: its function is to distinguish real data from data produced by the generator. Over time, our art forger (generator) and our police (discriminator) both become more effective, until the pieces of data that fool the discriminator are almost indistinguishable from the original.
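To make the forger-and-police loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the "networks" are reduced to two parameters each, the data is one-dimensional, and the names (train_toy_gan, sample_fakes) are made up. Because this toy discriminator is only a logistic regression on x, it can judge where the data is centred but not its spread, so the sketch shows the shape of the training loop rather than a usable generator.

```python
import math
import random


def sigmoid(s):
    """Numerically safe logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_data, steps=2000, batch=32, lr=0.02, seed=0):
    """Train a one-dimensional toy GAN and return the generator parameters.

    Generator (the forger):     x_fake = a * z + b, with z ~ N(0, 1)
    Discriminator (the police): D(x) = sigmoid(w * x + c), the estimated
                                probability that x is real
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters
    for _ in range(steps):
        reals = [rng.choice(real_data) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fakes = [a * z + b for z in noise]

        # Discriminator step: push D(real) towards 1 and D(fake) towards 0
        # by descending the binary cross-entropy loss.
        d_real = [sigmoid(w * x + c) for x in reals]
        d_fake = [sigmoid(w * x + c) for x in fakes]
        grad_w = (sum((d - 1.0) * x for d, x in zip(d_real, reals))
                  + sum(d * x for d, x in zip(d_fake, fakes))) / batch
        grad_c = (sum(d - 1.0 for d in d_real) + sum(d_fake)) / batch
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: push D(fake) towards 1 (non-saturating loss),
        # using the freshly updated discriminator.
        d_fake = [sigmoid(w * x + c) for x in fakes]
        grad_a = sum((d - 1.0) * w * z for d, z in zip(d_fake, noise)) / batch
        grad_b = sum((d - 1.0) * w for d in d_fake) / batch
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def sample_fakes(generator, n, seed=1):
    """Draw n synthetic values from a trained generator."""
    a, b = generator
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]
```

Calling train_toy_gan on a list of real numbers and passing the result to sample_fakes yields new values drawn from the learned generator. Real GANs replace both players with deep networks, which is what lets them capture the full shape of the data.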


GANs, because of their straightforward implementation, are amongst the most used methodologies for synthetic data generation, from videos to images and even simple tabular data.

Nowadays, the ways of creating synthetic data have diversified, with the aforementioned GANs and LSTM neural networks as the top players. From tabular data to images, sounds, and videos, synthetic data is here to stay.

Usages of Synthetic Data 

Synthetic data has three main uses:

  • Privacy -  when the privacy of the original dataset is paramount and yet data is needed to keep the models in production, generating an artificial version of the dataset to place it into production is one of the most effective ways to keep the original data safe and private;

  • Data augmentation - when the amount of data available is too small to test a hypothesis, and that test is needed to decide on further data acquisition (e.g. testing a hypothesis on a small cohort of patients to decide if it’s worth increasing the cohort size), synthetic data can come in handy: properly applied, synthetic data generation methods can augment the dataset and allow the hypothesis testing to move forward. Areas where data acquisition is costly and slow, such as medicine, can especially benefit from this.

  • Costly training data - some areas of artificial intelligence, such as self-driving cars, require large amounts of training data to train their algorithms. However, generating such training sets with real-life data is costly. Synthetic data generation can keep the costs in check during the training and development stages of such algorithms.

Challenges and Benefits of Synthetic Data

Being able to mimic real-life data has its challenges and benefits. While it seems it can be limitless in generating scenarios for testing and development, it’s important for us to remember that any synthetic models deriving from data can only replicate specific properties of said data, meaning they will ultimately only be able to simulate general trends. 

But that doesn’t leave synthetic data without its benefits. It allows us to:

  • Overcome real data usage restrictions: Real data such as patient and medical data has usage constraints due to privacy rules and regulations. Synthetic data replicates the important statistics of real data without exposing it, thus eliminating this issue;

  • Create data to simulate situations that haven’t been encountered yet - when real data does not exist, synthetic data is the solution;

  • Avoid some statistical problems - such as item nonresponse, skip patterns, and other logical constraints;

  • Focus on relationships: Synthetic data preserves the multivariate relationships between variables instead of specific statistics alone. 

Sounds like a perfect way to generate datasets, right? Not quite: it comes with challenges too. These are just a few:

  • Outliers may be missing - Synthetic data mimics the real-world data that it has learned from, but it is not an exact replica of it. So, some outliers present in the original data can be missing from the synthetic data.

  • The quality of the synthetic data is directly dependent on the original data it learns from - if the original data is incomplete or has biases, the synthetic data will reflect the same.

  • User acceptance can be challenging - as with all emerging concepts, it will have to overcome user biases and fears to be accepted as valid and usable data.

  • It requires time, effort, and a lot of work to generate - while it costs less than collecting real-life data in many industries, it is not free, and producing it with quality requires specialized teams;

  • Output control is required - close monitoring of the output and regular comparison with real-life data are mandatory to ensure the quality and reliability of the synthetic data.
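The last two points that can be checked mechanically, missing outliers and output control, lend themselves to a very simple first test. The sketch below is one hypothetical way to do it for a single numeric column, using only the standard library; the function name and the choice of checks are assumptions made for illustration, not an established procedure.

```python
import statistics


def compare_column(real, synthetic):
    """Summarise how a synthetic numeric column differs from the real one."""
    lo, hi = min(real), max(real)
    inside = sum(lo <= v <= hi for v in synthetic)
    return {
        # Gaps in basic statistics: large values suggest a poor synthesizer.
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "std_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
        # False here means the synthetic column never reaches a real
        # extreme, a hint that outliers were lost.
        "real_min_covered": min(synthetic) <= lo,
        "real_max_covered": max(synthetic) >= hi,
        # Share of synthetic values that fall inside the real range.
        "share_inside_real_range": inside / len(synthetic),
    }
```

Running a check like this for every column each time the synthetic dataset is regenerated gives an early warning before the data reaches anyone downstream.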

A Simple Example of Synthetic Data Generation with Tabular Data

As you can see synthetic data is on the rise and its importance is paramount. So let’s check a simple example of how you can learn to create a synthetic dataset and use this technique in your daily life as a data person. We’re gonna see an extremely simple example with tabular data as it is the most commonly used data in data science projects within companies. 

We’re gonna go through a Conditional Tabular GAN (ctGAN) example and show you how you can create synthetic tabular data. We’re gonna use a default ctGAN structure for this example but you can definitely tune the parameters of this neural network. 

For this example, we’re gonna use the Pima Indians Diabetes Dataset from Kaggle. This dataset portrays a common case where data privacy is paramount. It gathers health information on a group of women of Pima Indian heritage and is used to predict their diabetes risk.

Watch the video or carefully read the instructions below to create your own synthetic dataset. Good luck! 💪

For this exercise, we’ll need to install the package CTGAN, the package SDV, and the package Table Evaluator. To make sure the table evaluator works properly, a specific version of seaborn, 0.11.1, should be installed using python -m pip install seaborn==0.11.1

First, let’s import all necessary packages.

Then we’ll need to import the data.

At this stage, it’s important to validate the dataset and check whether it has missing data, as CTGAN fails if the dataset is not clean.
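A minimal sketch of that check with pandas could look like the following. The helper name missing_report and the toy values are made up for illustration, and the column names (Glucose, BMI, Outcome) are assumed to match the Kaggle file. Some datasets also encode "not measured" as 0 rather than leaving the cell empty; whether that applies to your copy of the data is something to verify, so the zero handling here is optional.

```python
import pandas as pd


def missing_report(df, zero_as_missing=()):
    """Count missing values per column and keep only affected columns.

    Columns listed in zero_as_missing also count zeros as missing, for
    datasets that encode "not measured" as 0.
    """
    counts = df.isna().sum()
    for col in zero_as_missing:
        counts.loc[col] += int((df[col] == 0).sum())
    return counts[counts > 0]


# Toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Glucose": [148, 85, None, 89],
    "BMI": [33.6, 0.0, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})
report = missing_report(toy, zero_as_missing=["BMI"])

# Simplest possible cleaning step: drop the rows with empty cells.
clean = toy.dropna()
```

Dropping rows is the bluntest fix; on a small dataset, imputing the missing values may be a better choice.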

Finally, we can start the CTGAN implementation. First, we’ll need to declare which are the columns with discrete variables.

And then configure and execute the synthesizer. You can control the number of epochs, the batch_size, and the dimensions of the generator and discriminator. Set verbose to True if you want to follow the training of the CTGAN.

After some time (CTGAN takes quite a bit of time to train), you can generate synthetic data as simple as below.
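The three steps just described (declaring the discrete columns, configuring and fitting the synthesizer, and sampling) can be sketched as a single function. This is a hedged sketch, not the original code of this post: it assumes a ctgan release that exports the class as CTGAN (older releases named it CTGANSynthesizer), the wrapper name fit_and_sample is made up, and the layer sizes shown are simply the library defaults written out so you can see where to tune them.

```python
def fit_and_sample(real_df, discrete_columns, n_rows, epochs=300, batch_size=500):
    """Fit a CTGAN on real_df and return n_rows synthetic rows.

    The import is kept inside the function so this file still loads
    where the ctgan package is not installed.
    """
    # Assumption: a ctgan release that exports `CTGAN`.
    from ctgan import CTGAN

    synthesizer = CTGAN(
        epochs=epochs,                 # passes over the training data
        batch_size=batch_size,         # rows per training step
        generator_dim=(256, 256),      # hidden layer sizes of the generator
        discriminator_dim=(256, 256),  # hidden layer sizes of the discriminator
        verbose=True,                  # print the losses while training
    )
    # discrete_columns lists the columns CTGAN should treat as categorical.
    synthesizer.fit(real_df, discrete_columns)
    return synthesizer.sample(n_rows)
```

For the diabetes data, a call could look like fit_and_sample(data, ["Outcome"], n_rows=len(data)), assuming the DataFrame is called data and Outcome is the only discrete column.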

The final step is evaluating this synthetic dataset. Two tools are useful here: table_evaluator and sdv.evaluate. TableEvaluator is a library to evaluate how similar a synthesized dataset is to real data. In other words, it tries to give an indication of how real your fake data is.

As you can see in the PCA below, for example, the fake data has grasped the overall trends of the real data and the distributions are fairly similar. 

In order to obtain an objective metric of comparison between real data and synthetic data, we can use the SDV evaluate function to analyze the similarity between the real and fake datasets. This function returns an aggregated result of all of the similarity metrics, from 0 to 1, where 0 is the worst and 1 is ideal. As you can see below, the results, while not awful, still leave some room for improvement in our CTGAN, which could be gained by tuning the parameters of the synthesizer.
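To get a feel for what a similarity score between 0 and 1 can mean, here is a small standard-library sketch of one common ingredient: one minus the two-sample Kolmogorov-Smirnov statistic, computed per numeric column and then averaged. It is an illustration written for this post, not SDV's implementation, and the function names are made up.

```python
import bisect


def ks_similarity(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    1.0 means the two empirical distributions coincide; 0.0 means they
    do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    largest_gap = 0.0
    for value in set(real_sorted) | set(synth_sorted):
        # Empirical CDFs: share of each sample that is <= value.
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return 1.0 - largest_gap


def overall_similarity(real_columns, synthetic_columns):
    """Average the per-column scores into one number between 0 and 1."""
    scores = [ks_similarity(real_columns[name], synthetic_columns[name])
              for name in real_columns]
    return sum(scores) / len(scores)
```

Here real_columns and synthetic_columns are plain dictionaries mapping a column name to a list of numbers. A score like this only looks at one column at a time, so it says nothing about relationships between columns, which is why aggregated evaluations combine several kinds of metrics.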

You can check the full code for this project here.

I hope you enjoyed this mini intro to this very exciting area of Data Science!

Would you like to know more about the subject? Here's an article about what is Data Science and its current challenge.

Till the next voyage in this galaxy of data! 👋
