5 Questions To Get Started with Synthetic Data Generation — Innodata

5 Questions to Ask Before Getting Started with Synthetic Data Generation

Accelerating AI with Synthetic Data

There is no question that we are in the midst of a technological revolution driven by intelligent decisioning. These decisions, however, are only as accurate and helpful as the data they analyze to continually learn and identify patterns. While there is a glut of some types of data, other types, such as financial documents and medical records, are difficult or even illegal for third parties to obtain and use for training purposes. Even Facebook, famous for its wealth of data, recently acquired synthetic data company AI.Reverie to augment its training capabilities. Gartner predicts that “by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.”

What is Synthetic Data?

Synthetic data is generated using machine learning algorithms that ingest real data, train on its patterns of behavior, and then generate entirely artificial data that retains the statistical characteristics of the original dataset. In other words, synthetic data is manufactured to mirror situations in the real world. This differs from traditional anonymized datasets, which remain vulnerable to re-identification techniques. Since synthetic data is artificial, it is not subject to the same vulnerabilities.
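The idea can be illustrated with a minimal pure-Python sketch (a toy generator, not a production technique): fit simple statistics to a real sample, then emit entirely new values with the same distributional shape.

```python
import random
import statistics

def fit_and_generate(real_values, n):
    """Learn mean/std-dev from real data, then sample entirely new values
    from a normal distribution with those statistics (a toy 'generator')."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
real = [random.gauss(100.0, 15.0) for _ in range(5000)]  # stand-in for real data
synthetic = fit_and_generate(real, 5000)

# No synthetic value is copied from the real set, yet the summary
# statistics of the two samples closely agree.
print(round(statistics.mean(real), 1), round(statistics.mean(synthetic), 1))
```

Real generative models capture far richer structure than a single mean and standard deviation, but the principle is the same: learn from the real data, then sample fresh records from the learned model.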

Due to the privacy-preserving and artificial nature of synthetic data, it generally falls outside the scope of data protection laws. This data can be confidently used for analyses and modeling, knowing that it will behave in the same manner as the real data. This simultaneously protects customer privacy and mitigates risk for the companies that leverage it — all while unblocking data that is otherwise frozen behind compliance barriers.

What Are the Advantages of Synthetic Data?

  1. Privacy. Synthetic data allows companies to build software without exposing personally identifiable information (PII) and/or personal health information (PHI).  
  2. Data retention. New regulations and privacy protection laws limit the retention period of personal data. Since synthetic data is not real, there is no limit on how long it may be retained or for what purpose it may be used. Therefore, it can be held for future analyses that were not feasible at the time of generation.
  3. Monetization of data. With compliance and security no longer an issue, generated synthetic data can be used to create a new revenue stream. Because the data is synthetic, it can also be enriched and produced at scale for even more potential revenue.
  4. Expansion of niche datasets. For instances where the data types needed are rare, the real dataset may have to be supplemented with synthetic data. For example, in most countries there is no standardization for transactional documentation like invoices, bank statements, and utility bills. Each company differentiates their documents by putting in their idiosyncratic design flair through color, font, format, stamps, signatures, seals, etc. For a model to make accurate predictions about the data, it needs an adequate volume and diversity of examples on which to train.  
  5. Augmentation of datasets with gaps or other issues. A dataset may be disproportionate or incomplete, necessitating supplemental data. Let’s say you need data from a specific time period, but examples are limited. Synthetic data could be created to even out representation from that period.
  6. Creation of new data for model retraining. When a model has already been trained but is not predicting accurately, new training data should be introduced that addresses the limitations in the predictions. For example, if your model is not predicting addresses correctly, you need a synthetic dataset that focuses on addresses of different types, in different formats, in different layouts, and so on.
  7. Correction for bias. Sometimes the natural distribution of data carries bias that we do not want to further proliferate. In these cases, synthetic data can supplement the set to even out representation. For example, in the past homeowners were predominantly male, so an algorithm might assume that a male loan applicant may be a better candidate than a female candidate. To help mitigate this bias, synthetic applications of strong female candidates may need to be added to even out the training set. 
  8. Simulation of unforeseeable events. Real data is always retrospective, based on events that have already lapsed, which makes predictions about unique future events nearly impossible. Conditional synthetic data generation makes simulations of unforeseen events possible by adding conditions to the data generator so that it outputs a synthetic dataset representative of events that have never occurred.
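Several of the advantages above come down to topping up under-represented slices of a dataset. As a rough illustration (all field names and values hypothetical), the Python sketch below evens out representation of a categorical field by adding synthetic records, in the spirit of the bias-correction example:

```python
import random
from collections import Counter

def balance_with_synthetic(records, field, make_synthetic):
    """Even out representation of a categorical field by adding synthetic
    records for under-represented values (toy sketch, not a full method)."""
    counts = Counter(r[field] for r in records)
    target = max(counts.values())
    out = list(records)
    for value, n in counts.items():
        out.extend(make_synthetic(value) for _ in range(target - n))
    return out

random.seed(1)
# Hypothetical loan-application data skewed toward one group
apps = [{"gender": "M", "amount": random.randint(5, 50) * 1000} for _ in range(80)]
apps += [{"gender": "F", "amount": random.randint(5, 50) * 1000} for _ in range(20)]

balanced = balance_with_synthetic(
    apps, "gender",
    lambda g: {"gender": g, "amount": random.randint(5, 50) * 1000},
)
# Both groups are now equally represented in the training set.
print(Counter(r["gender"] for r in balanced))
```

In practice the synthetic records would come from a trained generative model rather than a simple random draw, so that the added examples stay realistic across all fields, not just the balanced one.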

What Are Some Challenges of Synthetic Data Generation?

While there are some amazing advantages of using synthetic data, there are a few challenges:  

  1. Maintaining quality. While quality is always a central concern of training data, it is of particular importance with synthetic data, because the primary purpose of making it is to improve model predictions. High-quality synthetic data captures the same underlying structure and statistical distributions as the data on which it is based. When done well, synthetic data should be indistinguishable from real data.  
  2. Avoiding homogeneity. Variety and diversity are essential to successful model training, because the training data should emulate real-world data. For example, in the real world if your model is tasked with deciphering expense receipts, some of those might be crumpled, have a coffee ring on them, or be photographed in low light. Real-world data would have this variety and it is the job of those creating synthetic data to account for and produce the same type of diverse data the model might encounter.  
  3. Approximating randomness and avoiding erroneous patterns. Neither humans nor classical computers are very good at generating true randomness; classical machines produce pseudorandomness, and over time patterns can emerge that do not reflect the genuinely random nature of the world. So, until sources of true randomness such as quantum hardware are more widely available, truly random synthetic data will be nearly impossible to simulate. However, there are methods of approximating randomness that can be employed, such as using conditional generative models. The trick is to introduce the right kind of randomness for training.
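The quality concern above can be made concrete: one common sanity check is to compare the empirical distributions of real and synthetic samples. Below is a rough pure-Python sketch of a two-sample Kolmogorov–Smirnov-style statistic (libraries such as SciPy provide proper implementations with significance tests):

```python
import bisect
import random

def ks_statistic(a, b):
    """Crude two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs (pure Python, for illustration)."""
    a, b = sorted(a), sorted(b)
    def cdf(xs, v):
        # fraction of values in xs that are <= v (xs is sorted)
        return bisect.bisect_right(xs, v) / len(xs)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in sorted(set(a) | set(b)))

random.seed(2)
real = [random.gauss(0, 1) for _ in range(2000)]
good_synth = [random.gauss(0, 1) for _ in range(2000)]  # matches the real distribution
bad_synth = [random.gauss(3, 1) for _ in range(2000)]   # a drifted generator

print(ks_statistic(real, good_synth))  # small gap: distributions agree
print(ks_statistic(real, bad_synth))   # large gap: a quality problem
```

A small statistic means the synthetic sample is hard to distinguish from the real one along that dimension; checks like this are typically run per column and on joint properties as well.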

How Is Synthetic Data Generated?

Synthetic documents can be created manually or through generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models.

  • Text can be generated by building and training large language models like OpenAI’s GPT-3.
  • Images can be artificially generated using a Generative Adversarial Network (GAN) such as NVIDIA’s StyleGAN2, or tools like Flip, which is useful for increasing the size and diversity of a dataset.
  • Single table, multi-table, and time-series datasets stored as CSVs can be artificially generated using a Python framework like SDGym.
  • Privacy-preserving data can be artificially generated by mirroring the statistical properties of the original data.
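As a simplified stand-in for what tabular frameworks automate, the toy Python sketch below (column names hypothetical) fits independent per-column models and writes a synthetic single-table CSV. Real synthesizers, such as the models benchmarked by SDGym, also preserve correlations between columns:

```python
import csv
import io
import random
import statistics

def generate_table(rows, n, rng):
    """Toy single-table generator: numeric columns are resampled from a
    fitted normal distribution, categorical columns from their observed
    values. Columns are treated independently, unlike real synthesizers."""
    models = {}
    for col in rows[0]:
        vals = [r[col] for r in rows]
        if isinstance(vals[0], (int, float)):
            models[col] = ("num", statistics.mean(vals), statistics.stdev(vals))
        else:
            models[col] = ("cat", vals)
    out = []
    for _ in range(n):
        rec = {}
        for col, m in models.items():
            rec[col] = round(rng.gauss(m[1], m[2]), 2) if m[0] == "num" else rng.choice(m[1])
        out.append(rec)
    return out

rng = random.Random(3)
real_rows = [{"country": rng.choice(["US", "DE", "JP"]),
              "balance": rng.gauss(2500, 400)} for _ in range(500)]
synthetic_rows = generate_table(real_rows, 500, rng)

# Serialize the synthetic table as CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["country", "balance"])
writer.writeheader()
writer.writerows(synthetic_rows)
```

The independence assumption is exactly what production frameworks improve on: a real model would also learn, for example, that balances differ systematically by country.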

What Does a Synthetic Data Workflow Look Like?

One common type of synthetic data is documents. Say you want to help people improve their personal finances by developing a model that can accurately categorize their spending habits. You would need a robust dataset of bank statements, which are difficult to obtain because of privacy restrictions. Below is an outline of the steps followed in creating synthetic samples of bank statements on which you could train your model.

Synthetic Document Creation Workflow

Creating synthetic documents is a multi-step process in which real data is sourced and curated by a team of domain experts. They then design templates based on that found data. Next, they create databases of randomized information, like addresses, names, dates, etc., that fit the required fields needed to populate the template. The information in the databases is then programmatically mapped onto the templates to create the synthetic data.   
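The mapping step can be sketched in a few lines of Python: small databases of randomized field values are substituted into a document template (the names, values, and template here are all hypothetical):

```python
import random
from string import Template

# Hypothetical field databases curated by domain experts
NAMES = ["Ana Silva", "John Doe", "Wei Chen"]
CITIES = ["Austin, TX", "Newark, NJ", "Tampa, FL"]

# Hypothetical bank-statement template ($$ is an escaped literal dollar sign)
STATEMENT = Template(
    "FIRST NATIONAL BANK\n"
    "Account holder: $name\n"
    "Branch: $city\n"
    "Closing balance: $$${balance}\n"
)

def make_statement(rng):
    """Map randomized field values onto the document template."""
    return STATEMENT.substitute(
        name=rng.choice(NAMES),
        city=rng.choice(CITIES),
        balance=f"{rng.uniform(100, 9000):,.2f}",
    )

rng = random.Random(4)
docs = [make_statement(rng) for _ in range(3)]
print(docs[0])
```

A production pipeline would draw from much larger field databases and many layout templates, but the shape is the same: curated values in, programmatically populated documents out.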

The main advantage of the manual approach is that a human can determine specific limitations and needs of the model and design the data accordingly to improve training, whereas automated approaches tend to create more random results. Once criteria and parameters are created, humans continually review the generated documents to ensure they meet the criteria. This ensures that model accuracy actually increases rather than further propagating the issues that already existed or even creating new issues.   

Ensuring Variability When Generating Synthetic Documents

One of the challenges of generating synthetic data is creating enough variety to both mimic the desired situation and provide unique instances. There are four general parameters that should be followed:   

  1. Each synthetic document is based on an actual document or document template. Much like a “digital twin,” a structured synthetic substitute for something in the physical world, synthetic data must mirror real life. Having actual documents ensures the new data reflects the types of data the model will encounter.
  2. Each synthetic document has different combinations of fields or keys. For example, if a model is being trained to understand bank statements from different countries, then some should have an IBAN, while examples from the United States will not have this field.
  3. Each synthetic document has unique values for the provided fields or keys. If a model is being trained on credit transactions, for instance, then the purchase amounts should vary to reflect different prices for different types of items.
  4. No more than 10 synthetic documents follow the same document layout. It is important to limit the number of documents that follow the same layout to ensure variability. If there are hundreds of examples with one layout type, the model will become excellent at identifying that one type while performing poorly on any other.
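Rule 4 is straightforward to enforce programmatically. The following Python sketch (layout names hypothetical) assigns layouts to synthetic documents while capping reuse of any single layout:

```python
import random
from collections import Counter

MAX_PER_LAYOUT = 10  # the variability rule above

def assign_layouts(n_docs, layouts, rng):
    """Pick a layout for each synthetic document while capping how many
    documents may share any single layout."""
    counts = Counter()
    plan = []
    for _ in range(n_docs):
        available = [l for l in layouts if counts[l] < MAX_PER_LAYOUT]
        if not available:
            raise ValueError("add more layout templates to keep the set varied")
        choice = rng.choice(available)
        counts[choice] += 1
        plan.append(choice)
    return plan

rng = random.Random(5)
layouts = [f"layout_{i}" for i in range(8)]  # hypothetical layout templates
plan = assign_layouts(60, layouts, rng)
```

Raising an error when every layout is saturated forces the team to design new templates rather than silently over-reusing existing ones, which keeps the cap honest.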

The Future of Synthetic Data

Synthetic data will no doubt play a more central role in AI model training in the future, as the industry pivots from big data to smart data, privacy policies become more stringent, and dark data imposes more risk. No matter the use case, accuracy, diversity, and variability are crucial to successful AI training. That’s why a personal, consultative approach to synthetic data design and execution, one focused on understanding the model’s needs and offering flexible, iterative solutions, is instrumental to synthetic data generation. For other ways to accelerate AI, check out our customized data annotation solutions and SaaS data annotation platforms.

Accelerate AI with High-Quality Data

Check Out this Article on The Latest Framework for Building Datasets & Driving AI Accuracy


(NASDAQ: INOD) Innodata is a global data engineering company delivering the promise of AI to many of the world’s most prestigious companies. We provide AI-enabled software platforms and managed services for AI data collection/annotation, AI digital transformation, and industry-specific business processes. Our low-code Innodata AI technology platform is at the core of our offerings. In every relationship, we honor our 30+ year legacy delivering the highest quality data and outstanding service to our customers.