Unlike original data, synthetic data does not originate from real sources but is algorithmically produced (synthesized) to mimic the characteristics of real data. The process typically involves building physics- or statistics-based models from patterns found in the original dataset. Synthetic data was first used in 1993 to create a subset of anonymized data for the U.S. census. While fabricating nonreal census records containing personal information like names, addresses, telephone numbers, social security numbers, and credit card numbers is relatively straightforward, synthetic data has come a long way, and more complex data like images and factory floor data can now be synthesized as well.
Synthetic data is not binary but a spectrum
The process of creating synthetic data by mimicking features of the original dataset can range from as simple as adding entirely random (but sensible) data to the dataset, to changing some features of real data, to as complex as creating new data entirely from scratch. As an example, Gaist Holdings takes pictures of road surfaces in good condition and digitally adds surface cracks to create thousands of synthetic images of poor road surfaces to train its algorithm. On the other hand, Anyverse uses its customizable camera sensor model to generate synthetic images from scratch and without building upon real images.
Fig. 1: Spectrum of synthetic data with examples
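The two ends of this spectrum can be illustrated with a minimal Python sketch. All values and distributions below are invented for illustration: light augmentation perturbs real samples (as in the road-crack example), while fully synthetic generation samples from a statistical model fitted to the real data (as in the from-scratch example).

```python
import random

random.seed(0)

# A toy "real" measurement, e.g., pixel intensities from one row of a
# road-surface image. (Illustrative values only, not from a real dataset.)
real_sample = [0.62, 0.64, 0.61, 0.63, 0.65]

# Low end of the spectrum: perturb real data with small random noise,
# analogous to digitally adding cracks to photos of intact road surfaces.
augmented = [x + random.gauss(0, 0.02) for x in real_sample]

# High end of the spectrum: generate data entirely from a model fitted to
# the real sample (here, a normal distribution on its mean and stdev).
mean = sum(real_sample) / len(real_sample)
var = sum((x - mean) ** 2 for x in real_sample) / (len(real_sample) - 1)
fully_synthetic = [random.gauss(mean, var ** 0.5) for _ in range(5)]
```

The augmented values stay close to the originals, while the fully synthetic values share only the fitted distribution with them; real generators (e.g., sensor simulations or GANs) sit at various points between these two extremes.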
There are two main applications of synthetic data:
- Data anonymization: Both industrial and consumer data are increasingly becoming sensitive assets that require strong security and privacy measures. We have previously discussed that data anonymization is one of the many paths to data privacy. One way to anonymize real data so that it can be shared with parties not authorized to see the originals is to replace or supplement it with synthetic data. A very recent example is the IRS's use of synthetic data in testing new IT systems. Synthetic data is also used for other quality-testing applications, such as detecting credit card fraud.
- Training machine learning algorithms: Machine learning (ML) algorithms, instead of following preset rules, look at examples and derive the rules themselves. While ML algorithms have numerous advantages over traditional computer algorithms, one caveat is that sophisticated approaches like deep learning require a lot of annotated training data. IT companies like Google and Amazon have had little trouble accessing consumer data to train their ML algorithms. However, physical industries like chemicals and manufacturing have been held back from unlocking the true potential of ML by the lack of sufficient and diverse training data, especially in applications like inspection and identification systems, where training data is difficult to obtain. For example, a digital twin of a chemical plant must be trained on all sorts of possibilities, including failure scenarios; if there isn't enough data on situations like a complete plant shutdown, the algorithm may not be robust enough to detect such a black swan incident. Likewise, while images collected in real-world situations are used to train autonomous cars, they cannot capture every edge case, such as a bright snowy day in a desert, a child suddenly appearing on the road while the vehicle is in motion, or a dead animal on the road. Even where technically possible, staging such situations to photograph them would be difficult, dangerous, or unethical. Synthetic data solves this problem by increasing the diversity of the dataset, and as its quality improves, its use in computer vision applications for autonomous vehicles and robotics is also increasing.
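The anonymization application above can be sketched minimally in Python. In this toy example (all names and figures are fabricated), direct identifiers are replaced with placeholders and the sensitive numeric field is drawn from a distribution fitted to the real column, so a test system sees statistically realistic but nonreal records:

```python
import random
import statistics

random.seed(1)

# Toy "real" records with a direct identifier and a sensitive numeric field.
# (All values here are made up for illustration.)
real_records = [
    {"name": "Alice", "income": 52000},
    {"name": "Bob", "income": 61000},
    {"name": "Carol", "income": 58000},
    {"name": "Dan", "income": 49500},
]

incomes = [r["income"] for r in real_records]
mu, sigma = statistics.mean(incomes), statistics.stdev(incomes)

# Synthetic records: no real names survive, and incomes are sampled from a
# normal distribution fitted to the real column, preserving its aggregate
# behavior for testing without exposing any individual's data.
synthetic_records = [
    {"name": f"person_{i}", "income": round(random.gauss(mu, sigma))}
    for i in range(len(real_records))
]
```

Production-grade generators model joint distributions across many columns rather than one column in isolation, but the principle is the same: the shared dataset mimics the statistics of the original without containing it.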
Synthetic data offers several advantages over the original real data, such as:
- Data diversity: Obtaining data only from easily available sources can introduce bias into applications. For example, Apple recently faced a backlash over a credit-issuing algorithm with a gender bias. Similarly, many facial recognition algorithms have drawn serious criticism for racial bias, to the point that IBM shut down its facial recognition program entirely. These failures resulted from relying on easily accessible data and show that using synthetic data to remove bias is crucial before such applications are allowed in critical assessments like police investigations.
- Privacy: Compliance with privacy regulations like GDPR and CCPA has become a major hurdle that can cost enterprises up to $1 million. Because synthetic records contain no real personal information, companies that use them can sidestep much of this regulatory burden.
- Demand-based specifications: Synthetic data can be created on demand and to customer specifications, including for edge cases. This is a useful feature for industries that require hard-to-find, customized data.
- Cost: Industrial data, especially multimedia data like images, requires a lot of expensive manual work, such as photography and annotation. Synthetic data can be produced faster, at scale, and at lower cost than real images. LexSet claims it can produce annotated synthetic images for less than 50 cents each, whereas obtaining them from real sources typically costs $1 to $4. Although the per-image saving may seem small, it matters because training deep learning algorithms typically requires around 100,000 images. ML performance generally improves with the amount of training data, although the learning curve plateaus after a certain point.
Despite these applications and advantages, synthetic data still faces several challenges:
- Data randomness: Complex data sets like optical images and radar signals are difficult to model and reproduce synthetically because they contain a great deal of natural randomness. Failing to model this naturally occurring, poorly understood randomness can result in programmer-biased data sets that over-represent mainstream cases. Using such biased synthetic data to train other machine learning algorithms will further exacerbate the failures caused by a lack of data diversity.
- Reality gap: The difference between real data and synthetic data is called the reality gap; good synthetic data has a minimal reality gap. Because modeling the feature sets of original data correctly is complex, and because synthetic data is based on our perception of the real world, it can become too artificial, and algorithms trained on it perform poorly. The large reality gap and resulting failures of the technology's early days (the early 2000s) gave synthetic data a bad reputation in industry, so its creators face the challenge of demonstrating their new capabilities to achieve wide adoption.
- Complexities: Creating synthetic data for systems where interactions between many factors lead to severe nonlinearities, such as a hyperspectral dataset or a physics-based model of a chemical plant, remains a very complicated process. In addition, although simple data like text entries is easier to synthesize than graphical data, small nuances in its features can make a big difference: while adding noise of varying intensity to a picture of a dog still leaves an image of a dog, adding the word "not" to a sentence used for NLP applications can entirely change its meaning. These complexities require continuous fine-tuning by a team of data scientists that is often difficult to find and assemble.
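The asymmetry between graphical and textual data described above can be shown in a few lines of Python (a toy sketch with invented values): small noise leaves an image's label intact, while a one-word edit inverts a sentence's meaning.

```python
import random

random.seed(2)

# Image-like data tolerates small perturbations: a slightly noisy dog photo
# is still labeled "dog". (Four made-up pixel intensities stand in for an image.)
dog_pixels = [0.2, 0.8, 0.5, 0.7]
noisy_dog = [min(1.0, max(0.0, p + random.gauss(0, 0.05))) for p in dog_pixels]
# Label after augmentation: unchanged, still "dog".

# Text does not tolerate the analogous edit: inserting one word flips the
# meaning, so the training label must flip too.
sentence = "the pump is operating normally"
negated = sentence.replace("is", "is not")
# negated == "the pump is not operating normally" -- the opposite condition.
```

This is why augmentation pipelines that are safe for images (noise, crops, color shifts) cannot be applied naively to text, and why synthesizing even "simple" data demands domain-aware fine-tuning.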
- The importance of big data in building robust applications in both industrial and consumer settings is increasing. At the same time, obtaining sufficient data to train AI models remains difficult for physical industries, and sharing that data across stakeholders is becoming more challenging.
- Companies should monitor the synthetic data space because it can fill the gap between the supply and demand of big data while also reducing the complexities of sharing data.
- Despite several advantages, synthetic data is still not a fully mature technology, especially for more complex data structures like images and physics-based models.
- Raw, unfiltered, and unprocessed data should be used, whenever possible, to train synthetic data models to prevent human bias from tainting the algorithms. Lux members should refer to this insight on how to avoid introducing bias into AI models.
- Market education about the capabilities and importance of synthetic data remains a challenging task and a prerequisite for its wide adoption.