Synthetic data is algorithmically generated information that imitates real data. It can substitute for the datasets used to test and train artificial intelligence (AI) and machine learning models. To generate synthetic data, algorithms are fed a smaller sample of real-world data and produce new data that resembles it.
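As a minimal sketch of this idea, the following Python example fits a normal distribution to a small sample and then draws new values from it. The function names and the sample values are hypothetical and chosen only for illustration; real generators typically use far richer models than a single fitted Gaussian.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of a small real sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values."""
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (illustrative values, not actual data).
real_heights_cm = [162.0, 170.5, 175.2, 168.3, 181.0, 159.8, 173.4, 166.1]

# Produce a much larger synthetic sample that resembles the small real one.
synthetic_heights_cm = generate_synthetic(real_heights_cm, n=1000)
```

The synthetic sample follows the fitted distribution, so its mean and spread stay close to those of the real sample, but none of the generated values is a copy of a real record.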
Synthetic data addresses problems in AI that stem from insufficient data, either by producing artificial data from scratch or by deriving novel and diverse training examples from existing data through data manipulation techniques. It can provide a solution when datasets are too small or when the cost of manually labeling data is prohibitively high. Synthetic datasets are often cheaper to produce than traditionally collected ones.
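One common family of data manipulation techniques is augmentation, in which a labeled example is transformed into several new examples that keep the same label. The sketch below is an illustration under stated assumptions: a grayscale image is represented as a nested list of pixel values from 0 to 255, and the helper names are hypothetical. It assumes the label is unaffected by mirroring and by small pixel noise, which does not hold for every task (mirrored text, for example).

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def add_noise(image, scale, rng):
    """Add Gaussian noise to each pixel and clamp the result to 0..255."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0, scale)))) for pixel in row]
        for row in image
    ]


def augment(image, label, noisy_copies, seed=0):
    """Return one flipped example plus several noisy ones, all with the same label."""
    rng = random.Random(seed)
    examples = [(horizontal_flip(image), label)]
    examples += [(add_noise(image, 5.0, rng), label) for _ in range(noisy_copies)]
    return examples


# A hypothetical 2x3 grayscale image with a manually assigned label.
tiny_image = [[10, 20, 30], [40, 50, 60]]
augmented = augment(tiny_image, "cat", noisy_copies=3)
```

A single manually labeled example yields four additional labeled examples here, which is how augmentation reduces the amount of manual labeling needed.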
Synthetically generated datasets can be used to train machine learning models, particularly in computer vision. Synthetic data may augment real datasets to cover parts of the data distribution that are not sufficiently represented, which helps alleviate dataset bias. Synthetic data may also be useful when real data is impossible or prohibitively difficult to acquire because of privacy or legal issues. Synthetic data in the form of driving simulations has been used to train the self-driving software of Google’s Waymo. Facebook was reported to use synthetic data to train algorithms to detect bullying language.
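To illustrate how synthetic examples can fill in an under-represented part of a dataset, the sketch below balances class counts by interpolating between randomly chosen pairs of points from each minority class. This is a simplified variant in the spirit of SMOTE, not that algorithm itself (SMOTE interpolates between nearest neighbours rather than random pairs). The function name, class labels, and data points are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Add interpolated points until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for cls, count in counts.items():
        members = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            # Pick two real points of this class and a random position between them.
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            new_samples.append([x + t * (y - x) for x, y in zip(a, b)])
            new_labels.append(cls)
    return new_samples, new_labels


# Hypothetical 2-D feature vectors: four "common" points and two "rare" ones.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0], [6.0, 7.0]]
labels = ["common"] * 4 + ["rare"] * 2

balanced_samples, balanced_labels = oversample_minority(samples, labels)
```

The original points are kept unchanged and the synthetic ones are appended after them, so the real and generated portions of the balanced dataset remain easy to tell apart.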