Synthetic Data

In 2020 alone, an estimated 59 zettabytes of data will be “created, captured, copied, and consumed.” Much of that data is sensitive, though, and not available to the people who want to build with it. Without access to data, it’s hard to make tools that actually work. Enter synthetic data: artificial information that developers and engineers can use as a stand-in for real data. MIT's Kalyan Veeramachaneni and his collaborators recently unveiled a set of open-source data generation tools.

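Synthetic data can be used to build data tables as large or as complex as a project requires. Because it carries no identifying information, it can also be shared freely, allowing teams to work more efficiently. The group's first algorithm captures the correlations between fields in a real dataset and creates a synthetic dataset that preserves those relationships, without any identifying information. The team presented this research at the 2016 IEEE International Conference on Data Science and Advanced Analytics. The next step was CTGAN (conditional tabular generative adversarial networks), an algorithm from Lei Xu that uses GANs, a type of neural network, to build and perfect synthetic data tables.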
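
To make the idea of a synthetic dataset that preserves relationships concrete, here is a minimal sketch in Python. It is not the group's algorithm: it fits a plain multivariate Gaussian to a numeric table and samples new rows from it, which is only a simplified stand-in. The column names (age, income), the row counts, and the helper names fit_gaussian and sample_synthetic are all hypothetical.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the column means and covariance matrix of a numeric table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw new rows from the fitted Gaussian; no real row is copied."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Hypothetical "real" table with two correlated columns: age and income.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=5000)
income = 1000 * age + rng.normal(0, 5000, size=5000)
real = np.column_stack([age, income])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# Compare the age/income correlation in the real and synthetic tables.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

The two printed correlations should be close, which is the property the text describes. A single Gaussian cannot handle categorical columns or skewed distributions, and sampling from a fitted model is not by itself a privacy guarantee; those limits are part of why more capable generators are of interest.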

CTGAN outperformed classic synthetic data creation techniques in 85 percent of the cases tested in Lei Xu’s study. GANs are more often used in artificial image generation, but they work well for synthetic tabular data, too. The Synthetic Data Vault combines everything the group has built so far into “a whole ecosystem,” says Veeramachaneni.

https://mostly.ai/2020/10/23/why-banks-need-synthetic-data/

https://news.mit.edu/2020/real-promise-synthetic-data-1016