How Buddywise uses Fake data for Real (deep) learning!

Buddywise synthetic data for forklift detection

“Data is food for AI” is a classic phrase from Andrew Ng, stressing the importance of high-quality data in deep learning training pipelines. But what do you do when high-quality data is scarce and acquiring more is nearly impossible? That is often the case with personal data due to privacy regulations. Our answer: synthetic data.

In this article we want to highlight the growing importance of synthetic data as a tool for training deep learning models, especially in the field of Computer Vision. We’ll also explain why we chose to follow this somewhat unconventional path.

Buddywise is a data-driven company, and we apply a plethora of processes to ensure the integrity and quality of the data we use. The inclusion of synthetic data in our training pipelines is a key aspect of our multi-faceted approach to training deep learning models and delivering accurate, reliable Computer Vision solutions. We believe synthetic data will be key to safeguarding the lives of workers in heavy industry.

What is synthetic data and why is it important?

In 2017, scientists from MIT wanted to compare the performance of deep learning models trained on real and synthetic data respectively. They hired 39 freelance data scientists and split them into 4 teams: one team experimented on real data and the other 3 teams on synthetic data. The teams that used synthetic data produced results on par with the team that used real data 70% of the time. Since 2017, development in this area has been rapid, and synthetic data has established itself as a viable alternative to real data. Gartner now predicts it will overshadow the use of real data by 2030.

Synthetic data will become the main form of data used in AI. Source: Gartner, “Maverick Research: Forget About Your Real Data — Synthetic Data Is the Future of AI,” Leinar Ramos, Jitendra Subramanyam, 24 June 2021.

Unlike real data, which is generated by actual events, synthetic data is either created artificially from scratch or produced through advanced data-manipulation techniques that yield novel and diverse training instances (a sketch of the latter follows below).

A photorealistic synthetic image generated from 3D models. Source: unity.com
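To make the data-manipulation route concrete, here is a minimal sketch in Python: it derives new training instances from one real image via random flips, brightness jitter, and sensor noise. The use of NumPy and Pillow here is our illustrative assumption, not a description of any particular production pipeline.

```python
# Illustrative sketch of the "data manipulation" route to synthetic data:
# derive new training instances from an existing image via random transforms.
# Library choices (NumPy, Pillow) are assumptions for this example.
import numpy as np
from PIL import Image, ImageEnhance

rng = np.random.default_rng(seed=42)

def synthesize_variant(img: Image.Image) -> Image.Image:
    """Produce one novel training instance from a real source image."""
    out = img
    if rng.random() < 0.5:                      # random horizontal flip
        out = out.transpose(Image.FLIP_LEFT_RIGHT)
    brightness = 0.7 + 0.6 * rng.random()       # brightness jitter in [0.7, 1.3]
    out = ImageEnhance.Brightness(out).enhance(brightness)
    arr = np.asarray(out).astype(np.float32)
    arr += rng.normal(0.0, 5.0, arr.shape)      # mild Gaussian sensor noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Usage: turn one real photo into many diverse variants.
# source = Image.open("forklift.jpg").convert("RGB")
# variants = [synthesize_variant(source) for _ in range(10)]
```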

Including synthetic data in a deep learning pipeline can offer many benefits from different perspectives. Below we present the main reasons why we chose to use synthetic data.

Abundance of training samples

The development of a synthetic-data generator is a one-time investment. Once built, it can supply a potentially unlimited amount of labeled data, e.g. RGB images, segmentation maps, synthetic video clips, and other modalities.
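A minimal sketch of why the labels come for free: when the generator places the objects itself, it knows the annotations by construction. Here a colored rectangle stands in for a rendered 3D asset; a real generator (e.g. a game engine such as Unity) would output photorealistic frames the same way.

```python
# Minimal sketch: labels come "for free" when the generator places objects
# itself. A colored rectangle stands in for a rendered 3D asset here.
import numpy as np
from PIL import Image, ImageDraw

rng = np.random.default_rng(0)

def generate_labeled_sample(size: int = 256):
    """Return (image, bounding_box) with the annotation known by construction."""
    img = Image.new("RGB", (size, size), color=(90, 90, 90))   # flat background
    w, h = rng.integers(30, 80, size=2)                        # object extent
    x0 = int(rng.integers(0, size - w))
    y0 = int(rng.integers(0, size - h))
    box = (x0, y0, x0 + int(w), y0 + int(h))
    ImageDraw.Draw(img).rectangle(box, fill=(220, 120, 40))    # the "object"
    return img, box                                            # image + exact label

# An arbitrarily large labeled dataset, no human annotation required:
dataset = [generate_labeled_sample() for _ in range(1000)]
```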

Low-cost / time-efficient

For a company to own a fair amount of quality real data for a specific use case, there are two options: (1) collect, clean, and annotate the data itself, or (2) buy data from companies that offer these services. The first option is time-consuming and expensive, and requires specific domain expertise. The second option can cost as much as €1 per image, and each model typically needs many thousands of images. Synthetic data acquisition, by contrast, is automated, and the cost per image can be as low as €0.01.
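As a back-of-the-envelope check of those figures (the dataset size of 50,000 images is an assumption for the sake of the example):

```python
# Back-of-the-envelope cost comparison using the per-image prices above.
# The dataset size is an illustrative assumption.
n_images = 50_000
cost_real = n_images * 1.00        # ~ €1 per purchased real image
cost_synthetic = n_images * 0.01   # ~ €0.01 per generated synthetic image
print(f"real: €{cost_real:,.0f}  synthetic: €{cost_synthetic:,.0f}")
# real: €50,000  synthetic: €500
```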

Enormous diversity

The success of deep learning models relies heavily on the quality of the training dataset. One aspect of this quality is diversity: the more variability a dataset has, the more accurate the predictions a model can eventually make. Under a synthetic-data creation regime, the space of possible feature and scenario combinations is practically unlimited, so the contribution of such training samples can be extremely rich.
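One common way to realize this diversity is domain randomization: every parameter a renderer would consume is sampled from a range, so the space of distinct training scenes is effectively unlimited. The parameter names below are hypothetical illustrations, not our actual generator’s configuration.

```python
# Sketch of domain randomization: each call yields a distinct scene config
# for a renderer. Parameter names and ranges are hypothetical.
import random

def sample_scene_config() -> dict:
    return {
        "camera_height_m": random.uniform(2.0, 6.0),
        "camera_yaw_deg":  random.uniform(0.0, 360.0),
        "sun_elevation":   random.uniform(5.0, 80.0),   # lighting variety
        "floor_texture":   random.choice(["concrete", "asphalt", "epoxy"]),
        "n_forklifts":     random.randint(0, 3),
        "n_pedestrians":   random.randint(0, 8),
        "fog_density":     random.uniform(0.0, 0.3),    # weather variety
    }

# 10,000 configs -> 10,000 distinct rendered scenes.
configs = [sample_scene_config() for _ in range(10_000)]
```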

Edge-case coverage

It is risky to trust a deep learning model on things it has never seen during training, especially in applications where human lives are at stake. We deal with many high-impact edge cases that rarely occur in the real world and are thus difficult to capture and feed to our models. With synthetic data, such incidents can be orchestrated artificially, teaching the model to identify them.
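A sketch of what such orchestration might look like: rare, safety-critical scenarios are deliberately over-represented relative to their real-world frequency. The scenario names and weights are hypothetical illustrations.

```python
# Sketch: over-represent rare, safety-critical scenarios in generated data.
# Scenario names and weights are hypothetical illustrations.
import random

SCENARIOS = {
    # scenario                     share of generated data
    "normal_operation":            0.50,
    "pedestrian_behind_forklift":  0.20,   # rare in real footage
    "forklift_tipping":            0.15,   # almost never captured for real
    "low_light_near_miss":         0.15,
}

def sample_scenario() -> str:
    names, weights = zip(*SCENARIOS.items())
    return random.choices(names, weights=weights, k=1)[0]
```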

Privacy protection compliant

Synthetic data is non-identifiable information, meaning it is exempt from data privacy regulations. The exception is cases where real environments and people are replicated to better reflect a specific site. Even there, however, privacy regulators are sending encouraging signals on the topic of synthetic data as they come to grasp its numerous economic and societal benefits. In France, for instance, the CNIL recently accepted a synthetic data generation (SDG) approach as a valid form of anonymization. Given that no other anonymization approach has received such a designation before, this suggests regulators consider data synthesis more dependable than the alternatives.

Challenges of using synthetic data

Nevertheless, every coin has two sides. Synthetic data is not a cure-all, and we should be aware of its pitfalls.

Synthetic data is synthetic

Synthetic data is rarely a perfect reproduction of genuine data. Consequently, the distribution of a synthetic dataset is unlikely to match the target distribution of our problem, meaning a model trained on it alone may fail to make accurate predictions under real-world conditions. To tackle this issue, we systematically build and test hybrid datasets, injecting real data until our models are accurate and reliable enough under our evaluation protocols.
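A minimal sketch of this mixing step, assuming PyTorch; `synthetic_ds` and `real_ds` are hypothetical stand-ins for the actual dataset objects.

```python
# Sketch of building a hybrid dataset: all synthetic samples plus a growing
# share of real ones. PyTorch is assumed; the dataset names are placeholders.
import random
from torch.utils.data import ConcatDataset, Subset

def build_hybrid(synthetic_ds, real_ds, real_fraction: float):
    """Mix every synthetic sample with a random subset of the real ones."""
    k = int(real_fraction * len(real_ds))
    real_idx = random.sample(range(len(real_ds)), k)
    return ConcatDataset([synthetic_ds, Subset(real_ds, real_idx)])

# Systematically inject more real data until evaluation metrics are met:
# for frac in (0.0, 0.05, 0.10, 0.25):
#     train_ds = build_hybrid(synthetic_ds, real_ds, frac)
#     ... train, then evaluate against the protocol ...
```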

Bias

Bias occurs when data is gathered in a way that makes the samples unrepresentative of the population of interest. It is an inherent challenge in every dataset’s creation, especially for human-created datasets, and it can creep into synthetic datasets that are built on top of such real-data collections. For that reason, before training any model we deeply analyze and understand our data. Once we have identified its vulnerabilities, we cover them by creating synthetic data in a targeted way, so that our samples end up as representative of the real world as possible (as sketched below).

Demonstration against racial bias in London. Photo: John Cameron
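One targeted-generation step can be as simple as counting labels and requesting synthetic samples to fill the gaps. The class names and the per-class target below are hypothetical illustrations.

```python
# Sketch of one targeted-generation step: count labels in the current dataset
# and request synthetic samples for under-represented classes.
# Class names and the per-class target are hypothetical.
from collections import Counter

def synthetic_quota(labels: list[str], target_per_class: int = 5_000) -> dict:
    """How many synthetic samples to generate per under-represented class."""
    counts = Counter(labels)
    return {cls: max(0, target_per_class - n) for cls, n in counts.items()}

labels = ["forklift"] * 4_800 + ["pedestrian"] * 900 + ["pallet"] * 3_100
print(synthetic_quota(labels))
# {'forklift': 200, 'pedestrian': 4100, 'pallet': 1900}
```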

Conclusion

Synthetic data that reflects the important statistical properties of the underlying real-world data can solve a wide range of problems. It is inexpensive compared to collecting large datasets and can foster deep learning model development without compromising customer privacy. It’s estimated that by 2024, 60% of the data used to develop AI and analytics projects will be synthetically generated.

Synthetic data is a very promising way to tackle problems that real data alone cannot solve. At Buddywise, we are proud to be on the front line of innovation, applying the latest technology to keep people safe!
