The dangers of synthetic data: Training AIs with simulations
08-07-2022 | By Robin Mitchell
A recent VentureBeat report discusses how synthetic data can help improve AI. However, engineers need to be extremely cautious when using such data because, at the end of the day, it is not real. Why is data gathering a big challenge, what is synthetic data, and how does it threaten AI training?
Why is data gathering a big challenge for AI?
While the concept of AI has been around for decades, its use has only accelerated in the past decade. One reason is that hardware platforms designed to accelerate AI algorithms have only recently become available, meaning that older AI algorithms were executed inefficiently on general-purpose CPUs. However, by far the most prominent reason for the sudden explosion of AI is the enormous amount of data now available to engineers for training.
Fundamentally, neural networks (the mechanism behind most modern AI) are surprisingly simple. A neural network consists of functional blocks, called nodes, that take in multiple inputs, multiply these input signals by weights, add the weighted signals together, and pass the sum through an activation function to produce an output signal. These nodes are stacked in discrete layers that pass on their processed signals to additional node layers. By changing the weights, the network can be trained to produce the correct outputs for specific inputs. For example, a neural network can be trained to recognise cats, with an entire image trickling down into one of two outputs: cat or no cat.
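The node described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real image classifier: the sigmoid activation, the bias term, and every weight and input value are assumptions chosen for the example, and a practical network would learn its weights from data instead of having them written by hand.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add the results, apply an activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation (an assumption): squashes the sum into the range 0..1.
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is a group of neurons that all receive the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical network: three inputs, a hidden layer of two nodes, one output node.
features = [0.5, -1.0, 2.0]
hidden = layer(features, [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]], [0.0, 0.1])
cat_score = neuron(hidden, [1.5, -2.0], bias=0.2)

print("cat" if cat_score > 0.5 else "no cat")
```

Training amounts to nudging the numbers in the weight lists until `cat_score` lands on the right side of the threshold for as many labelled examples as possible.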
For an AI to be trained, it needs data, and, in general, the more varied and representative that data is, the better the AI performs. In the cat example, the AI must be shown not only multiple images of each cat but also images of every breed. This way, the AI is trained to recognise cats by common features such as tails, eyes, ears, and body shape.
It is this requirement for large, varied, and carefully annotated datasets that causes trouble for AI developers. Before data protection laws came into force, AI developers could turn to large tech companies such as Facebook, Google, and Amazon for customer data that could include conversations, messages, images, behavioural patterns, and site history. But the introduction of strong data protection laws (such as GDPR) now prevents companies from obtaining such data without explicit permission. Companies can get around these restrictions by training in-house AI solutions on data they gather directly from customers, often without those customers realising it, with one common example being Captcha tests (asking users to select all the images containing a fire truck trains an AI while also providing security).
What is synthetic data?
One solution to the data challenge is to create synthetic data. As the name suggests, synthetic data is artificially generated data that closely resembles real-world data and is often sourced from simulation. For example, a program can be designed to create brand new handwriting styles (such programs already exist), large portions of text can be written in these styles, and images of this text can be automatically fed into a machine learning neural net.
The significant advantage of this is that the original text is already digitally known, which means that the AI’s output can be compared to what was originally generated. Thus, a closed-loop system can improve the ability of the AI to recognise handwriting while never needing to see real-world examples of handwriting.
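The closed loop described above can be sketched as follows. Everything here is a toy stand-in: the 3x3 `GLYPHS` bitmaps, the pixel-flipping noise, and the simple perceptron update are assumptions made for illustration, not how any real handwriting generator or recogniser works. The point is only that the generator knows the correct label for every sample it creates, so the recogniser's guess can be checked and corrected automatically.

```python
import random

# Hypothetical 3x3 "glyphs" standing in for handwritten characters.
GLYPHS = {
    "I": [0, 1, 0,
          0, 1, 0,
          0, 1, 0],
    "O": [1, 1, 1,
          1, 0, 1,
          1, 1, 1],
}

def synthesize(rng, flip_prob=0.05):
    """Render a random glyph with pixel noise; the label is known by construction."""
    label = rng.choice(sorted(GLYPHS))
    pixels = [p ^ 1 if rng.random() < flip_prob else p for p in GLYPHS[label]]
    return pixels, label

def predict(weights, bias, pixels):
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return "O" if score > 0 else "I"

def train(rng, steps=2000, lr=0.1):
    """Closed loop: generate, guess, compare with the known label, adjust weights."""
    weights, bias = [0.0] * 9, 0.0
    for _ in range(steps):
        pixels, label = synthesize(rng)
        if predict(weights, bias, pixels) != label:
            direction = 1 if label == "O" else -1
            weights = [w + lr * direction * p for w, p in zip(weights, pixels)]
            bias += lr * direction
    return weights, bias

rng = random.Random(0)
weights, bias = train(rng)

# Evaluate on fresh synthetic samples, again using the known labels.
samples = [synthesize(rng) for _ in range(500)]
accuracy = sum(predict(weights, bias, p) == y for p, y in samples) / len(samples)
print(f"accuracy on fresh synthetic samples: {accuracy:.1%}")
```

Note that the accuracy printed here only measures how well the recogniser handles more data from the same generator; it says nothing about how it would cope with genuine handwriting.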
Another potential application for synthetic data is in autonomous vehicles. For autonomous vehicles to be safe, they need to prove that they can operate under all conditions, which is very hard to achieve in real life. Instead, hyper-realistic simulations can be constructed that can quickly train autonomous AI on how to drive through traffic, spot potential dangers, and react to unexpected events such as swerving vehicles, falling trees, and law enforcement stops.
How does synthetic data present a danger to AI?
A recent article from VentureBeat describes the advantages of synthetic data and how it can help AI, and while there is truth in this, engineers should exercise caution with synthetic data. Synthetic data does allow AI to be trained in a controlled, simulated environment, but however closely it resembles real-world data, it is not real.
As with climate change, financial predictions, and determining the nature of the universe, everything comes down to the accuracy of a model, and it is frightening how easily a model can be manipulated. Adjusting a few constants here and there can drastically change the output of a model to fit some intended purpose, and this manipulation could quickly spread into an AI trained on that output.
For example, a climate model built from current data that doesn’t support the current consensus can be altered so that it does, and a climate AI learning from this model would be fed potentially false information. Another example of a manipulated model could be the creation of simulated social media data with large amounts of political bias and stereotypes that don’t conform to real-world data. An AI trained on this data could only see the world through the eyes of the manipulated model, and this could be dangerous if deployed in a real-world scenario such as account monitoring, flagging, and automated bans.
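A deliberately simple sketch shows how one adjusted constant carries through to whatever is trained on a simulator's output. The linear simulator, the slope values, and the function names below are all hypothetical; real climate or social media models are vastly more complex, but the mechanism is the same.

```python
import random

def simulate(rng, slope, n=200, noise=0.1):
    """Hypothetical simulator: y = slope * x plus a little Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    """Least-squares slope of a line through the origin (the 'trained model')."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_abs_error(slope, xs, ys):
    return sum(abs(slope * x - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
TRUE_SLOPE = 2.0   # stands in for the real-world process (assumed value)
TUNED_SLOPE = 2.5  # the same simulator with one constant adjusted (assumed value)

real_xs, real_ys = simulate(rng, TRUE_SLOPE)     # plays the role of real-world data
honest = fit_slope(*simulate(rng, TRUE_SLOPE))   # trained on faithful synthetic data
skewed = fit_slope(*simulate(rng, TUNED_SLOPE))  # trained on manipulated synthetic data

honest_error = mean_abs_error(honest, real_xs, real_ys)
skewed_error = mean_abs_error(skewed, real_xs, real_ys)
print(f"error when trained on faithful data: {honest_error:.3f}")
print(f"error when trained on tuned data:    {skewed_error:.3f}")
```

The model trained on the tuned simulator fits its own training data just as well as the honest one does, so nothing in the training process itself reveals the problem; the error only appears when the model meets data from the real process.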
Overall, AI is only as good as the data it’s fed, and humans tend to use statistics and models to prove their beliefs. Synthetic data can be used to aid AI in learning, but extreme caution should be taken as, at the end of the day, synthetic data is not real.