cfusting · 6 years ago
The author misunderstands how simulated data is created by GANs, VAEs, and other non-physics-based simulations. Let's say you have a dataset and would like to create synthetic data from it with a GAN. You are then trying to estimate the distribution D of the data: the GAN learns the joint distribution P(X1, X2, ..., Xn) (where, in the image case, each X is usually a pixel) such that one may sample from D and obtain a new, synthetic image. Indeed, one will generate novel data, but the estimated distribution D is at best merely a description of the original data, and in practice a little bit (or a lot) off.
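
(A toy sketch of that argument, with a fitted Gaussian standing in for the GAN since the sampling logic is the same; all numbers here are made up. The "world" has two modes, the training set only saw one, and the synthetic samples never cover the missing mode.)

    # Toy stand-in for a GAN/VAE: a generative model fitted to a dataset
    # can only describe the distribution it actually saw.
    import numpy as np

    rng = np.random.default_rng(0)

    # The real world: a mix of two populations.
    world = np.concatenate([rng.normal(0, 1, 5000), rng.normal(8, 1, 5000)])

    # The dataset we happened to collect covers only the first population.
    train = rng.normal(0, 1, 500)

    # "Training the generator" = estimating D from the dataset.
    mu, sigma = train.mean(), train.std()
    synthetic = rng.normal(mu, sigma, 100_000)

    # Plenty of novel samples, but none near the unseen mode around 8.
    print((synthetic > 5).mean())   # ~0.0
    print((world > 5).mean())       # ~0.5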

Now turn to the machine learning problem we sought to solve with the new synthetic data: what is P(y | X1, X2, ..., Xn), where y is usually a class like "bird"? In other words, given an image, predict its label. Since the synthetic data was generated knowing only the statistics of the original data, it can add no value beyond plausible examples derived from the original data itself.

Will this improve the accuracy of a model by providing additional edge case examples and filling in gaps? Somewhat. Will it understand data not represented by the original data and substitute for more thorough, diverse datasets? Absolutely not.

In terms of model improvement, yes synthetic data can help. In terms of the arms race? No. True examples provide knowledge that is unique. If one uses a physics engine (GTA is popular for self-driving cars) one can gather truly novel data; this is not the case for GANs.

It's concerning how willing people are to write articles on this subject without understanding the mathematics underlying the technology.

Do your homework and RTFM.

EsssM7QVMehFPAs · 6 years ago
You are ignoring the fact that generative AI is not a closed-loop algorithm. You can synthesize expected features in a data set and feed them to the detector, outside the bounds of the generative neural network, which rather serves the purpose of mapping into (a subset of) the proper input space.

The power of synthesis is not within the GAN or VAE, it is in the outside mechanism that guides the creation of content with specific domain knowledge about the feature space.
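
(Roughly what that outside mechanism looks like, as a sketch; render() here is a hypothetical stand-in for whatever conditional generator or graphics engine actually produces the images, and the parameter names are invented.)

    # Domain knowledge picks the conditions; the generative model only
    # maps them into the input space.
    import random

    def render(params):
        # Placeholder for a conditional GAN/VAE or a 3D renderer.
        return {"image": f"<image for {params}>", "label": params["activity"]}

    # Outside mechanism: enumerate conditions we know matter, including
    # ones that are rare in real data.
    activities = ["walking", "sitting", "falling"]
    lighting = ["daylight", "dim", "backlit"]
    occlusion = [0.0, 0.3, 0.6]

    dataset = []
    for _ in range(1000):
        params = {
            "activity": random.choice(activities),
            "lighting": random.choice(lighting),
            "occlusion": random.choice(occlusion),
        }
        dataset.append(render(params))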

This might not replace the value of real data, but it will allow you to accelerate bootstrapping, improve coverage (at the cost of accuracy), or provide free environments for auxiliary processes like CI/CD in many deep learning applications.

There is a lot of published material on synthetic data augmentation if you actually look for it.

missosoup · 6 years ago
Nothing you said disputes the above comment, and it agrees with its core premise:

"In terms of model improvement, yes synthetic data can help. In terms of the arms race? No. True examples provide knowledge that is unique. "

throwawaymath · 6 years ago
Information theoretically speaking, how do you generate a "synthetic" dataset (as the article calls it) with the same fidelity as an original dataset without having access to a critical basis set of the original? What would you do to obtain that fidelity? Extrapolate from sufficiently many independent conclusions drawn from the original?

And as a followup, if you can generate a synthetic dataset by extrapolating from sufficiently many independent conclusions drawn from the original (as opposed to having access to the original itself), would you still need to use such a dataset for training?

Things like Monte Carlo simulation can be used to approximate real world conditions, but they can't typically capture the full information density of organic data. For example, generating a ton of artificial web traffic for fraud analysis or incident response only captures a few dimensions of what real world user traffic captures.

The author talks about simulating data to focus on edge cases or avoid statistical bias, but I don't see how simulated data actually achieves that.

paggle · 6 years ago
It's not about creating information in the information-theoretic sense. It's that nobody knows how neural networks work. So even if, say, all of the knowledge to recognize that this is a screwdriver is present in the 3D model, it's not like we know how to train a neural network to detect those features from a 2D camera image. So we just generate a gazillion 2D images in various rotations and lighting conditions and let the neural network use its black magic. With no "backchannel" into the "brains" of the neural network, we can't tell it to recognize black people's emotions the same way it recognizes white people's emotions. So if we don't have enough black people in our training set, we build a way to simulate an image of a black person and hammer it into the neural network's brainstem.
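
(A minimal sketch of that "gazillion 2D images in various rotations and lighting conditions" loop, using a toy 3D point cloud and an orthographic projection; the model, the rotation axis, and the "lighting" scalar are all invented for illustration.)

    # Project a stand-in 3D model into 2D under random poses and lighting;
    # the label comes for free because we know what we rendered.
    import numpy as np

    rng = np.random.default_rng(1)
    model_points = rng.normal(size=(500, 3))      # toy "screwdriver" point cloud

    def rot_z(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    samples = []
    for _ in range(10_000):
        theta = rng.uniform(0, 2 * np.pi)          # random pose
        brightness = rng.uniform(0.5, 1.5)         # crude lighting variation
        pts_2d = (model_points @ rot_z(theta).T)[:, :2]   # orthographic projection
        samples.append((pts_2d * brightness, "screwdriver"))
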
jeromebaek · 6 years ago
> Information theoretically speaking, how do you generate a "synthetic" dataset (as the article calls it) with the same fidelity as an original dataset without having access to a critical basis set of the original?

Information theoretically speaking, this is impossible. The synthetic dataset will always have exploitable mathematical properties in a way a non-synthetic dataset will not, and it will open up the trained model to easy adversarial attacks.

ska · 6 years ago
> Information theoretically speaking, how do you generate a "synthetic" dataset (as the article calls it) with the same fidelity as an original dataset without having access to a critical basis set of the original?

You don't. It's not useless as a technique, but it is limited. IMO more limited than the article's author presents, but they do touch on some of the useful bits.

gilbaz · 6 years ago
Cool points -

"...original dataset without having access to a critical basis set of the original?"

I think that they're not trying to copy existing datasets but are trying to generate new datasets that solve various computer vision use-cases. Looks like they're using 3D photorealistic models and environments to then generate 2D data. It is a cool idea: if they had the ability to synthesize a large number of 3D people and objects, insert them into a 3D environment in ways that made sense, and then run motion simulation, they could hypothetically create an incredible amount of high-quality data. Sounds pretty hard to do honestly...

I think Monte Carlo is used for something very different than computer vision / machine learning. Monte Carlo is usually used to estimate an average result given many varying input variables and a simplified model of the problem. So if I want to estimate how far my paper airplane will fly and I have a simulator, I would vary the paper thickness, folds, and wind. Each time I would run the simulator and get a result, and then I can estimate the average distance the paper airplane would go! (actually sounds like a fun project lol). Anyway, this is just different.
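
(The paper-airplane example as a few lines of Monte Carlo; the flight_distance() formula is entirely made up and just stands in for a real simulator.)

    # Vary the inputs, run the simulator, average the results.
    import random

    def flight_distance(thickness_mm, folds, wind_ms):
        # Invented stand-in for a real flight simulator.
        return max(0.0, 8.0 - 3.0 * thickness_mm + 0.4 * folds - 1.5 * wind_ms)

    n = 100_000
    total = 0.0
    for _ in range(n):
        total += flight_distance(
            thickness_mm=random.uniform(0.05, 0.2),
            folds=random.randint(4, 10),
            wind_ms=random.uniform(0.0, 3.0),
        )
    print("estimated average distance (m):", total / n)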

Simulation is good for edge cases because you can simulate them disproportionately to their prevalence in the real world. So let's say that we're in a smart store and we want to recognize when an elderly person falls on the floor, to send human help to the correct location. This happens maybe once in 5 years in a given store. If we were to gather data we might get 10 examples. If they can simulate this, they could simulate 100k elderly people falling and then train models to recognize it! Kind of crazy really.
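
(What that rebalancing might look like in code, as a sketch; simulate_fall() is a hypothetical placeholder for the actual simulation, and the counts mirror the numbers in the example above.)

    # Mix a handful of real "fall" clips with a large number of simulated
    # ones so the rare class is well represented in training.
    import random

    def simulate_fall(seed):
        # Placeholder for a physics/graphics simulation of a person falling.
        return {"clip": f"<simulated clip {seed}>", "label": "fall"}

    real_falls = [{"clip": f"<real fall {i}>", "label": "fall"} for i in range(10)]
    real_normal = [{"clip": f"<real clip {i}>", "label": "normal"} for i in range(50_000)]
    simulated_falls = [simulate_fall(i) for i in range(100_000)]

    # The resulting mix is only as good as the simulator's assumptions
    # about how falls actually look.
    train_set = real_normal + real_falls + simulated_falls
    random.shuffle(train_set)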

deehouie · 6 years ago
But then how do you simulate, or imagine, all the possible ways of falling and all the possible places this could happen? You have one sample, that's all. Ultimately, you have to use domain knowledge, but domain knowledge comes from observed data. High fidelity comes from having a lot of data. This takes you back to square one.
TrackerFF · 6 years ago
Hasn't that been a thing with at least car/vehicle detection etc. for a while now?

Generating tons of data from simply using decent 3D renderings, made with game engines etc.

cbrun · 6 years ago
All good points. In this case, the original dataset is created from real-world body scans. You collect enough scans in this "base collection of scans" to have a "real" distribution of the world. You can then span a latent space on top of this initial distribution and use GANs to further scale it. This isn't as good as real data yet, but it generates results that are better than limited quantities of real data alone. Agree with your point around the Monte Carlo simulation. Synthetic data is not the be-all end-all for training neural networks.
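
(A sketch of what "spanning a latent space on top of the scans" could mean, with PCA standing in for whatever encoder or GAN is actually used; the scan data here is random and purely illustrative.)

    # Fit a linear latent space to real scans, then sample new latent codes
    # and decode them. The variety is bounded by the scans that defined it.
    import numpy as np

    rng = np.random.default_rng(5)
    scans = rng.normal(size=(300, 5000))          # stand-in for 300 body scans

    mean = scans.mean(axis=0)
    _, _, components = np.linalg.svd(scans - mean, full_matrices=False)
    basis = components[:20]                       # 20-dimensional latent space

    latents = (scans - mean) @ basis.T
    mu, cov = latents.mean(axis=0), np.cov(latents, rowvar=False)

    new_latents = rng.multivariate_normal(mu, cov, size=1_000)
    synthetic_scans = new_latents @ basis + mean
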
yshcht · 6 years ago
I think the work on domain randomization for visual data is something that can be worth exploring. https://lilianweng.github.io/lil-log/2019/05/05/domain-rando...
proverbialbunny · 6 years ago
How you do it is similar to this talk: https://youtu.be/MiiWzJE0fEA

It's not perfect. You do need real world data. Synthetic data is great for creating edge cases. Say you want your model to classify more general use cases, but your real world data is limited. You can generate data that fits edge cases with some variance and the model will accept more variety, reducing over fitting.
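
(A sketch of that "edge cases with some variance" idea as plain augmentation; the jitter parameters are arbitrary and the 64x64 array stands in for a scarce real image.)

    # Generate variants around a scarce example so the model sees more
    # variety than the original data alone provides.
    import numpy as np

    rng = np.random.default_rng(2)

    def augment(image, n_variants=20):
        variants = []
        for _ in range(n_variants):
            noisy = image + rng.normal(scale=0.02, size=image.shape)  # sensor noise
            bright = noisy * rng.uniform(0.7, 1.3)                    # lighting shift
            flipped = bright[:, ::-1] if rng.random() < 0.5 else bright
            variants.append(np.clip(flipped, 0.0, 1.0))
        return variants

    rare_example = rng.random((64, 64))     # stand-in for a scarce edge-case image
    synthetic_variants = augment(rare_example)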

acollins1331 · 6 years ago
Interpolation between known sets? You could have an envelope of real world conditions and create thousands of samples that are variations in between for purposes of training.
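
(A sketch of that interpolation idea as convex combinations of two real samples, mixup-style; whether the in-between points are physically plausible depends entirely on the domain.)

    # Blend pairs of known samples (and their labels) to fill the gap
    # between two measured conditions.
    import numpy as np

    rng = np.random.default_rng(3)
    x_a, y_a = rng.random(128), 0.0      # real sample from one condition
    x_b, y_b = rng.random(128), 1.0      # real sample from another condition

    interpolated = []
    for _ in range(1000):
        lam = rng.uniform(0.0, 1.0)
        interpolated.append((lam * x_a + (1 - lam) * x_b,
                             lam * y_a + (1 - lam) * y_b))
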
omarhaneef · 6 years ago
This one time we wanted to partner with a medical research team in the university. They had data on a particular disease, and wanted to present it to users visually.

Now this was for the public good and we were going to fund the technology to display the data, and they would provide the data. This way people could assess how much various drugs could help them and what the outcomes were.

It was also thought that other researchers could find patterns in the data.

Suddenly the not-for-profit institute got cold feet because they would be "giving away" the data they had spent millions to acquire. Meanwhile we, a for-profit institute, were happy to fund our share as a public good.

They decided that, instead of giving away their data, they would give away simulated data. This, it was felt, would benefit the patients and researchers who might draw conclusions from the data.

Now these are PhDs at the top of their field. But, you know, it's sort of obvious that all they would do is reproduce their biases and make it so that no one else could challenge those biases. I mean, for you data science types, this is 101.

Ever since that experience, I have a distrust of simulated data.

cbrun · 6 years ago
Sounds like a pretty bad experience indeed - surprised that was their recommendation vs completely anonymizing the data. You didn't share whether you went ahead with it and saw any results? If so, it sounds like they were not good. Either way, I don't think you can let one bad experience cast doubt on a whole field. There are plenty of examples of medical research institutes using synthetic data in combination with real patient data to improve their neural nets. I'm no medical expert, but data augmentation or full simulation works when it's used in the right context. Having said that, creating biased algorithms that generate biased data is certainly a reality as well.
omarhaneef · 6 years ago
we didn't go ahead with it because we kept having calls with the different people in the university, and they kept putting off the decision. This was years ago and as far as I know, maybe their process is still going on.
Buttons840 · 6 years ago
You could simulate your own data that disagrees with their simulated data and thus draw attention to the absurdity of the whole situation.
omarhaneef · 6 years ago
this made me chuckle out loud. Great idea. Should have done exactly that.
gilbaz · 6 years ago
def not for every situation. I was happy to see that these guys weren't starting from medical use-cases. Way too hard.
vonnik · 6 years ago
There are a couple points not generally made in discussions of data and the great AI race.

Most data is crap. So the mountains of data that are supposedly an advantage, in China or in the vaults of the large corporations, are not fit for purpose without a tremendous amount of data pre-processing, and even then... And that means the real chokepoint is data science talent, not data quantity. In other words, in many cases, the premise that sheer data volume confers an advantage should be questioned.

Secondly, a lot of research is focused on few-shot, one-shot or no-shot learning. That is, the AI industry will make this constraint increasingly obsolete.

Thirdly, synthetic data is only as good as the assumptions you made while creating it. But how did you reach those assumptions? By what argument should they be treated as a source of truth? If you are making those assumptions based on your wide experience with real-world data, well then, we run into the same scarcity, now mediated by the mind of the person creating the synthetic data.

gilbaz · 6 years ago
Good points!

Most data is def crap lol.

I don't see few-shot, one-shot, or no-shot learning getting anywhere close to standard supervised learning for anything practical. It really doesn't make sense in production settings at all.

You have a function that you want to learn, let's say a mapping between an RGB image and a segmentation map. For most applications you're never really in a situation where a production product is dealing with visual scenes/objects it has never seen before. In factories, in smart stores, in cars, in AR scenarios, I just don't see it happening. And if that case is removed, I'm thinking: OK, so when can I get good enough results from a tiny dataset? Machine learning isn't magic; you're trying to learn a function with 100 million parameters using a dataset, and I just don't see the math working out. More data provides better results: it inserts more information to create a more relevant mapping function from input to output.

Third point is great! As long as the models are somehow based on real-world scans I think a lot of good can come from this. The funny thing is, there is so much bias in networks trained today precisely because the data captured is usually small and captured from a specific area/population/setting. If you had a great synthetic data generation engine you could at least generate equal representation of gender, age groups, ethnicities, ... etc.

Overall great points!

cbrun · 6 years ago
All valid points - thanks for the comment.

1. Totally agree with the point about data science talent being one of the bottlenecks. And in that area, the big tech companies already have the best & brightest workforce. "Most data is crap" is a bit of a stretch, but I agree that a lot of pre-processing is required. But more and more platforms and/or services companies are filling this gap.

2. 100%. This is a short term problem.

3. Agree to some extent. You can span latent spaces on top of an initial batch to overcome scarcity and use GANs to scale up. Data augmentation is not perfect but solves a lot of problems in production. Also specifically as it relates to computer vision, it's "easier" to know what the real world looks like and try to replicate it. It's hard in practice but the assumptions are mostly agreed upon.

ggggtez · 6 years ago
Exactly this. Amazing, a picture of a white man's arm holding a box of orange juice. But what if the person is a woman, or dark-skinned, or has prosthetics, or is a child, or it's a bag/bottle instead of a box, or the lighting is different, or the camera is low resolution, or there is someone standing in the way...

Anyone in this space is well aware that the benefit of big data isn't just the amount of data, but that it's a real, representative sample of the type of data you are actually going to need to work with. Big data solves the problem of people being bad at creating simulations. To suggest simulations as a solution to big data is kinda getting the relationship backwards.

cbrun · 6 years ago
Respectfully disagree. Again, I'm not suggesting SD will solve all problems. Big data is critical and will remain so. However, using a combination of SD and real data will make AI algorithms more robust than using big data alone. I do agree that the world is messy and it's hard to recreate the chaos and weirdness of the world. However, to think that we won't at some point be able to completely mimic the real world and all the variations out there is strange. Re: your example, it's actually pretty easy to span millions of humans based on ethnicity, age, body mass, etc. It's just a matter of time until this problem gets solved.
kory · 6 years ago
The answer is no, since when you generate data, you either:

* Have a set of data as a "basis"

  1. This diminishes the "equalization" factor, since you need a lot of data to get a good approximation of the distribution anyway.

  2. You need to create a model based off of the set, which mathematically should be close to the same problem as just building the target model.

* Have no (or only a small) training set to use

  1. You need to create a model that probably uses some statistical distribution to generate. Your target model will just learn that distribution (see the sketch below).

  2. Your initial assumptions create a distribution, and that is not going to be the same distribution as real-world data. It may be painfully off-base. I've worked on this problem for months and it's fairly difficult to get right even in an easy (modeled by simple statistical distributions) scenario.

There are problems where generating data can work, but they're specific problems, or it can only be used for rare edge cases that don't show up enough in a dataset. For the most difficult problems it is probably just as difficult to generate "correct" data as it is to generate a model without real-world data.
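
(Toy numbers for the "your target model will just learn that distribution" point; a simple mean estimator stands in for the target model, and the values are invented.)

    # A generator fitted on 20 real points passes its estimation error
    # straight through to anything trained on its output.
    import numpy as np

    rng = np.random.default_rng(4)

    true_mean = 10.0
    small_real_sample = rng.normal(true_mean, 3.0, size=20)

    gen_mean = small_real_sample.mean()            # the generator's belief
    synthetic = rng.normal(gen_mean, 3.0, size=1_000_000)

    model_estimate = synthetic.mean()              # "trained" on synthetic data
    print(true_mean, gen_mean, model_estimate)     # matches the generator, not the truth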

gilbaz · 6 years ago
Yeah, but you're assuming that you know how they're creating this data. Just throwing an option out there - what if they created a latent space of 3D people and iteratively expanded it with GANs and 2D real image datasets? That would generalize.

Just a thought, not sure what's really going on there, I just know that they probably have something interesting they're cooking up!

This is a really crazy vision

kory · 6 years ago
Training a GAN to generate people without a significantly large dataset (if that's even possible) is probably just as difficult a problem as building the model you want in the end without sufficient data.

Assuming those image sets are small they will create a model with a large bias. If you're talking about fine-tuning an existing model with small datasets, this is done already and works fairly well if not overused.

It all comes down to: to create the first "data-generating" model you need a lot of data and compute. Expanding it is a different story, but that isn't where the problem lies. We come full-circle back to the same problem as what we started with: big players can afford to build these models and small players can't.

zwieback · 6 years ago
I read this because it's an interesting question, but the article is just an advertisement for the poster's company. I didn't really learn anything and half-way through felt like a sucker.
ggggtez · 6 years ago
Should you write an article that can be summed up by "no"?

>For a lot of tasks the performance works well, but for extreme precision it will not fly — yet.

But we all knew that going in, didn't we?

cbrun · 6 years ago
Disagree, otherwise I obviously wouldn't have written this piece. ;) Synthetic data (SD) is not a silver bullet that will solve all problems, but it opens up a lot of opportunities. I'm seeing cool startups using SD to accelerate their R&D efforts and launch products in production in ways I didn't see 2 years ago. Imho, the quality of SD is reaching a tipping point and the sim2real gap is starting to disappear.
throwlaplace · 6 years ago
This is not a personal attack, but I would just like to point out the insane number of weasel words/marketing-speak phrases (i.e., intentionally imprecise but rhetorically powerful terms) in your response:

>is not a silver bullet

>opens up a lot of opportunities

>accelerate their R&D efforts

>Imho

>reaching a tipping point

>gap

>starting to disappear.

This is what it reads like:

"we're getting to being able to approach the cusp of potentially honing in on the vicinity of the realization of this technology in the future"

So your entire response doesn't commit to any strong claim at all but it sure sounds like it does!

Incidentally I wonder if you're aware you're doing this or if it comes, subconsciously, from reading lots of writing that's like this.

jeromebaek · 6 years ago
cbrun · 6 years ago
Wikipedia itself states "the adage fails to make sense with questions that are more open-ended than strict yes-no questions". I guess this means there's no black or white answer but shades of gray. How shocking: reality is more complex than it seems! ;)
hansdieter1337 · 6 years ago
If you use computer-generated images to teach your networks, you have a nice network for computer-generated images. It might be good to pre-train a net (instead of random weights), but you still need labeling of real-world images. E.g., a guy once trained a self-driving car model in GTA 5. I'm sure this algorithm won't do great in the real world. But I already see an industry forming, promising to get rid of all labeled data. And there will be idiots believing it. That's how all the Silicon Valley startups work. Unicorns on PowerPoint slides and an empty basket in reality. (src: living and working in the valley)
cbrun · 6 years ago
I read about the guy who used GTA to train a neural net. I think he was trying to make the point that, although obviously imperfect, using simulated data could actually work. I'm not saying simulated data (SD) should be the be-all end-all for training neural nets, but we're seeing algorithms perform better when they're trained using a combination of real labelled data and SD rather than real data alone. I hear your point though about hype cycles and the tunnel vision SV can often fall into.