cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
jeromebaek · 6 years ago
cbrun · 6 years ago
Wikipedia itself states "the adage fails to make sense with questions that are more open-ended than strict yes-no questions". I guess this means there's no black or white answer but shades of gray. How shocking: reality is more complex than it seems! ;)
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
notahacker · 6 years ago
There's a massive difference between 'develop technologies which are indistinguishable from magic to people who don't know how they work' and 'completely simulate the informational complexity of the world using the human minds and computers with comparatively limited information processing capabilities.'

I'm not sure the second is even a logical possibility, never mind a practical one.

cbrun · 6 years ago
That's assuming human minds and computers won't exponentially increase their processing capabilities. Moore's law disproves the latter, and I hope Neuralink or some other crazy tech company will disprove the former. That's pretty much the hope of all transhumanists.
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
notahacker · 6 years ago
> However, to think that at some point we won't be able to completely mimic the real world and all the variations out there is strange.

Why? I think it's strange to believe the opposite: that something as simple as a computer program designed by something as simple as the human mind definitely should be able to adequately simulate the complexity of the real world.

cbrun · 6 years ago
Really? I think if we were to bring back our close ancestors (4-5 generations away) they'd look at our world the way we see Harry Potter's: pure magic. I mean, flying metal birds, fire that instantly turns on and off, machines that move around like ghosts, small boxes that can talk back, musicians on demand in a box? I think you get my point: you're selling humanity short. There's no limit to human ingenuity, and there's a reason Elon Musk still doubts we're living in a simulation.
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
omarhaneef · 6 years ago
One time we wanted to partner with a medical research team at a university. They had data on a particular disease and wanted to present it to users visually.

Now this was for the public good and we were going to fund the technology to display the data, and they would provide the data. This way people could assess how much various drugs could help them and what the outcomes were.

It was also thought that other researchers could find patterns in the data.

Suddenly the not-for-profit institute got cold feet because they would be "giving away" the data they had spent millions to acquire. Meanwhile we, a for-profit institute, were happy to fund our share as a public good.

They decided that, instead of giving away their data, they would give away simulated data. This, it was felt, would benefit the patients and researchers who might draw conclusions from the data.

Now these are PhDs at the top of their field. But, you know, it's sort of obvious that all they would do is reproduce their biases and make it so that no one else could challenge those biases. I mean, for you data science types, this is 101.

Ever since that experience, I have a distrust of simulated data.

cbrun · 6 years ago
Sounds like a pretty bad experience indeed - surprised that was their recommendation rather than completely anonymizing the data. You didn't share whether you went ahead with it and saw any results; if so, it sounds like they were not good. Either way, I don't think you can let one bad experience cast doubt on a whole field. There are plenty of examples of medical research institutes using synthetic data in combination with real patient data to improve their neural nets. I'm no medical expert, but data augmentation or full simulation works when it's used in the right context. Having said that, creating biased algorithms that generate biased data is certainly a reality as well.
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
throwawaymath · 6 years ago
Information theoretically speaking, how do you generate a "synthetic" dataset (as the article calls it) with the same fidelity as an original dataset without having access to a critical basis set of the original? What would you do to obtain that fidelity? Extrapolate from sufficiently many independent conclusions drawn from the original?

And as a followup, if you can generate a synthetic dataset by extrapolating from sufficiently many independent conclusions drawn from the original (as opposed to having access to the original itself), would you still need to use such a dataset for training?

Things like Monte Carlo simulation can be used to approximate real world conditions, but they can't typically capture the full information density of organic data. For example, generating a ton of artificial web traffic for fraud analysis or incident response only captures a few dimensions of what real world user traffic captures.
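To make that concrete, here's a toy sketch of the kind of Monte Carlo generator I mean; the traffic model and every number in it are invented:

```python
# Toy "synthetic web traffic" generator: each field is drawn independently,
# so the cross-feature correlations that make real traffic informative
# (user agent vs. timing vs. fraud patterns) are simply absent.
import random

def synthetic_request():
    return {
        "latency_ms": random.lognormvariate(4.0, 0.7),             # assumed latency model
        "path": random.choice(["/", "/login", "/checkout", "/api/v1/items"]),
        "user_agent": random.choice(["Chrome", "Firefox", "curl"]),
        "is_fraud": random.random() < 0.01,                         # assumed base rate
    }

traffic = [synthetic_request() for _ in range(100_000)]
```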

The author talks about simulating data to focus on edge cases or avoid statistical bias, but I don't see how simulated data actually achieves that.

cbrun · 6 years ago
All good points. In this case, the original dataset is created from real-world body scans. You collect enough scans in this "base collection" to have a "real" distribution of the world. You can then span a latent space on top of this initial distribution and use GANs to further scale it. This isn't as good as real data yet, but it generates results that are better than limited quantities of real data alone. Agree with your point about Monte Carlo simulation. Synthetic data is not the be-all and end-all for training neural networks.
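For intuition, here is a rough sketch of that train-a-GAN-then-sample idea; the model sizes, loss choice, and the stand-in scan data are placeholders, not anyone's production setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

latent_dim, scan_dim = 128, 1024                       # assumed dimensions
real_scans = torch.randn(256, scan_dim)                # stand-in for real body scans
loader = DataLoader(real_scans, batch_size=64, shuffle=True)

# Tiny fully connected generator and discriminator, just to show the loop.
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, scan_dim))
D = nn.Sequential(nn.Linear(scan_dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for real in loader:
    z = torch.randn(real.size(0), latent_dim)
    fake = G(z)

    # Discriminator learns to separate real scans from generated ones.
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator learns to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After (much longer) training, sample as many synthetic scans as needed.
synthetic_scans = G(torch.randn(10_000, latent_dim)).detach()
```

The point isn't this exact architecture; it's that once the generator captures the distribution of the real scans, sampling more data is nearly free.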
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
hansdieter1337 · 6 years ago
If you use computer-generated images to teach your networks, you have a nice network for computer-generated images. It might be good to pre-train a net (instead of random weights), but you still need labeling of real-world images. E.g., a guy once trained a self-driving car model in GTA 5. I'm sure this algorithm won't do great in the real world. But I already see an industry forming, promising to get rid of all labeled data. And there will be idiots believing it. That's how all the Silicon Valley startups work: unicorns on PowerPoint slides and an empty basket in reality. (src: living and working in the valley)
cbrun · 6 years ago
I read about the guy who used GTA to train a neural net. I think he was trying to make the point that although obviously imperfect, using simulated data could actually work. I'm not saying simulated data (SD) should be the be-all and end-all for training neural nets, but we're seeing algorithms perform better when they're trained using a combination of real labelled data and SD rather than real data alone. I hear your point though about hype cycles and the tunnel vision SV can often fall into.
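As a sketch of what "a combination of real labelled data and SD" can look like in practice (the tensors below are random stand-ins, and the mix ratio is an assumption, not a measured optimum):

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# Stand-ins: a small real labelled set and a much larger simulated set.
real_ds  = TensorDataset(torch.randn(500, 3, 32, 32),   torch.randint(0, 10, (500,)))
synth_ds = TensorDataset(torch.randn(5_000, 3, 32, 32), torch.randint(0, 10, (5_000,)))

# Simplest recipe: train on the union. A common alternative is to pre-train
# on the synthetic set and fine-tune on the real one.
train_loader = DataLoader(ConcatDataset([real_ds, synth_ds]), batch_size=32, shuffle=True)
```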
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
ggggtez · 6 years ago
Exactly this. Amazing, a picture of a white man's arm holding a box of orange juice. But what if the person is a woman, or dark-skinned, or has prosthetics, or is a child, or it's a bag/bottle instead of a box, or the lighting is different, or the camera is low resolution, or there is someone standing in the way...

Anyone in this space is well aware that the benefit of big data isn't just the amount of data, but that it's a real, representative sample of the type of data you are actually going to need to work with. Big data solves the problem of people being bad at creating simulations. To suggest simulations as a solution to big data is kinda getting the relationship backwards.

cbrun · 6 years ago
Respectfully disagree. Again, I'm not suggesting SD will solve all problems. Big data is critical and will remain so. However, using a combination of SD and real data will make AI algorithms more robust than using big data alone. I do agree that the world is messy and it's hard to recreate its chaos and weirdness. However, to think that at some point we won't be able to completely mimic the real world and all the variations out there is strange. Re: your example, it's actually pretty easy to generate millions of human variations across ethnicity, age, body mass, etc. It's just a matter of time until this problem gets solved.
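A toy illustration of what I mean by spanning those variations; the attribute lists and distributions below are invented, and rendering the results photo-realistically is of course the hard part:

```python
import random

def sample_person():
    # Sample appearance parameters for one simulated human (all choices illustrative).
    return {
        "age": random.randint(1, 90),
        "height_cm": random.gauss(170, 10),
        "body_mass_index": random.gauss(25, 4),
        "skin_tone": random.choice(["I", "II", "III", "IV", "V", "VI"]),   # Fitzpatrick scale
        "prosthetic": random.random() < 0.02,
        "holding": random.choice(["box", "bag", "bottle", "nothing"]),
    }

# A million parameter sets is cheap; feeding them to a renderer is where the work is.
people = [sample_person() for _ in range(1_000_000)]
```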
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
ggggtez · 6 years ago
Should you write an article that can be summed up by "no"?

>For a lot of tasks the performance works well, but for extreme precision it will not fly — yet.

But we all knew that going in, didn't we?

cbrun · 6 years ago
Disagree, otherwise I obviously wouldn't have written this piece. ;) Synthetic data (SD) is not a silver bullet that will solve all problems, but it opens up a lot of opportunities. I'm seeing cool startups using SD to accelerate their R&D efforts and launch products in production in ways I didn't see 2 years ago. Imho, the quality of SD is reaching a tipping point and the sim2real gap is starting to disappear.
cbrun commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
vonnik · 6 years ago
There are a couple points not generally made in discussions of data and the great AI race.

Most data is crap. So the mountains of data that are supposedly an advantage, in China or in the vaults of the large corporations, are not fit for purpose without a tremendous amount of data pre-processing, and even then... That means the real chokepoint is data science talent, not data quantity. In other words, in many cases the premise that sheer data volume is the decisive advantage should be questioned.

Secondly, a lot of research is focused on few-shot, one-shot, or zero-shot learning. That is, the AI industry will make the data constraint increasingly obsolete.

Thirdly, synthetic data is only as good as the assumptions you made while creating it. But how did you reach those assumptions? By what argument should they be treated as a source of truth? If you make those assumptions based on your wide experience with real-world data, well then, we run into the same scarcity, mediated by the mind of the person creating the synthetic data.

cbrun · 6 years ago
All valid points - thanks for the comment.

1. Totally agree with the point about data science talent being one of the bottlenecks. And in that area, the big tech companies already have the best and brightest workforce. "Most data is crap" is a bit of a stretch, but agree that a lot of pre-processing is required. More and more platforms and/or services companies are filling this gap.

2. 100%. This is a short-term problem.

3. Agree to some extent. You can span latent spaces on top of an initial batch to overcome scarcity and use GANs to scale up. Data augmentation is not perfect but solves a lot of problems in production. Also, specifically as it relates to computer vision, it's "easier" to know what the real world looks like and try to replicate it. It's hard in practice, but the assumptions are mostly agreed upon.
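For the computer vision case, even plain augmentation of a small real batch goes a long way before any GAN enters the picture; a minimal example with common (not claimed-optimal) transform choices:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
])
# Applied at load time, this multiplies the effective variety of a small
# labelled set without any simulation pipeline at all.
```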
