Readit News
gilbaz commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
kory · 6 years ago
The answer is no, since when you generate data, you either:

* Have a set of data as a "basis"

  1. This diminishes the "equalization" factor, since you need a lot of data to get a good approximation of the distribution anyway.

  2. You need to create a model based on that set, which mathematically should be close to the same problem as just building the target model.

* Have no (or only a small) training set to use

  1. You need to create a model that probably generates from some statistical distribution. Your target model will just learn that distribution.

  2. Your initial assumptions create a distribution, and that is not going to be the same distribution as real-world data. It may be painfully off-base. I've worked on this problem for months, and it's fairly difficult to get right even in an easy scenario (one modeled by simple statistical distributions); see the sketch below.

There are problems where generating data can work, but they're specific problems, or generation can only be used for rare edge cases that don't show up enough in a dataset. For the most difficult problems it is probably just as difficult to generate "correct" data as it is to build a model without real-world data.
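A minimal sketch of that second failure mode, assuming the generator is nothing more than a guessed statistical distribution; the lognormal "real" data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# What the world actually looks like (hypothetical, for illustration only).
real_data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# The modeler's assumption: a plain Gaussian. All "synthetic" training data
# is drawn from this guess, so any model fit to it can at best recover the
# guess, not the real distribution.
synthetic = rng.normal(0.0, 1.0, size=100_000)

print("real mean / std:     ", real_data.mean(), real_data.std())
print("synthetic mean / std:", synthetic.mean(), synthetic.std())
```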

gilbaz · 6 years ago
Yeah, but you're assuming that you know how they're creating this data. Just throwing an option out there: what if they created a latent space of 3D people and iteratively expanded it with GANs and real 2D image datasets? That would generalize (a rough skeleton of the idea is sketched below).

Just a thought, not sure what's really going on there, I just know that they probably have something interesting they're cooking up!

This is a really crazy vision.
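Purely as a thought experiment, here is a very loose skeleton of that kind of pipeline, assuming PyTorch; every class name and dimension is hypothetical, and nothing here reflects what the company actually does:

```python
import torch
import torch.nn as nn

class PersonGenerator(nn.Module):
    """Maps a latent code to parameters of a 3D person (pose, shape, texture)."""
    def __init__(self, latent_dim=128, param_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, param_dim),
        )

    def forward(self, z):
        return self.net(z)

class RenderDiscriminator(nn.Module):
    """Scores whether a 2D render could pass for a real photo (the GAN signal)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, img):
        return self.net(img)

# The hard, omitted piece: a renderer that turns generated 3D parameters into
# 2D images differentiably, so the discriminator's real-vs-rendered feedback
# can flow back into the generator and expand the latent space over time.
z = torch.randn(4, 128)
print(PersonGenerator()(z).shape)  # torch.Size([4, 300])
```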

gilbaz commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
omarhaneef · 6 years ago
This one time we wanted to partner with a medical research team in the university. They had data on a particular disease, and wanted to present it to users visually.

Now this was for the public good and we were going to fund the technology to display the data, and they would provide the data. This way people could assess how much various drugs could help them and what the outcomes were.

It was also thought that other researchers could find patterns in the data.

Suddenly the not-for-profit institute got cold feet because they would be "giving away" the data they had spent millions to acquire. Meanwhile we, a for-profit institute, were happy to fund our share as a public good.

They decided that, instead of giving away their data, they would give away simulated data. This, it was felt, would benefit the patients and researchers who might draw conclusions from the data.

Now these are PhDs at the top of their field. But, you know, it's sort of obvious that all they would do is reproduce their biases and make it so that no one else could challenge those biases. I mean, for you data science types, this is 101.

Ever since that experience, I have a distrust of simulated data.

gilbaz · 6 years ago
Def not for every situation. I was happy to see that these guys weren't starting from medical use-cases. Way too hard.
gilbaz commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
throwawaymath · 6 years ago
Information theoretically speaking, how do you generate a "synthetic" dataset (as the article calls it) with the same fidelity as an original dataset without having access to a critical basis set of the original? What would you do to obtain that fidelity? Extrapolate from sufficiently many independent conclusions drawn from the original?

And as a followup, if you can generate a synthetic dataset by extrapolating from sufficiently many independent conclusions drawn from the original (as opposed to having access to the original itself), would you still need to use such a dataset for training?

Things like Monte Carlo simulation can be used to approximate real world conditions, but they can't typically capture the full information density of organic data. For example, generating a ton of artificial web traffic for fraud analysis or incident response only captures a few dimensions of what real world user traffic captures.

The author talks about simulating data to focus on edge cases or avoid statistical bias, but I don't see how simulated data actually achieves that.

gilbaz · 6 years ago
Cool points -

"...original dataset without having access to a critical basis set of the original?"

I think that they're not trying to copy existing datasets but are trying to generate new datasets that solve various computer vision use-cases. Looks like they're using 3D photorealistic models and environments to then generate 2D data. It is a cool idea: if they had the ability to synthesize a large number of 3D people and objects, insert them into 3D environments in ways that made sense, and then run motion simulation, they could hypothetically create an incredible amount of high-quality data. Sounds pretty hard to do, honestly...

I think Monte Carlo is used for something very different from computer vision / machine learning. Monte Carlo is usually used to estimate an average result given many varying input variables and a simplified model of the problem. So if I want to estimate how far my paper airplane will fly and I have a simulator, I would vary the paper thickness, folds, and wind. Each time I run the simulator I get a result, and from many runs I can estimate the average distance the paper airplane would go! (Actually sounds like a fun project, lol.) Anyway, this is just different.
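A toy version of that paper-airplane estimate; flight_distance is a made-up formula rather than real aerodynamics, but it shows the pattern of varying the inputs, running many trials, and averaging:

```python
import random

def flight_distance(thickness_mm, n_folds, wind_ms):
    # Toy model: thicker paper and more folds shorten the glide, a tailwind helps.
    base = 8.0  # metres on a calm day with ideal paper
    return max(0.0, base - 3.0 * (thickness_mm - 0.1) - 0.4 * n_folds + 1.5 * wind_ms)

trials = 100_000
total = 0.0
for _ in range(trials):
    thickness = random.uniform(0.08, 0.20)  # paper thickness in mm
    folds = random.randint(5, 9)            # number of folds
    wind = random.gauss(0.0, 1.0)           # head/tail wind in m/s
    total += flight_distance(thickness, folds, wind)

print(f"estimated average flight distance: {total / trials:.2f} m")
```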

Simulation is good for edge cases because you can simulate them disproportionately to their prevalence in the real world. So let's say we're in a smart store and we want to recognize when an elderly person falls on the floor, so we can send human help to the right location. This happens maybe once in five years in a given store. If we were to gather data we might get 10 examples. If they can simulate this, they could simulate 100k elderly people falling and then train models to recognize it! Kind of crazy, really.
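A hedged sketch of that oversampling idea; render_scene and the event names are placeholders for a hypothetical simulator, not a real API:

```python
import random

def render_scene(event):
    # Placeholder for a hypothetical 3D rendering / motion-simulation pipeline.
    return {"label": event, "frames": f"<rendered {event} sequence>"}

def generate_dataset(n_samples, fall_fraction=0.3):
    # In the real world a fall might appear once in years of footage; in a
    # simulator we can deliberately over-represent it so models see enough.
    data = []
    for _ in range(n_samples):
        event = "elderly_fall" if random.random() < fall_fraction else "normal_shopping"
        data.append(render_scene(event))
    return data

dataset = generate_dataset(100_000)
print(sum(d["label"] == "elderly_fall" for d in dataset), "simulated fall examples")
```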

gilbaz commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
vonnik · 6 years ago
There are a couple points not generally made in discussions of data and the great AI race.

Most data is crap. So the mountains of data that are supposedly an advantage, in China or in the vaults of the large corporations, are not fit for purpose without a tremendous amount of data pre-processing, and even then... And that means the real chokepoint is data science talent, not data quantity. In other words, in many cases, the premises of this statement should be questioned.

Secondly, a lot of research is focused on few-shot, one-shot, or zero-shot learning. That is, the AI industry will make this constraint increasingly obsolete.

Thirdly, synthetic data is only as good as the assumptions you made while creating it. But how did you reach those assumptions? By what argument should they be treated as a source of truth? If you are making those assumptions based on your wide experience with real-world data, well then, we run into the same scarcity, mediated by the mind of the person creating the synthetic data.

gilbaz · 6 years ago
Good points!

Most data is def crap lol.

I don't see few-shot, one-shot, or zero-shot learning getting anywhere close to standard supervised learning for anything practical. It really doesn't make sense in production settings at all.

You have a function that you want to learn, let's say a mapping from an RGB image to a segmentation map. For most applications you're never really in a situation where a production product is dealing with visual scenes/objects it has never seen before. In factories, in smart stores, in cars, in AR scenarios, I just don't see it happening. And once that case is removed, I'm thinking: OK, so when can I get good enough results from a tiny dataset? Machine learning isn't magic; you're trying to learn a function with 100 million parameters from a dataset, and I just don't see the math working out. More data provides better results, because it adds more information with which to build a more relevant mapping function from input to output.
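To give a rough feel for the parameter counts involved (numbers vary a lot by architecture, and this assumes a recent torchvision is installed):

```python
import torchvision

# Construct a common off-the-shelf segmentation model without downloading
# trained weights (older torchvision versions use pretrained=False instead).
model = torchvision.models.segmentation.deeplabv3_resnet50(
    weights=None, weights_backbone=None
)
n_params = sum(p.numel() for p in model.parameters())
print(f"DeepLabV3 + ResNet-50 parameters: {n_params:,}")  # on the order of tens of millions
```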

The third point is great! As long as the models are somehow based on real-world scans, I think a lot of good can come from this. The funny thing is, there is so much bias in networks trained today precisely because the captured datasets are usually small and come from a specific area/population/setting. If you had a great synthetic data generation engine, you could at least generate equal representation across genders, age groups, ethnicities, etc.
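A small sketch of that "equal representation" idea; the attribute lists and synthesize_person are illustrative placeholders, not anyone's actual generation engine:

```python
from itertools import product
import random

GENDERS = ["female", "male"]
AGE_GROUPS = ["child", "adult", "elderly"]
SKIN_TONES = ["I", "II", "III", "IV", "V", "VI"]  # e.g. a Fitzpatrick-style scale

def synthesize_person(gender, age_group, skin_tone):
    # Placeholder for the hypothetical 3D-person synthesis step.
    return {"gender": gender, "age_group": age_group, "skin_tone": skin_tone}

def balanced_batch(per_combination=10):
    # Enumerate every attribute combination uniformly instead of inheriting
    # whatever mix a particular real-world capture setting happened to have.
    batch = []
    for gender, age, tone in product(GENDERS, AGE_GROUPS, SKIN_TONES):
        batch.extend(synthesize_person(gender, age, tone) for _ in range(per_combination))
    random.shuffle(batch)
    return batch

batch = balanced_batch()
print(len(batch), "synthetic people across",
      len(GENDERS) * len(AGE_GROUPS) * len(SKIN_TONES), "attribute combinations")
```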

Overall great points!

gilbaz commented on Simulated data: the great equalizer in the AI race?   medium.com/@charlesbrun/i... · Posted by u/cbrun
gilbaz · 6 years ago
I would definitely say that on one hand this is extremely hard to accomplish, and on the other, if it works, it would be a game changer!

A good simulation is the holy grail of AI. It solves the data bottleneck, provided the generated data generalizes to the real world. Let's see them prove that!
