I'm not sure the second is even a logical possibility, never mind a practical one.
Why? I think it's stranger to believe the opposite: that something as simple as a computer program, designed by something as simple as the human mind, should definitely be able to adequately simulate the complexity of the real world.
Now, this was for the public good: we were going to fund the technology to display the data, and they would provide the data. That way people could assess how much various drugs could help them and what the outcomes were.
It was also thought that other researchers could find patterns in the data.
Suddenly the not-for-profit institute got cold feet because they would be "giving away" the data they had spent millions to acquire. Meanwhile we, a for-profit institute, were happy to fund our share as a public good.
They decided that, instead of giving away their data, they would give away simulated data. This, it was felt, would benefit the patients and researchers who might draw conclusions from the data.
Now, these are PhDs at the top of their field. But, you know, it's sort of obvious that all they would do is reproduce their biases and make it so that no one else could challenge those biases. I mean, for you data science types, this is 101.
Ever since that experience, I have a distrust of simulated data.
And as a follow-up: if you can generate a synthetic dataset by extrapolating from sufficiently many independent conclusions drawn from the original (as opposed to having access to the original itself), would you still need to use such a dataset for training?
Things like Monte Carlo simulation can be used to approximate real-world conditions, but they typically can't capture the full information density of organic data. For example, generating a ton of artificial web traffic for fraud analysis or incident response only captures a few of the dimensions that real-world user traffic carries.
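To make that concrete, here's a minimal sketch of what I mean (Python/NumPy; every distribution and parameter is made up purely for illustration). The generator samples a few independent marginals, so by construction it has none of the joint structure that real user traffic carries:

    # Hypothetical Monte Carlo traffic generator: a handful of assumed,
    # independent distributions standing in for "web traffic".
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    synthetic_traffic = {
        # inter-arrival times: assumed Poisson process (exponential gaps)
        "inter_arrival_s": rng.exponential(scale=0.5, size=n),
        # response sizes: assumed log-normal
        "response_bytes": rng.lognormal(mean=8.0, sigma=1.2, size=n),
        # status codes: assumed fixed marginal probabilities
        "status": rng.choice([200, 301, 404, 500], size=n,
                             p=[0.92, 0.03, 0.04, 0.01]),
    }

    # Every column is independent by construction, so any fraud signal that
    # lives in the joint behaviour of real users simply isn't in this data.

Each column looks plausible on its own, but the correlations, sessions, and long-tail weirdness a fraud model actually needs are exactly what got left out.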
The author talks about simulating data to focus on edge cases or avoid statistical bias, but I don't see how simulated data actually achieves that.
Anyone in this space is well aware that the benefit of big data isn't just the amount of data; it's that it's a real, representative sample of the kind of data you are actually going to need to work with. Big data solves the problem of people being bad at creating simulations. To suggest simulations as a solution to big data is kinda getting the relationship backwards.
>For a lot of tasks the performance works well, but for extreme precision it will not fly — yet.
But we all knew that going in, didn't we?
Most data is crap. So the mountains of data that are supposedly an advantage, in China or in the vaults of the large corporations, are not fit for purpose without a tremendous amount of data pre-processing, and even then... And that means the real chokepoint is data science talent, not data quantity. In other words, in many cases, the premises of this statement should be questioned.
Secondly, a lot of research is focused on few-shot, one-shot, or zero-shot learning. That is, the AI industry will make this constraint increasingly obsolete.
Thirdly, synthetic data is only as good as the assumptions you made while creating it. But how did you reach those assumptions? By what argument should they be treated as a source of truth? If you made those assumptions based on your wide experience with real-world data, well then, we run into the same scarcity, just mediated by the mind of the person creating the synthetic data.
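A toy illustration of that last point (all numbers invented): suppose the person building the generator assumes a linear relationship. Anything you later fit to the synthetic data can only ever recover that linear assumption, even though the underlying process isn't linear:

    # The "real" process is quadratic; the generator's author assumed linear.
    import numpy as np

    rng = np.random.default_rng(1)

    # Real world (never shared): y = x^2 + noise
    x_real = rng.uniform(-3, 3, 500)
    y_real = x_real ** 2 + rng.normal(0, 0.5, 500)

    # The generator encodes the author's linear assumption.
    slope, intercept = np.polyfit(x_real, y_real, 1)
    x_syn = rng.uniform(-3, 3, 5000)
    y_syn = slope * x_syn + intercept + rng.normal(0, 0.5, 5000)

    # Fitting a quadratic to the synthetic data recovers ~zero curvature:
    # the assumption has become the ceiling of what anyone can learn from it.
    print(np.polyfit(x_syn, y_syn, 2))

The synthetic dataset can be arbitrarily large, but it never contains more structure than its author put into it.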
1. Totally agree with the point about data science talent being one of the bottlenecks. And in that area, the big tech companies already have the best and brightest workforce. "Most data is crap" is a bit of a stretch, but I agree that a lot of pre-processing is required. But more and more platform and/or services companies are filling this gap.
2. 100%. This is a short-term problem.
3. Agree to some extent. You can span latent spaces on top of an initial batch to overcome scarcity and use GANs to scale up. Data augmentation is not perfect, but it solves a lot of problems in production (rough sketch below). Also, specifically as it relates to computer vision, it's "easier" to know what the real world looks like and try to replicate it. It's hard in practice, but the assumptions are mostly agreed upon.
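To be concrete about the augmentation part (just a generic sketch, assuming torchvision; the particular transforms and parameters are arbitrary):

    # Standard on-the-fly image augmentation: cheap geometric and photometric
    # perturbations stretch a small labelled batch without changing labels.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.RandomRotation(degrees=10),
        transforms.ToTensor(),
    ])

    # augmented = augment(pil_image)  # pil_image: a PIL.Image from the dataset

Each epoch sees a slightly different view of the same underlying scenes, which is why this holds up in production even though it adds no genuinely new information.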