kendallpark · 4 years ago
Two cents as an MD-(CS)PhD student studying what I've heard referred to as "the last mile problem."

My stack trace of investigation:

- The model is good, we just need to get the doctors to trust the model.

- The model is good, we need to figure out how to build an informed trust in the model (so as to avoid automation bias).

- The model is good, we need informed trust, but we can't tackle the trust issue without first figuring out a realistic deployment scenario.

- The model is good, we need informed trust, we need a realistic deployment scenario, but there are some infrastructural issues that make deployment incredibly difficult.

After painstaking work with a real-life EHR system, I sanity-checked the model's inference against a realistic deployment scenario.

- Holy crap, the model is bad and not at all suitable for deployment. 0.95 AUC, subject of a previous publication, and fails on really obvious cases.

My summary so far of "why?": assumptions going into model training are wildly out of sync with the assumptions of deployment. It's "Hidden Tech Debt in ML" [1] on steroids.

[1] https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f...
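
A minimal sketch of this kind of deployment-scenario sanity check, on toy data with a stand-in model (illustrative only, not the actual EHR data or model): the same classifier that looks great on an internal test split can fall apart once the inputs are shifted the way a different site or pipeline would shift them.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Stand-in for the curated research dataset the model was trained on.
    X = rng.normal(size=(5000, 20))
    y = rng.integers(0, 2, size=5000)
    X[y == 1] += 0.8  # give the positives a learnable signal

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)

    # Stand-in for "deployment" data: same labels, but rescaled and shifted
    # features (different EHR mappings, units, imaging protocol, etc.).
    X_deploy = X_test * 1.5 + rng.normal(0.5, 0.2, size=X_test.shape)

    print("internal test AUC:  ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    print("deployment-like AUC:", roc_auc_score(y_test, model.predict_proba(X_deploy)[:, 1]))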

andy99 · 4 years ago
You've probably seen it, but a more recent, related paper (that I think has some of the same authors) about inherent features of modern ML that make models so fragile, even if they test OK:

Underspecification Presents Challenges for Credibility in Modern Machine Learning, D'Amour et al., https://arxiv.org/abs/2011.03395

kendallpark · 4 years ago
I had not seen it yet, excited to read it! Thanks!
MichaelMoser123 · 4 years ago
my question: why don't they just specify that this ML model has been trained with this type of medical equipment? Couldn't they make it part of the SLA to use the same type of equipment in the field as that used to obtain the training images?
Pokepokalypse · 4 years ago
"In theory, there is no difference between theory and practice. In practice, there is."

- Benjamin Brewster, 1882

lr1970 · 4 years ago
Or its variant:

"The difference between practice and theory is greater in practice than in theory."


fny · 4 years ago
Clickbait headline. He did not say it's a long way from use, but rather that it's challenging to ensure models translate well to real-world conditions.

Yes, it's a challenge, especially with vision models, but it's doable. Health care models I've worked on have been put into production, and they just need to be monitored to remain effective.
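
A minimal sketch of what that monitoring can look like (my own illustration, not the poster's setup): compare recent production inputs against the training distribution, feature by feature, and alert when they drift apart. The threshold is an operational choice.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_features = rng.normal(0.0, 1.0, size=(10_000, 5))  # reference data
    prod_features = rng.normal(0.3, 1.2, size=(1_000, 5))    # recent, drifted data

    ALERT_P_VALUE = 0.01
    for i in range(train_features.shape[1]):
        stat, p = ks_2samp(train_features[:, i], prod_features[:, i])
        status = "DRIFT" if p < ALERT_P_VALUE else "ok"
        print(f"feature {i}: KS={stat:.3f} p={p:.2g} -> {status}")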

dang · 4 years ago
Ok, we've replaced the title with something he actually said.
Der_Einzige · 4 years ago
While this is indeed clickbait, as others have mentioned, I am consistently shocked by how rarely the most common technique for ensuring that a model you trained works on unseen data, cross-validation, is used in the real world.

I had it drilled into my brain that I really shouldn't trust anything except the average validation score of a (preferably high-K) K-fold cross-validated model when trying to get an idea of how well my ML algorithm performs on unseen data. Apparently most people in my field (NLP) did not have this drilled into their heads. This is partly why NLP is filled with unreproducible scores: the magical score, if it was ever found at all, was only found on the seed #3690398 train/test split.

As far as I'm concerned, if you didn't cross-validate, the test set score is basically useless.
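
A minimal sketch of the contrast, using scikit-learn on toy data (the model, K, and seeds are arbitrary choices): one lucky split gives you a single number, K-fold gives you a mean and a spread.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    # One particular seed's train/test split: a single number that can get lucky.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3690398)
    single = accuracy_score(y_te, LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te))

    # K-fold CV: report the mean (and spread) across folds instead.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

    print(f"single split: {single:.3f}")
    print(f"10-fold CV:   {scores.mean():.3f} +/- {scores.std():.3f}")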

andy99 · 4 years ago
The point of the article is more that even if all of your testing and validation is rigorous and the performance looks great, trivial changes in the production data can break your model anyway.

My view is that all high value production models should include out of distribution detection, uncertainty quantification, and other model specific safeguards (like self-consistency) to confirm that the model is only being used to predict on data it is competent to handle.
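
A minimal sketch of that kind of gate, with an IsolationForest as a stand-in OOD detector and an arbitrary confidence threshold (toy data; real safeguards would be model- and domain-specific).

    import numpy as np
    from sklearn.ensemble import IsolationForest, RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(2000, 10))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    ood = IsolationForest(random_state=0).fit(X_train)  # learns the training distribution

    def guarded_predict(x, conf_threshold=0.8):
        """Return a label only if the input looks in-distribution and the model is confident."""
        x = x.reshape(1, -1)
        if ood.predict(x)[0] == -1:        # flagged as an outlier
            return None, "refused: out of distribution"
        proba = clf.predict_proba(x)[0]
        if proba.max() < conf_threshold:   # the model itself is unsure
            return None, "refused: low confidence"
        return int(proba.argmax()), "ok"

    print(guarded_predict(rng.normal(size=10)))        # typical input
    print(guarded_predict(rng.normal(size=10) + 8.0))  # shifted input -> refused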

Der_Einzige · 4 years ago
All that is only needed because incremental learning algorithms don't really work all that well. It's a dirty secret in the field that we still don't have good answers for catastrophic forgetting in neural networks (the best candidate incremental learner as of right now), and the other alternatives are far worse.
fighterpilot · 4 years ago
This is good to have but it doesn't really address the problem of predictive accuracy in the presence of nonstationarity. The safeguards just help us switch off the model at the right time. We're still stuck with no capability in the new environment.
eyegor · 4 years ago
This sounds an awful lot like Gaussian processes, which are fairly common in research environments. I don't know how common it is to deploy Gaussian processes in the real world, but I see published papers integrating them into other models all the time. The gist is that instead of input -> prediction, you get input -> prediction + sigma (every prediction is given as a Gaussian distribution).
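
A minimal sketch of the prediction + sigma idea with scikit-learn's GaussianProcessRegressor (toy 1-D data; the kernel choice is arbitrary). Inside the training range the predictive sigma is small; far outside it blows up, which is exactly the signal you'd want before trusting a prediction.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 5, size=(40, 1))
    y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=40)

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X_train, y_train)

    for x in [2.5, 10.0]:  # in-range vs. far out-of-range input
        mean, std = gp.predict(np.array([[x]]), return_std=True)
        print(f"x={x:5.1f}  prediction={mean[0]:+.2f}  sigma={std[0]:.2f}")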
erichahn · 4 years ago
No, this does not solve the problem that he describes in the article. You can have a great cross-validation score and still struggle on unseen data if that data is relatively dissimilar from your train set, like X-ray scans produced by a different machine. There are numerous other examples. CNNs on images, for example, are famously known to disintegrate on images + white noise (which look the same to a human).
Rexxar · 4 years ago
For the last example, could they train on "images + white noise" instead?
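
A minimal sketch of what that would look like as training-time augmentation, assuming images scaled to [0, 1] (stand-in data here). It hardens the model against the noise you inject, though not necessarily against corruptions you didn't anticipate.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment_with_noise(images, labels, std=0.05, copies=1):
        """Return the original images plus `copies` white-noise-corrupted versions."""
        noisy = [images + rng.normal(0.0, std, size=images.shape) for _ in range(copies)]
        X = np.concatenate([images, *noisy]).clip(0.0, 1.0)  # keep pixel range valid
        y = np.concatenate([labels] * (copies + 1))
        return X, y

    images = rng.random((8, 32, 32, 3))      # N x H x W x C, scaled to [0, 1]
    labels = rng.integers(0, 2, size=8)
    X_aug, y_aug = augment_with_noise(images, labels)
    print(X_aug.shape, y_aug.shape)           # (16, 32, 32, 3) (16,)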
fighterpilot · 4 years ago
I think cross-validation is a powerful tool but not always necessary, and it is prone to abuse, such as overfitting to the test set.

A skilled modeller can reduce variance using domain-specific tricks that are more powerful at variance reduction than cross-validation. But cross-validation is still usually good to use as well.

thinkloop · 4 years ago
I've said for a long time that we are currently in the fitbit stage of AI.

"Ng was responding to a question about why machine learning models trained to make medical decisions that perform at nearly the same level as human experts are not in clinical use"

"“It turns out,” Ng said, “that when we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions.”

But, he said, “It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of AI system to degrade significantly. In contrast, any human radiologist can walk down the street to the older hospital and do just fine."

milleramp · 4 years ago
It may also be a matter of different shared group cultures: different lessons learned and previous screwups passed down to the next generation. I saw this in aerospace as well.
systemvoltage · 4 years ago
AI has a much larger "An Executive heard it in a marketing pitch from an external consultant or at an AI conference" to "Engineering implementation" gap.
dekhn · 4 years ago
As somebody who has intentionally not published low-quality work, it amuses me to see people finally recognizing that a lot of the most highly regarded papers just don't have any real impact in the real world of life sciences/health research/medicine. What doesn't amuse me is the career success these folks have had by selling overfit, undergeneralizing models.
mark4 · 4 years ago
Winter is coming
Fordec · 4 years ago
Yeah hate to say it, but you're probably right. AI hype is the tech equivalent of a land war in Asia.

The slight consolation I find, though, is that this time when the fog clears it won't be a complete waste: we'll have much improved data engineering processes and data gathering methodologies, some scale improvements and better data parallelism, and a further sizeable portion of the research field will be cut off, put into the "that's not real AI" category, and used in production software. There will be doom-mongers, but if we come out of this with a much more professionalized interaction between software and the physical world, then it was all still worth it.

Come back again in 15 years and we'll find a new generation taking yet another crack at this, building on the missteps of the present.

dnautics · 4 years ago
100%. And I work for a company that deploys an AI model (I think? I've never actually run the model myself). It's pretty good. It hits 95% accuracy. It solves a real pain point for humans.

It's also squarely in a domain where it's obvious that this should work. We aren't even using "latest and greatest" ML algos; I'm pretty sure what we are using is cobbled-together ML from a few years ago, probably the "latest and greatest" from half a decade ago when ML was just kicking off.

But holy shit, there are so many interconnecting and annoying bits in the non-ML part of the stack (where I am). Our codebase has gotten rather messy (for understandable reasons) trying to negotiate leaky abstractions between different clients' needs and international standards (and we're only in 3 countries). And we have a very broken data pipeline (it works well enough to get the job done, but I don't sleep well at night) for making sure there are good pulls for the ML engineers to deal with. This is code written by folks who should know better about concepts like data gravity; it's just that when you're doing it hastily on startup timescales with startup labor, it's (understandably) not going to come out pretty. And all of this is why I haven't even had time to poke into the AI bits, not even to stand up an instance for localdev.

Supposedly our competitors aren't even using real AI, just mechanically Turked stuff. Yeah. Of course. The real, messy domain of dealing with these human systems is bad enough to sink a ton of money into before you even get to the point of having enough money to buy some expensive data scientists and ML engineers.

MichaelMoser123 · 4 years ago
"AI hype is the tech equivalent of a land war in Asia" - that's quotable! i did a web search on this phrase with several search engines, and it appeared just here.

I am still not convinced about an ML winter, as ML has found its killer app in advertising (I mean, previous AI generations didn't find an equivalent cash cow).

Also: why don't they just specify that this ML model has been trained with this type of medical equipment? Couldn't they make it part of the SLA to use the same type of equipment in the field as was used for the training images?


beforeolives · 4 years ago
Why winter instead of people filling in the gaps that exist in their processes?
grphtrdr · 4 years ago
Maybe in the West. I would have a hard time believing Chinese researchers feel this way.
sam_lowry_ · 4 years ago
And it will be the second AI winter.