OpenAI Five - Readit News

Disclosure: I work on Google Cloud (and vaguely helped with this).

For me, one of the most amazing things about this work is that a small group of people (admittedly well funded) can show up and do what used to be the purview of only giant corporations.

The 256 P100 optimizers are less than $400/hr. You can rent 128000 preemptible vcpus for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.

In days past, researchers would never have had access to this kind of computing unless they worked for a national lab. Now it's just a budgetary decision. We're getting closer to a (more) level playing field, and this is a wonderful example.

naturalgradient · 7 years ago

I would just want to comment that while this is true in principle, it's also slightly misleading because it does not include how much tuning and testing is necessary until one gets to this result.

Determining the scale needed, fiddling with the state/action/reward model, massively parallel hyper-parameter tuning.

I may be overestimating but I would reckon with hyper-parameter tuning and all that was easily in the 7-8 figure range for retail cost.

This is slightly frustrating in an academic environment when people tout results for just a few days of training (even with much smaller resources, say 16 gpus and 512 CPUs) when the cost of getting there is just not practical, especially for timing reasons. E.g. if an experiment runs 5 days, it doesn't matter that it doesnt use large scale resources, because realistically you need 100s of runs to evaluate a new technique and get it to the point of publishing the result, so you can only do that on a reasonable time scale if you actually have at least 10x the resources needed to run it.

Sorry, slightly off topic, but it's becoming a more and more salient point from the point of academic RL users.

boulos · 7 years ago

I hear you. I would say that this work is tantamount to what would normally be a giant NSF grant.

Depending on your institution, this is precisely why we (and other providers) give out credits though. Similar to Intel/NVIDIA/Dell donating hardware historically, we understand we need to help support academia.

abhgh · 7 years ago

This is a very good point. While the final model might be a weekend of training, getting there is a lot more iterations/work.

YeGoblynQueenne · 7 years ago

>> Toss in some more support GPUs and we're at maybe $2500/hr all in.

Amazing, indeed. That's only 5/8 of my entire travelling allowance, from my PhD studentship.

Hey, I'd even have some pocket money left over to go to a conference or two!

Deleted Comment

zmk_ · 7 years ago

This is more than many academic positions pay (or cost the uni) in a year; esp. in Europe. This an absurd amount of money/resources and more of a sign that this part of academia is not about outsmarting but outspending the "competition".

Cidan · 7 years ago

(I too work for Google Cloud)

I agree. One of the most amazing things about watching this project unfold is just how quickly it went from 0 to 100 with minimal overhead. It's amazing to watch companies and individuals push the boundaries of what is possible with just the push of a button.

Deleted Comment

malux85 · 7 years ago

Agree 100%, pay as you go compute has helped us tremendously. A large amount of our time is spent analysing results and interpreting models and the ability to power up and train a new topology without the huge cap-ex is the reason my company is still alive!

eachro · 7 years ago

I agree that 2500x48hrs is probably a reasonably cost to pay for these kind of sweet results. But it is a bit prohibitively expensive for an ML hobbyist to try to replicate in their own free time. I wonder if there is some way to do this w/o all the expensive compute. Pre-trained models is one step towards this, but so much of the learning(for the hobbyist) comes from struggling to get your RL model off the ground in the first place.

boulos · 7 years ago

It'd be interesting to see in the graphs (when the OpenAI team gets to them) how good you get at X hours in. Because if you're pretty good at X=4, that's still amazing.

Edit: I guess https://blog.openai.com/content/images/2018/06/bug-compariso... is approximately indicative (you currently need about 3 days to beat humans).

bcheung · 7 years ago

Transfer learning is about the best we can do right now. Using a fully trained ResNet / XCeptionNet and then tacking on your own layers after the end is within reach to hobbyists with just a single GPU on their desktop. There's still a decent amount of learning for the user even with pre-trained models.

yazr · 7 years ago

Quoting from the original article.

> This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines.

So this is game engine itself, taking up the CPUs. Maybe the DoTA code can be optimized x2 for self play?!

IIRC AlphaZero was about x10 more efficient than AlphaGo Zero due to algorithm improvement.

So overall, $100K for the final training run, which maybe can go down to $10K for a different domain of similar complexity.

boulos · 7 years ago

Interesting question! I assume in the Bot/headless mode, it's pretty optimized to skip the part needed for rendering, but you still need to do enough physics and other state update.

Best case, I'd assume at least a few ms per tick, because games become as complex as possible and still fit in 30 fps (33 ms, much of which is rendering, but still much happens regardless of producing pixels).

de_watcher · 7 years ago

> Maybe the DoTA code can be optimized x2 for self play?!

Please don't. Every time they change something, several other things break.

Ok, just kidding.

But their fix logs are really look like the game logic is built by adding a hack on top of a hack with no automatic testing. Everything seems to hold on the playtesting.

Deleted Comment

ionforce · 7 years ago

Does the approach scale at low scale though? Like, would this project only bear fruit when run at large scale?

Getting budgetary approval isn't easy for everyone. Especially with an unproven process. And even then, there could be a mistake in the pipeline. All that money down the drain.

boulos · 7 years ago

Good question! RL (and ML generally) definitely works better as you add more scale, but I still feel that this particular work is roughly "grand challenge" level. You shouldn't expect to just try this out as your first foray :).

I will note this paragraph from the post:

> RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.

which is mostly about the challenge of longer time horizons (and therefore LSTM related). If your problem is different / has a smaller space, I think this is soon going to be very approachable. That is, we recently demonstrated training ResNet-50 for $7.50.

There certainly exist a set of problems for which RL shouldn't cost you more than the value you get out of it, and for which you can demonstrate enough likelihood of success. RL itself though is still at the bleeding edge of ML research, so I don't consider it unusual that it's unproven.

windows_tips · 7 years ago

Depends on the function you're approximating.

Deleted Comment

masonicb00m · 7 years ago

Great work! Having access to this scale of computing for so cheap really is amazing

So as someone working in reinforcement learning who has used PPO a fair bit, I find this quite disappointing from an algorithmic perspective.

The resources used for this are almost absurd and my suspicion is, especially considering [0], that this comes down to an incredibly expensive random search in the policy space. Or rather, I would want to see a fair bit of analysis to be shown otherwise.

Especially given all the work in intrinsic motivation, hierarchical learning, subtask learning, etc, the sort of intermediate summary of most of these papers from 2015-2018 is that so many of these newer heuristics are too brittle/difficult to make work, so we resort to slightly-better-than brute force.

https://arxiv.org/abs/1803.07055

gdb · 7 years ago

(I work at OpenAI on the Dota team.)

Dota is far too complex for random search (and if that weren't true, it would say something about human capability...). See our gameplay reel for an example of some of the combos that our system learns: https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be. Our system learns to generalize behaviors in a sophisticated way.

What I personally find most interesting here is that we see qualitatively different behavior from PPO at large scale. Many of the issues people pointed to as fundamental limitations of RL are not truly fundamental, and are just entering the realm of practical with modern hardware.

We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding.

naturalgradient · 7 years ago

Thank you for taking the time to respond, I appreciate it.

Well I guess my question regarding the expensiveness comes down to wondering about the sample efficiency, i.e. are there not many games that share large similar state trajectories that can be re-used? Are you using any off-policy corrections, e.g. IMPALA style?

Or is that just a source off noise that is too difficult to deal with and/or the state space is so large and diverse that that many samples are really needed? Maybe my intuition is just way off, it just feels like a very very large sample size.

Reminds me slightly of the first version of the non-hierarchical TensorFlow device placement work which needed a fair bit of samples, and a large sample efficiency improvement in the subsequent hierarchical placer. So I recognise there is large value in knowing the limits of a non-hierarchical model now and subsequent models should rapidly improve sample efficiency by doing similar task decomposition?

backpropaganda · 7 years ago

> Dota is far too complex for random search

Why? We know that random search is smart enough to find a solution if given arbitrarily large computation. So, that random search is not smart enough for Dota with the computational budget you used, is not obvious. Maybe random search would work with 2x your resources? Maybe something slightly smarter than random search (simulated annealing) would work with 2x your resources?

> and if that weren't true, it would say something about human capability

No it would not. A human learning a game by playing a few thousand games is a very different problem than a bot using random search over billions of games. The policy space remains large, and the human is not doing a dumb search, because the human does not have billions of games to work with.

> See our gameplay reel for an example of some of the combos that our system learns

> Our system learns to generalize behaviors in a sophisticated way.

You're underestimating random search. It's ironic, because you guys did the ES paper.

Eridrus · 7 years ago

> If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it.

Are there that many domains for which this is relevant?

Game AI seems to be the most obvious case and, on a tangent, I did find it kind of interesting that DeepMind was founded to make AI plug and play for commercial games.

But unless Sim-to-Real can be made to work it seems pretty narrow. So it sort of seems like exchanging one research problem (sample-efficient RL) for another.

Not to say these results aren't cool and interesting, but I'm not sold on the idea that this is really practical yet.

YeGoblynQueenne · 7 years ago

>> Our system learns to generalize behaviors in a sophisticated way.

Could you elaborate? One of the criticisms of RL and statistical machine learning in general is that models generalise extremely poorly, unless provided with unrealistic amounts of training data.

fjsolwmv · 7 years ago

Why Dota and not something like adverse-weather helicopter flying which is more "useful"?

obastani · 7 years ago

I think the "simple random search" algorithm in the paper you linked is not so simple -- it's basically using numerical gradient descent with a few bells and whistles invented by the reinforcement learning community in the past few decades. So perhaps it would be more fair to say that gradient descent (not random search) has proven to be a pretty solid foundation for model-free reinforcement learning.

naturalgradient · 7 years ago

Yes, I am aware, I did not mean random search as in random actions, but random search with improved heuristics to find a policy.

The point being that that the bells and whistles of PPO and other relatively complaticated algorithms (e.g. Q-PROP), namely the specific clipped objective, subsampling, and a (in my experience) very difficult to tune baseline using the same objective, do not significantly improve over gradient descent.

And I think Ben Recht's arguments [0] expands on that a bit in terms of what we are actually doing with policy gradient (not using a likelihood ratio model like in PPO) but still conceptually similar enough for the argument to hold.

So I think it comes down to two questions: How much do 'modern' policy gradient models improve on REINFORCE, and how much better is REINFORCE really than random search? The answer thus far seemed to be: not that much better, and I am trying to get a sense of if this was a wrong intuition.

[0] http://www.argmin.net/2018/02/20/reinforce/