Disclosure: I work on Google Cloud (and vaguely helped with this).
For me, one of the most amazing things about this work is that a small group of people (admittedly well funded) can show up and do what used to be the purview of only giant corporations.
The 256 P100 optimizers are less than $400/hr. You can rent 128000 preemptible vcpus for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.
In days past, researchers would never have had access to this kind of computing unless they worked for a national lab. Now it's just a budgetary decision. We're getting closer to a (more) level playing field, and this is a wonderful example.
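(For anyone who wants to sanity-check those numbers, here is the back-of-envelope math as a tiny script. The prices are the approximate figures quoted in this thread, and the "support GPUs" line is my own assumption chosen to land near the ~$2,500/hr estimate — not an official quote.)

```python
P100_RATE = 400 / 256            # ~$1.56/hr per preemptible P100 (approximate)
VCPU_RATE = 1280 / 128_000       # ~$0.01/hr per preemptible vCPU (approximate)

optimizers   = 256 * P100_RATE       # ~$400/hr
rollout_cpus = 128_000 * VCPU_RATE   # ~$1,280/hr
support_gpus = 820                   # assumed, to land near the ~$2,500/hr estimate

hourly  = optimizers + rollout_cpus + support_gpus
weekend = hourly * 48                # a ~48-hour weekend of training

print(f"~${hourly:,.0f}/hr, ~${weekend:,.0f} for a weekend run")
# -> ~$2,500/hr, ~$120,000 for a weekend run
```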
I would just want to comment that while this is true in principle, it's also slightly misleading, because it does not include how much tuning and testing is necessary before one gets to this result.
Determining the scale needed, fiddling with the state/action/reward model, massively parallel hyper-parameter tuning.
I may be overestimating, but I would reckon that with hyper-parameter tuning and all of that, this was easily in the 7-8 figure range at retail cost.
This is slightly frustrating in an academic environment, where people tout results from just a few days of training (even with much smaller resources, say 16 GPUs and 512 CPUs) when the cost of getting there is just not practical, especially for timing reasons. E.g. if an experiment runs 5 days, it doesn't matter that it doesn't use large-scale resources: realistically you need 100s of runs to evaluate a new technique and get it to the point of publishing the result, so you can only do that on a reasonable time scale if you actually have at least 10x the resources needed for a single run.
Sorry, slightly off topic, but it's becoming a more and more salient point from the point of view of academic RL users.
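(To make the timing point concrete, a quick sketch. The run counts and cluster sizes here are made-up illustrative numbers, not anything from the post.)

```python
import math

days_per_run    = 5      # assumed duration of one full experiment
runs_needed     = 300    # assumed: sweeps + ablations + seeds before publication
concurrent_runs = 3      # how many full-scale runs your cluster fits at once

calendar_days = math.ceil(runs_needed / concurrent_runs) * days_per_run
print(f"{calendar_days} days ~= {calendar_days / 365:.1f} years of wall clock")
# -> 500 days ~= 1.4 years; with 10x the resources it drops to ~50 days.
```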
I hear you. I would say that this work is tantamount to what would normally be a giant NSF grant.
Depending on your institution, this is precisely why we (and other providers) give out credits though. Similar to Intel/NVIDIA/Dell donating hardware historically, we understand we need to help support academia.
This is more than many academic positions pay (or cost the uni) in a year, especially in Europe. This is an absurd amount of money/resources, and more a sign that this part of academia is not about outsmarting but outspending the "competition".
Amazing, indeed. That's only 5/8 of my entire travelling allowance, from my PhD studentship.
Hey, I'd even have some pocket money left over to go to a conference or two!
I agree. One of the most amazing things about watching this project unfold is just how quickly it went from 0 to 100 with minimal overhead. It's amazing to watch companies and individuals push the boundaries of what is possible with just the push of a button.
Agree 100%, pay as you go compute has helped us tremendously. A large amount of our time is spent analysing results and interpreting models and the ability to power up and train a new topology without the huge cap-ex is the reason my company is still alive!
I agree that $2,500 x 48 hrs is probably a reasonable cost to pay for these kinds of sweet results. But it is prohibitively expensive for an ML hobbyist to try to replicate in their own free time. I wonder if there is some way to do this without all the expensive compute. Pre-trained models are one step towards this, but so much of the learning (for the hobbyist) comes from struggling to get your RL model off the ground in the first place.
It'd be interesting to see in the graphs (when the OpenAI team gets to them) how good you get at X hours in. Because if you're pretty good at X=4, that's still amazing.
Transfer learning is about the best we can do right now. Using a fully trained ResNet / XCeptionNet and then tacking on your own layers after the end is within reach to hobbyists with just a single GPU on their desktop. There's still a decent amount of learning for the user even with pre-trained models.
Edit: I guess https://blog.openai.com/content/images/2018/06/bug-compariso... is approximately indicative (you currently need about 3 days to beat humans).
> This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines.
So it's the game engine itself taking up the CPUs. Maybe the DoTA code can be optimized x2 for self play?!
IIRC AlphaZero was about x10 more efficient than AlphaGo Zero due to algorithm improvements.
So overall, $100K for the final training run, which can maybe go down to $10K for a different domain of similar complexity.
Interesting question! I assume that in the Bot/headless mode it's pretty well optimized to skip the parts needed only for rendering, but you still need to do enough physics and other state updates.
Best case, I'd assume at least a few ms per tick, because games tend to become as complex as they can while still fitting in 30 fps (33 ms per frame, much of which is rendering, but plenty still happens regardless of producing pixels).
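(Rough arithmetic on what that budget implies for headless self-play speed. The per-tick logic cost below is an assumption; only the 30 ticks/s figure is Dota 2's actual server tick rate.)

```python
TICKS_PER_SECOND  = 30     # Dota 2 server tick rate
LOGIC_MS_PER_TICK = 3.0    # assumed cost of logic/physics with rendering skipped

sim_ticks_per_second = 1000 / LOGIC_MS_PER_TICK           # ~333 simulated ticks/s
speedup_vs_realtime  = sim_ticks_per_second / TICKS_PER_SECOND

print(f"~{speedup_vs_realtime:.0f}x real time per core")  # ~11x
# Halving the per-tick logic cost ("optimize 2x for self play") roughly halves
# the rollout CPU bill, since the headless workers are logic-bound.
```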
> Maybe the DoTA code can be optimized x2 for self play?!
Please don't. Every time they change something, several other things break.
Ok, just kidding.
But their fix logs really make it look like the game logic is built by piling hack on top of hack with no automated testing. Everything seems to be held together by playtesting.
Does the approach work at small scale, though? Like, would this project only bear fruit when run at large scale?
Getting budgetary approval isn't easy for everyone. Especially with an unproven process. And even then, there could be a mistake in the pipeline. All that money down the drain.
Good question! RL (and ML generally) definitely works better as you add more scale, but I still feel that this particular work is roughly "grand challenge" level. You shouldn't expect to just try this out as your first foray :).
I will note this paragraph from the post:
> RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.
which is mostly about the challenge of longer time horizons (and therefore LSTM-related). If your problem is different / has a smaller space, I think this is soon going to be very approachable. Case in point: we recently demonstrated training ResNet-50 for $7.50.
There certainly exist a set of problems for which RL shouldn't cost you more than the value you get out of it, and for which you can demonstrate enough likelihood of success. RL itself though is still at the bleeding edge of ML research, so I don't consider it unusual that it's unproven.
So as someone working in reinforcement learning who has used PPO a fair bit, I find this quite disappointing from an algorithmic perspective.
The resources used for this are almost absurd, and my suspicion is, especially considering [0], that this comes down to an incredibly expensive random search in the policy space. Or rather, I would want to see a fair bit of analysis before being convinced otherwise.
Especially given all the work on intrinsic motivation, hierarchical learning, subtask learning, etc., the rough intermediate summary of most of these papers from 2015-2018 is that so many of the newer heuristics are too brittle/difficult to make work, so we resort to slightly-better-than-brute-force.
[0] https://arxiv.org/abs/1803.07055
Dota is far too complex for random search (and if that weren't true, it would say something about human capability...). See our gameplay reel for an example of some of the combos that our system learns: https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be. Our system learns to generalize behaviors in a sophisticated way.
What I personally find most interesting here is that we see qualitatively different behavior from PPO at large scale. Many of the issues people pointed to as fundamental limitations of RL are not truly fundamental, and are just entering the realm of practical with modern hardware.
We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding.
Thank you for taking the time to respond, I appreciate it.
Well, I guess my question regarding the expense comes down to wondering about the sample efficiency, i.e. are there not many games that share large, similar state trajectories that could be re-used? Are you using any off-policy corrections, e.g. IMPALA-style?
Or is that just a source of noise that is too difficult to deal with, and/or is the state space so large and diverse that that many samples are really needed? Maybe my intuition is just way off; it just feels like a very, very large sample size.
It reminds me slightly of the first version of the non-hierarchical TensorFlow device placement work, which needed a fair number of samples, followed by a large sample-efficiency improvement in the subsequent hierarchical placer. So I recognise there is real value in knowing the limits of a non-hierarchical model now, and subsequent models should rapidly improve sample efficiency by doing similar task decomposition?
Why? We know that random search is smart enough to find a solution if given arbitrarily large computation. So it is not obvious that random search is not smart enough for Dota with the computational budget you used. Maybe random search would work with 2x your resources? Maybe something slightly smarter than random search (simulated annealing) would work with 2x your resources?
> and if that weren't true, it would say something about human capability
No it would not. A human learning a game by playing a few thousand games is a very different problem than a bot using random search over billions of games. The policy space remains large, and the human is not doing a dumb search, because the human does not have billions of games to work with.
> See our gameplay reel for an example of some of the combos that our system learns
> Our system learns to generalize behaviors in a sophisticated way.
You're underestimating random search. It's ironic, because you guys did the ES paper.
> If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it.
Are there that many domains for which this is relevant?
Game AI seems to be the most obvious case and, on a tangent, I did find it kind of interesting that DeepMind was founded to make AI plug and play for commercial games.
But unless Sim-to-Real can be made to work it seems pretty narrow. So it sort of seems like exchanging one research problem (sample-efficient RL) for another.
Not to say these results aren't cool and interesting, but I'm not sold on the idea that this is really practical yet.
>> Our system learns to generalize behaviors in a sophisticated way.
Could you elaborate? One of the criticisms of RL and statistical machine learning in general is that models generalise extremely poorly, unless provided with unrealistic amounts of training data.
I think the "simple random search" algorithm in the paper you linked is not so simple -- it's basically using numerical gradient descent with a few bells and whistles invented by the reinforcement learning community in the past few decades. So perhaps it would be more fair to say that gradient descent (not random search) has proven to be a pretty solid foundation for model-free reinforcement learning.
Yes, I am aware, I did not mean random search as in random actions, but random search with improved heuristics to find a policy.
The point being that the bells and whistles of PPO and other relatively complicated algorithms (e.g. Q-PROP), namely the specific clipped objective, subsampling, and a (in my experience) very difficult-to-tune baseline using the same objective, do not significantly improve over gradient descent.
And I think Ben Recht's arguments [0] expand on that a bit in terms of what we are actually doing with policy gradient (not using a likelihood ratio model like in PPO) but still conceptually similar enough for the argument to hold.
So I think it comes down to two questions: how much do 'modern' policy gradient methods improve on REINFORCE, and how much better is REINFORCE really than random search? The answer thus far seemed to be: not that much better, and I am trying to get a sense of whether that was a wrong intuition.
[0] http://www.argmin.net/2018/02/20/reinforce/
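(For anyone following the objective-function part of this argument, here is what is actually being compared — a minimal numpy sketch of the standard per-timestep objectives, not OpenAI Five's training code.)

```python
import numpy as np

def reinforce_objective(logp, adv):
    # Vanilla policy gradient: maximize E[log pi(a|s) * advantage]
    return logp * adv

def ppo_clip_objective(logp, logp_old, adv, eps=0.2):
    # PPO clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    # where r = pi(a|s) / pi_old(a|s) is the likelihood ratio.
    ratio = np.exp(logp - logp_old)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# The clip is the main extra moving part: it keeps each update from pushing the
# policy far away from the one that collected the data.
```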
This article (like pretty much all from OpenAI) is really well done. I love the format and supporting material - makes it waay more digestible and fun to read in comparison to something from arxiv. The video breakdowns really drive the results home.
Good point - but I think that the difference is valuable. If that is the average person's first touch point with the content, then it would do a better job of making it accessible than a technical paper. Agreed that a follow-up detailed post or paper would be awesome!
Far too many hyperlinks though. Who clicks on hyperlinks for words like "defeat", "complex", "train" and "move"? Seems like if I link them then they'll link me and we'll all get higher ranking search results. Maybe I'm the only one who gets annoyed by this.
It is essentially the same frequency of links that you'd see on any Wikipedia article. In a field where there is an enormous amount of jargon, it is probably a good thing that they clearly define as much as possible.
This is a really interesting writeup, especially if you know a bit more about how Dota works.
That it managed to learn creep blocking from scratch was really surprising for me. To creep block you need to go out of your way to stand in front of the creeps and consciously keep doing so until they reach their destination. Creep blocking just a bit is almost imperceptible and you need to do it all the way to get a big reward out of it.
I also wonder if their reward function directly rewarded good lane equilibrium, or if that came indirectly from the other reward functions.
It's not really "from scratch". The bots are rewarded for the number of creeps they block, so it's not impossible that they would find some behavior to influence this score.
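(Purely for illustration, a shaped reward of that sort might look like the sketch below; the term names and weights are invented here, not OpenAI Five's actual coefficients.)

```python
# Hypothetical shaped reward: a weighted sum of per-tick game events.
# Term names and weights are invented for illustration only.
def shaped_reward(events):
    weights = {
        "last_hits":            0.2,
        "denies":               0.15,
        "kills":                1.0,
        "deaths":              -1.0,
        "creep_block_progress": 0.05,   # small credit per unit of blocking, as discussed
    }
    return sum(w * events.get(name, 0.0) for name, w in weights.items())

# Lane equilibrium need not have its own term: it can emerge indirectly if
# blocking creeps and securing last hits are what actually pay off.
```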
They are using preemptible CPUs/GPUs on Google Compute Engine for model training? Interesting. The big pro of that is cost efficiency, which isn't something I expected OpenAI to be optimizing. :P
How does training RL with preemptible VMs work when they can shut down at any time with no warning? A PM of that project asked me the same question a while ago (https://news.ycombinator.com/item?id=14728476) and I'm not sure model checkpointing works as well for RL. (Maybe after each episode?)
Cost efficiency is always important, regardless of your total resources.
The preemptibles are just used for the rollouts — i.e. to run copies of the model and the game. The training and parameter storage is not done with preemptibles.
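(A sketch of why that split makes preemption tolerable. `param_store`, `replay_queue`, and `env` are hypothetical stand-ins, not OpenAI's actual infrastructure.)

```python
def rollout_worker(param_store, replay_queue, env, steps_per_push=256):
    """Runs on a preemptible VM: only in-flight experience is lost on preemption."""
    while True:
        policy = param_store.get_latest()      # weights live on non-preemptible machines
        obs, transitions = env.reset(), []
        for _ in range(steps_per_push):
            action = policy.act(obs)
            next_obs, reward, done = env.step(action)
            transitions.append((obs, action, reward, done))
            obs = env.reset() if done else next_obs
        replay_queue.push(transitions)         # optimizers consume from this queue
        # If the VM disappears mid-loop, at most steps_per_push steps of rollout
        # data are lost; parameters and optimizer state are untouched.
```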
If these (or other similar) experiments show the viability of this network architecture, the cost could be decreased a lot by developing even more specialized hardware.
Also one could look at the cost of the custom development of bots and AIs using other more specialized techniques: sure, it might require more processing power to train this network, but it will not require as much specialized human interaction to adapt this network to a different task. In which case, the human labor cost is decreased significantly, even if initial processing costs are higher. So in a way you guys do actually optimize cost efficiency.
Disclosure: I work on Google Cloud (and with OpenAI), though I'm not a PM :).
As gdb said below, the GPUs doing the training aren't preemptible. Just the workers running the game (which don't need GPUs).
I'm surprised you felt cost isn't interesting. While OpenAI has lots of cash, that doesn't mean they shouldn't do 3-5x more computing for the same budget. The 256 "optimizers" cost less than $400/hr, while if you were using regular cores the 128k workers would be over $6k/hr. So using preemptible is just the responsible choice :).
There's lots of low hanging fruit in any of these setups, and OpenAI is executing towards a deadline, so they need to be optimizing for their human time. That said, I did just encourage the team to consider checkpointing the DOTA state on preemption, to try to eke out even more utilization. Similarly, being tighter on the custom shapes is another 5-10% "easily".
Don't forget, they're hiring!
>OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.
A bit disappointing, it would be very cool to see what kind of communication they'd develop.
It would be interesting to see whether, when one agent declines to help another several times, the other one would decide against helping him when he calls. The logical explanation would then be that the agent has come to value his own life more than his comrade's (because he has been helping, and his comrade has refused several times). The human explanation would be that he refuses to help out of spite. It could even lead to those two agents "hating" each other, though it would be more like cold calculation.
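(For concreteness, the "team spirit" blend quoted above comes down to something like the sketch below. The linear annealing schedule is an assumption; the post only says the value is annealed from 0 to 1 over training.)

```python
def blended_rewards(individual_rewards, team_spirit):
    """Mix each hero's own reward with the team average, per the quoted description."""
    team_avg = sum(individual_rewards) / len(individual_rewards)
    return [(1 - team_spirit) * r + team_spirit * team_avg
            for r in individual_rewards]

def team_spirit_at(step, total_steps):
    return min(1.0, step / total_steps)   # assumed linear anneal from 0 to 1

print(blended_rewards([1.0, 0.0, 0.0, 0.0, 0.0], team_spirit=0.5))
# -> [0.6, 0.1, 0.1, 0.1, 0.1]: with team spirit at 0.5, a solo kill's reward
# is already shared substantially with the other four heroes.
```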
I wanted to add the observation that all the restricted heroes are ranged. Necrophos, Sniper, Viper, Crystal Maiden, and Lich.
Since playing a lane as a ranged hero is very different from playing the same lane as a melee hero, I wonder whether the AI has learned to play melee heroes yet.
Not only are they ranged, but this lineup is very snowball-oriented, i.e. the optimal play style with this kind of lineup is to gain a small advantage in the early game and then keep pushing towers together aggressively. The middle-to-late game doesn't really matter. Whoever wins the early game wins the game. And we do know that bots are going to be good at early game last hitting.
I've played DotA for over 10 years so this development is quite relevant to me. So excited to see this next month!
Although it's extremely impressive, all the restrictions will definitely make this less appealing to the audience (as seen in the Reddit thread comments).