suddenlybananas · 5 months ago
An important caveat from the paper

>Moreover, we follow previous work in accelerating block breaking because learning to hold a button for hundreds of consecutive steps would be infeasible for stochastic policies, allowing us to focus on the essential challenges inherent in Minecraft.

toxik · 5 months ago
Like all things RL, it is 99.9% about engineering the environment and rewards. As one of the authors stated elsewhere here, there is a reward for completing each of 12 steps necessary to find diamonds.

Mostly I'm tired of RL work being oversold by its authors and proponents by anthropomorphizing its behaviors. All while this "agent" cannot reliably learn to hold down a button, literally the most basic interaction of the game.

red75prime · 5 months ago
The "no free lunch" theorem. You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours[1].

While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).

[1] And you certainly can't reproduce the observation selection effect in a laboratory. That is the thing that makes it possible to overcome the "no free lunch" theorem: our existence and intelligence are conditional on evolution being possible and finding the right biases.

We have to bake in inductive biases to get results. We have to incentivize behaviors useful (or interesting) to us to get useful results instead of generic exploration.

LPisGood · 5 months ago
When I was a child and first played Minecraft I clicked instead of held and after 10 minutes I gave up, deciding that Minecraft was too hard.
freeone3000 · 5 months ago
RL is useful for action selection and planning. Actually determining the mechanics of the game can be achieved with explicit instruction and definition of an action set.

I suppose whether you find this result intriguing or not depends on whether you’re looking to build result-building planning agents over an indeterminate (and sizable!) time horizon, in which case this is a SOTA improvement and moderately cool, or whether you’re looking for a god in the machine, which this is not.

SpaceManNabs · 5 months ago
If you have an alternative for RL in these use cases, please feel free to share.

When RL works, it really works.

The only alternative I have seen is deep networks with MCTS; they ramp up to decent quality quickly, but they hit a ceiling relatively fast.

o11c · 5 months ago
And a relevant piece of ancient wisdom (exact date not known, but presumably before 1970):

> In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

> “What are you doing?”, asked Minsky.

> “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.

> “Why is the net wired randomly?”, asked Minsky.

> “I do not want it to have any preconceptions of how to play”, Sussman said.

> Minsky then shut his eyes.

> “Why do you close your eyes?”, Sussman asked his teacher.

> “So that the room will be empty.”

> At that moment, Sussman was enlightened.

lgeorget · 5 months ago
Well, to be fair... I (a human) had to look it up online the first time I played as well. I was repeatedly clicking on the same tree for an entire minute before that. I even tried several different trees just in case.
fusionadvocate · 5 months ago
But it is possible to discover by holding down the button and realizing the block is getting progressively more "scratched".
kharak · 5 months ago
In my mind, this generalizes to the same problem with other non-stochastic (deterministic) operations like logical conclusions (A => B) .

I have a running bet with a friend that humans encode deterministic operations in neural networks too, while he thinks there has to be another process at play. But there might be something extra helping our neural networks learn the strong weights required for it. Or the answer is again: "more data".

FrustratedMonky · 5 months ago
"accelerating block breaking because learning to hold a button for hundreds of consecutive steps "

This is fine, and does not impact the importance of figuring out the steps.

Anybody who has done tuning on systems that run at different speeds knows that adjusting for the speed difference is just engineering, and it lets you get on with more important/inventive work.

JohnKemeny · 5 months ago
I'm not sure it's a serious caveat if the "hint" or "control" is in the manual.
suddenlybananas · 5 months ago
Sorry, I don't quite follow what you mean?
Hamuko · 5 months ago
Turns out that AI are much better at playing video games if they're allowed to cheat.
thesz · 5 months ago
"It allows AI to understand its physical environment and also to self-improve over time, without a human having to tell it exactly what to do."
ks1723 · 5 months ago
In my view, the 'exactly' is crucial here. They do implicitly tell the model what to do by encoding it in the reward function:

In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.

This is also why I think the title of the article is slightly misleading.

Animats · 5 months ago
Key to Dreamer’s success, says Hafner, is that it builds a model of its surroundings and uses this ‘world model’ to ‘imagine’ future scenarios and guide decision-making.

Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?

Machine learning with world models is very interesting, and the people doing it don't seem to say much about what the models look like. The Google manipulation work talks endlessly about the natural language user interface, but when they get to motion planning, they don't say much.

danijar · 5 months ago
Yes, you can decode the imagined scenarios into videos and look at them. It's quite helpful during development to see what the model gets right or wrong. See Fig. 3 in the paper: https://www.nature.com/articles/s41586-025-08744-2
Animats · 5 months ago
So, prediction of future images from a series of images. That makes a lot of sense.

Here's the "full sized" image set.[1] The world model is low-rez images. That makes sense. Ask for too much detail and detail will be invented, which is not helpful.

[1] https://media.springernature.com/full/springer-static/image/...

lnsru · 5 months ago
I implemented an acoustic segmentation system in an FPGA recently. The whole world model was a long list of known events and states with feasible transitions, plus novel things not observed before. Basically a rather dumb state machine with a machine learning part attached to acoustic sensors. Of course, both parts could be hidden behind weights, but the state machine was easily readable, and that was its biggest advantage.
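
That known-events-plus-novelty design could be sketched roughly like this (the event and state names are made up for illustration; the real system is an FPGA design, not Python):

```python
# Toy sketch of a readable world model: a transition table over known
# (state, event) pairs, with anything outside the table flagged as novel.
TRANSITIONS = {
    ("idle", "motor_start"): "running",
    ("running", "motor_stop"): "idle",
    ("running", "bearing_noise"): "degraded",
    ("degraded", "motor_stop"): "idle",
}

def advance(state, event):
    """Return (next_state, is_novel). Unknown transitions keep the
    current state and flag the event as not observed before."""
    nxt = TRANSITIONS.get((state, event))
    if nxt is None:
        return state, True  # novel event -> hand off for inspection
    return nxt, False

assert advance("idle", "motor_start") == ("running", False)
assert advance("running", "grinding") == ("running", True)  # outside the table
```

The readability advantage is exactly that the table can be audited line by line, unlike weights.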
mnky9800n · 5 months ago
Why would an accounting system need acoustic sensors?
jtsaw · 5 months ago
I’d say it’s more like Waymo’s world model. The main actor uses a latent vector representation of the state of the game to make decisions. This latent vector at train time is meant to compress a bunch of useful information about the game. So while you can’t really understand the actual latent vector that represents state, you do know it encodes at least the state of the game.

This world model stuff is only possible in environments that are sandboxed, i.e. where you can represent the state of the world and have a way of producing the next state given a current state and an action. Think Atari games, robot simulations, etc.
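
A toy version of that latent rollout idea, with stand-in linear dynamics and invented dimensions (Dreamer's actual world model is a learned recurrent state-space model, not this):

```python
import math
import random

random.seed(0)
LATENT, ACTION = 4, 2  # illustrative sizes, not the paper's

# Stand-in for a learned transition model: z' = tanh(A z + B a).
A = [[random.gauss(0, 0.3) for _ in range(LATENT)] for _ in range(LATENT)]
B = [[random.gauss(0, 0.3) for _ in range(ACTION)] for _ in range(LATENT)]

def step_latent(z, a):
    """One imagined transition in latent space."""
    return [math.tanh(sum(A[i][j] * z[j] for j in range(LATENT)) +
                      sum(B[i][k] * a[k] for k in range(ACTION)))
            for i in range(LATENT)]

def imagine(z0, actions):
    """Roll the model forward in latent space, never touching the environment."""
    z, traj = z0, []
    for a in actions:
        z = step_latent(z, a)
        traj.append(z)
    return traj

plan = [[1.0, 0.0]] * 5                  # a candidate 5-step action sequence
traj = imagine([0.0] * LATENT, plan)
assert len(traj) == 5                    # one imagined latent state per action
```

The planner scores imagined trajectories like `traj` and picks actions without stepping the real environment.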

TeMPOraL · 5 months ago
> Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?

I imagine it's the latter, and in general, we're already dealing with plenty of models with world models hidden inside their weights. That's why I'm happy to see the direction Anthropic has been taking with their interpretability research over the years.

Their papers, as well as most discussions around them, focus on issues of alignment/control, safety, and generally killing the "stochastic parrot" meme and keeping it dead - but I think it'll be even more interesting to see attempts at mapping how those large models structure their world models. I believe there's scientific and philosophical discoveries to be made in answering why these structures look the way they do.

namaria · 5 months ago
> killing the "stochastic parrot" meme

This was clearly the goal of the "Biology of LLMs" (and ancillary) paper but I am not convinced.

They used a 'replacement model' that by their own admission could match the output of the LLM ~50% of the time, and the attribution of cognition related labels to the model hinges entirely on the interpretation of the 'activations' seen in the replacement model.

So they created a much simpler model, that sorta kinda can do what the LLM can do in some instances, contrived some examples, observed the replacement model and labeled what it was doing very liberally.

Machine learning and the mathematics involved is quite interesting but I don't see the need to attribute neuroscience/psychology related terms to them. They are fascinating in their own terms and modelling language can clearly be quite powerful.

But thinking that they can follow instructions and reason is the source of much misdirection. The limits of this approach should make clear that feeding text to a text continuation program should not lead to parsing the generated text for commands and running these commands, because the tokens the model outputs are just statistically linked to the tokens inputted to them. And as the model takes more tokens from the wild, it can easily lead to situations that are very clearly an enormous risk. Pushing the idea that they are reasoning about the input is driving all sorts of applications that seeing them as statistical text continuation programs would make clear are a glaring risk.

Machine learning and LLMs are interesting technology that should be investigated and developed. Reasoning by induction that they are capable of more than modelling language is bad science and drives bad engineering.

reportgunner · 5 months ago
Article makes it seem like finding diamonds is some kind of super complicated logical puzzle. In reality the hardest part is knowing where to look for them and what tool you need to mine them without losing them once you find them. This was given to the AI by having it watch a video that explains it.

If you watch a guide on how to find diamonds it's really just a matter of getting an iron pickaxe, digging to the right depth and strip mining until you find some.

danijar · 5 months ago
Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.

It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2

itchyjunk · 5 months ago
Since diamonds are surrounded by danger and if it dies, it loses its items and such, why would it not be satisfied after discovering iron pick axe or somesuch? Is it in a mode where it doesn't lose its item when it dies? Does it die a lot? Does it ever try digging vertically down? Does it ever discover other items/tools you didn't expect it to? Open world with sparse reward seems like such a hard problem. Also, once it gets the item, does it stop getting reward for it? I assume so. Surprised that it can work with this level of sparse rewards.

Deleted Comment

SpaceManNabs · 5 months ago
I just want to express my condolences in how difficult it must be to correct basic misunderstandings that can be immediately corrected from reading the fourth paragraph under the section "Diamonds are forever"

Thanks for your hard work.

ryan-duve · 5 months ago
For the curious, from the link above:

> log, plank, stick, crafting table, wooden pickaxe, cobblestone, stone pickaxe, iron ore, furnace, iron ingot, iron pickaxe and diamond
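
That milestone list maps onto a very small reward scheme. A hypothetical sketch, not the paper's code (the `MilestoneReward` class and inventory interface are invented for illustration):

```python
# +1 the first time each of the 12 items appears, nothing afterwards.
MILESTONES = {
    "log", "plank", "stick", "crafting_table", "wooden_pickaxe",
    "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
    "iron_ingot", "iron_pickaxe", "diamond",
}

class MilestoneReward:
    def __init__(self, milestones=MILESTONES):
        self.milestones = set(milestones)
        self.collected = set()

    def step(self, inventory):
        """Return +1 for each milestone item newly present in the inventory."""
        new = (set(inventory) & self.milestones) - self.collected
        self.collected |= new
        return float(len(new))

rewarder = MilestoneReward()
assert rewarder.step(["log"]) == 1.0           # first log: +1
assert rewarder.step(["log", "plank"]) == 1.0  # log already counted, plank new
assert rewarder.step(["log", "plank"]) == 0.0  # nothing new: sparse
```

Everything between milestones returns zero, which is what makes the exploration problem hard.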

kuu · 5 months ago
While I agree with your comment, this sentence:

"This was given to the AI by having it watch a video that explains it."

Just a few months ago, this would not have been as trivial as it may seem now...

rcxdude · 5 months ago
EDIT: Incorrect, see below

it didn't watch 'a video', it watched many, many hours of video of playing minecraft (with another specialised model feeding in predictions of keyboard and mouse inputs from the video). It's still a neat trick, but it's far from the implied one-shot learning.

NVHacker · 5 months ago
Alpha Star was also trained initially from youtube videos of pros playing Starcraft. I would argue that it was pretty trivial a few years ago.
skwirl · 5 months ago
>This was given to the AI by having it watch a video that explains it.

That is not what the article says. It says that was separate, previous research.

Bluglionio · 5 months ago
I don't get it. How can you reduce this achievement down to this?

Have you gotten used to some ai watching a video and 'getting it' so fast that this is boring? Unimpressive?

jerf · 5 months ago
The other replies have observed that the AI didn't get any "videos to watch" but I'd also observe that this is being used as an English colloquialism. The AIs aren't "watching videos", they're receiving videos as their training data. That's quite different from what is coming to your mind as "watching a video" as if the AI watched a single YouTube tutorial video once and got the concept.
reportgunner · 5 months ago
I feel like you are jumping to conclusions here, I wasn't talking about the achievement or the AI, I was talking about the article and the way it explains finding diamonds in minecraft to people who don't know how to find diamonds in minecraft.
rowanG077 · 5 months ago
The AI is able to learn from video and you don't find that even a little bit impressive? Well I disagree.
DeborahEmeni_ · 5 months ago
The “holding a button” thing actually resonated. It feels like the real work here is engineering the reward structure to make exploration even remotely viable. Dreamer’s world model might be cool, but most of the heavy lifting still seems to come from how forgiving the Minecraft environment is for training.

I do wonder though: if you swapped Minecraft for a cloud-based synthetic world with similar physics but messier signals, like object permanence or social reasoning, would Dreamer still hold up? Or is it just really good at the kind of clean reward hierarchies that games offer?

lupusreal · 5 months ago
Characterizing finding diamonds as "mastering" Minecraft is extremely silly. Tantamount to saying "AI masters Chess: Captures a pawn." Getting diamonds is not even close to the hardest challenge in the game, but most readers of Nature probably don't have much experience playing Minecraft so the title is actually misleading, not harmless exaggeration.
zimpenfish · 5 months ago
> Getting diamonds is not even close to the hardest challenge in the game

Mining diamonds isn't even necessary if you build, e.g., ianxofour's iron farm on day one and trade that iron[0] with a toolsmith, armourer, and weaponsmith. You can get full diamond armour, tools, and weapons pretty quickly (probably a handful of game weeks?)

[0] Main faff here is getting them off their base trade level.

lupusreal · 5 months ago
True, and if the objective is to get some raw diamonds as fast as possible, demonstrating mastery of the game, I'd expect a strategy like making a boat, finding a shipwreck and then a buried treasure chest. That usually takes just a few minutes.

Really though, if AI wants to impress me it needs to collect an assortment of materials and build a decent looking base. Play the way humans usually play.

danijar · 5 months ago
I agree with you, this is just the start and Minecraft has a lot more to offer for future research!
CodeCompost · 5 months ago
I didn't know that Nature did movie promotions.
YeGoblynQueenne · 5 months ago
Reinforcement learning is very good with games.

>> In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.

And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.

danijar · 5 months ago
For a lot of things, VLMs are good enough already to provide rewards. Give them the recent images and a text description of the task and ask whether the task was accomplished or not.

For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against.
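
A minimal sketch of that VLM-as-reward idea. The `vlm_yes_no` judge here is a stub standing in for a call to a real vision-language model; everything in it is invented for the example:

```python
def vlm_yes_no(frames, question):
    """Stub judge: a real system would send the recent frames plus the
    question to a VLM and parse a yes/no answer. This stand-in just
    answers deterministically so the sketch runs."""
    return "stack of logs" in question

def task_reward(frames, task_description):
    """Binary reward: ask the judge whether the described task was done."""
    done = vlm_yes_no(frames, f"Was this task accomplished: {task_description}?")
    return 1.0 if done else 0.0

assert task_reward([], "collect a stack of logs") == 1.0
assert task_reward([], "craft a furnace") == 0.0
```

In the second setup described above, such judgments would instead label videos offline, and a smaller reward model trained on those labels would be queried during RL.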

IshKebab · 5 months ago
Plenty of real world situations have clear objectives with obvious rewards.
YeGoblynQueenne · 5 months ago
Example.
SpaceManNabs · 5 months ago
> And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.

I encourage you to read deepmind's work with robots.

YeGoblynQueenne · 5 months ago
Oh I have. For example I remember this project:

>> Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.

https://research.google/blog/scalable-deep-reinforcement-lea...

That was in 2018.

So what do you think, is vision-based robotic manipulation and grasping a solved problem, seven years later? Is QT-Opt now an established industry standard in training robots with RL?

Or was that just another project that was announced with great fanfare and hailed as a breakthrough that would surely lead to great increase of capabilities... only to pop, fizzle and disappear in obscurity without any real-world result, a few years later? Like most of DeepMind's RL projects do?

smokel · 5 months ago
> games have clear objectives with obvious rewards. The real world, not so much.

Tell that to the people here who are trying to turn their startup ideas into money.

zamadatix · 5 months ago
I don't think folks go the startup path because the steps to go from idea to making money are obvious and clear.
janalsncm · 5 months ago
> it is never going to work in the real world

DeepSeek used RL to train R1, so that is clearly not true. But ignoring that, what is your alternative? Supervised learning? Good luck finding labels if you don’t even know what the objective is.

YeGoblynQueenne · 5 months ago
No, let's not ignore DeepSeek: text is not the real world any more than Minecraft is the real world.

And why do I have to offer an alternative? If it's not working, it's not working, regardless of whether there's an alternative (that we know of) or not.

colechristensen · 5 months ago
Who would have thought you could get your TAS run published in Nature if you used enough hot buzzwords. (They have been using various old-school-definition "artificial intelligence" algorithms for a long time.)

https://tasvideos.org/