> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work
> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)
> - Action space is limited
> - It is far from being a real game engine and has a long way to go, but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...
I'm similarly surprised at how fast they are progressing. I wrote this piece a few months ago about how I think steering world model output is the next realm of AAA gaming:
https://kylekukshtel.com/diffusion-aaa-gamedev-doom-minecraf...
But even when I wrote that I thought things were still a few years out. I facetiously said that Rockstar would be nerd-sniped on GTA6 by a world model, which sounded crazy a few months ago. But seeing the progress already made since GameNGen and knowing GTA6 is still a year away... maybe it will actually happen.
> Rockstar would be nerd-sniped on GTA6 by a world model
I'm having trouble parsing your meaning here.
GTA isn't really a "drive on the street simulator", is it? There is deliberate creative and artistic vision that makes the series so enjoyable to play even decades after release, despite the graphics quality becoming more dated every year by AAA standards.
Are you saying someone would "vibe model" a GTAish clone with modern graphics that would overtake the actual GTA6 in popularity? That seems extremely unlikely to me.
The future of games was MMORPGs and RPG-ization in general as other genres adopted progression systems. But the former two are simply too expensive and risky even today for AAA to develop. Which brings us to another point: the problem with Western AAA is more about high levels of risk aversion, which is what's really feeding the lack of imagination. And that's more to do with the economics of opportunity cost versus the S&P 500.
Anyways, crafting pretty-looking worlds is one thing, but you still need to fill them with something worth doing, and that's something we haven't really figured out. That's one of the reasons the sandbox MMORPG was developed, as opposed to "themeparks". The underlying systems, the backend, are the real meat here. At most, world models right now replace 3D artists and animators, and I would not say that is the real bottleneck relative to one's own limitations.
I'm trying to wrap my head around this, since we're still seeing text spit out slowly (I mean slowly as in thousands of tokens a second).
I'm starting to think some of the names behind LLMs/GenAI are cover names for aliens and any actual humans involved have signed an NDA that comes with millions of dollars and a death warrant if disobeyed.
A neural net can produce information outside of its original data set, but it is all directly derived from that initial set. There are fundamental information constraints here. You cannot use a neural net to generate wholly new, original, full-quality training data for itself from its existing data set.
You can use a neural net to generate data, and you can train a net on that data, but you'll end up with something that is no good.
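A toy illustration of that degradation (my own sketch, not from the comment, and a deliberately simplified stand-in for what people call "model collapse"): fit a Gaussian to data, sample a new training set from the fit, refit, and repeat. With small samples the distribution steadily narrows.

```python
import numpy as np

# Toy "train on your own outputs" loop: each generation fits a Gaussian to the
# previous generation's samples, then samples a fresh data set from that fit.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)   # the original "real" data

for generation in range(201):
    mu, sigma = data.mean(), data.std()
    if generation % 50 == 0:
        print(f"gen {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
    data = rng.normal(loc=mu, scale=sigma, size=50)   # next gen's training data
```

The spread drifts toward zero over generations; a crude analogue of the quality loss the comment is pointing at.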
I'm invested in a startup that is doing something unrelated in robotics, but they're spending a lot of time in Shenzhen. I keep a very close eye on robotics and was talking to their CTO about what he is seeing in China: versions of this are already being implemented.
I think this puts Epic Games, Nintendo, and the whole lot into a very tough spot if this tech takes off.
I don't see how Unreal Engine, with its voluminous and labyrinthine tomes of impenetrable legacy C++ code, survives this. Unreal Engine is a mess, gamers are unhappy about it, and it's a PITA to develop with. I certainly hate working with it.
The Innovator's Dilemma is fast approaching the entire gaming industry, and they don't even see it coming, it's happening so fast.
Exciting that building games could become as easy as having the idea itself. I'm imagining something like VRChat or Roblox or Fortnite, but where new things are simply spoken into existence.
It's absolutely terrifying that Google has this much power.
Reality is not composed of words, syntax, and semantics. A human modality is.
The other human modalities are sensory only, no language.
So vision learning, and energy models that capture the energy needed to achieve a visual, audio, or physical robotics behavior, are the only real goal.
Software is for those who read the manual with their new NES game. Where are the words inside us?
Statistical physics of energy, to make the machine draw the glyphs of language rather than do opinionated clustering of language, is what will close the keyboard-and-mouse input loop. We're essentially replicating human work habits. Those are real physical behaviors, not just descriptions in words.
This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:
1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).
2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute (a quick check of this arithmetic is sketched after point 3).
3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of a text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
[1] https://x.com/demishassabis/status/1940248521111961988
[2] https://deepmind.google/api/blob/website/media/genie_environ...
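The arithmetic in point 2, spelled out (the 24 fps, 1280x720, 4x temporal and 16x16 spatial downscaling figures are the commenter's guesses, not confirmed details of Genie 3):

```python
# Back-of-the-envelope token rate for a latent video model, using the guessed
# downscaling factors above (not confirmed for Genie 3).
fps = 24
width, height = 1280, 720
temporal_downscale = 4    # guessed from textures "jumping" every 4 frames
spatial_downscale = 16    # guessed from 16x16 blocking during fast motion

tokens_per_latent_frame = (width // spatial_downscale) * (height // spatial_downscale)
latent_frames_per_second = fps / temporal_downscale
tokens_per_second = tokens_per_latent_frame * latent_frames_per_second

print(tokens_per_latent_frame)          # 80 * 45 = 3600
print(tokens_per_second)                # 3600 * 6 = 21600.0
print(tokens_per_second * 60 / 1e6)     # ~1.3 million tokens per minute
```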
Regarding latency, I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving. This writeup [2] from someone who tried the Genie 3 research preview mentions that "while there is some control lag, I was told that this is due to the infrastructure used to serve the model rather than the model itself" so a lot of this latency may be added by their client/server streaming setup.
[1] https://x.com/holynski_/status/1952756737800651144
[2] https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...
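For completeness, the latency estimate is just the frame count divided by the frame rate (the 33-frame count is the commenter's own count from the linked video, not something confirmed elsewhere):

```python
# Keypress-to-photon latency from counted video frames between the on-screen
# key lighting up and the camera starting to move.
frames_between_events = 33   # commenter's count from the linked gameplay video
video_fps = 30
print(frames_between_events / video_fps)   # 1.1 seconds
```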
You know that thing in anxiety dreams where you feel very uncoordinated and your attempts to manipulate your surroundings result in unpredictable consequences? Like you try to slam on the brake pedal but your car doesn’t slow down, or you’re trying to get a leash on your dog to lead it out of a dangerous situation and you keep failing to hook it on the collar? Maybe that’s extra latency because your brain is trying to render the environment at the same time as it is acting.
> I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving.
Really impressive... but wow this is light on details.
While I don't fully align with the sentiment of other commenters that this is meaningless unless you can go hands-on... it is crazy to think how different this announcement is from a few years ago, when this would have been accompanied by an actual paper that shared the research.
Instead... we get this thing that has a few aspects of a paper - authors, demos, a bibtex citation(!) - but none of the actual research shared.
I was discussing with a friend that my biggest concern with AI right now is not that it isn't capable of doing things... but that we switched from research/academic mode to full value extraction so fast that we are way out over our skis in terms of what is being promised. In the realm of an exciting new field of academic research, overpromising is pretty low-stakes, all things considered... it becomes terrifying when we bet policy and economics on it.
To be clear, I am not against commercialization, but the dissonance of a product announcement dressed up to look like research, at the same time that one of the preeminent mathematicians is writing about how our shift in funding of real academic research is having real, serious impact, is... uh... not confidence-inspiring for the long term.
I wish they would share more about how it works. Maybe a research paper for once? We didn't even get a technical report.
My best guess: it's a video generation model like the ones we already had, but they condition on inputs (movement direction, view angle). Perhaps the inputs aren't relative but absolute, and there is a bit of state simulation going on? [Although some demo videos show physics interactions like bumping against objects, so that might be unlikely; or maybe it's 2D and the up axis is generated??]
It's clearly trained on game-engine footage, as I can see screen-space reflection artefacts being learned. They also train on photoscans/splats... some non-realistic elements look significantly lower fidelity too.
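As a rough sketch of what "conditioning a video model on control inputs" could look like (purely illustrative; the architecture, names, and shapes here are my own assumptions, not anything DeepMind has described), each step would take the past latent frames plus an embedded action and predict the next latent frame:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Illustrative only: predict the next latent frame given past latents
    and a discrete control action (e.g. move/turn keys)."""
    def __init__(self, latent_dim=256, num_actions=8):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, latent_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, past_latents, action_ids):
        # past_latents: (batch, time, latent_dim); action_ids: (batch,)
        a = self.action_embed(action_ids).unsqueeze(1)   # (batch, 1, latent_dim)
        x = torch.cat([past_latents, a], dim=1)          # append the action as a token
        h = self.backbone(x)
        return self.head(h[:, -1])                       # next latent frame

model = ActionConditionedPredictor()
next_latent = model(torch.randn(1, 16, 256), torch.tensor([3]))
print(next_latent.shape)  # torch.Size([1, 256])
```

Whether Genie 3 conditions on discrete key presses, continuous camera deltas, or something else entirely is exactly the kind of detail the announcement doesn't say.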
Some inconsistencies I have noticed in the demo videos:
- wingsuit disocclusions are lower fidelity (maybe initialized by a high-resolution image?)
- the garden demo has different "geometry" for each variation; look at the 2nd hose only existing in one version (new "geometry" is made up when first looked at, not beforehand)
- the school demo has half a car outside the window? and a suspiciously repeating pattern (infinite-loop patterns are common in transformer models that lack parameters, so they can scale this even more! also might be greedy sampling for stability)
- the museum scene has an odd reflection in the amethyst box: the rear mammoth doesn't have reflections on the rightmost side of the box before it's shown through the box. The tusk reflection just pops in. This isn't a Fresnel effect.
I feel that after the 2017 transformer paper, and its impact on the current state of AI and on Google's stock, Google is much more hesitant to share, keeping things under its wing for now. Sadly so.
I'm still struggling to imagine a world where predicting the next pixel wins over building a deterministic thing that is then run.
E.g.: using AI to generate textures, wireframe models, and motion sequences, which themselves sum up to something that a local graphics card can then render into a scene.
I'm very much not an expert in this space, but to me it seems that if you do that, you can tweak the wireframe model or the texture, move the camera to wherever you want in the scene, etc.
At some point it will be computationally cheaper to predict the next pixel than to classically render the scene, when talking about scenes beyond a certain graphical fidelity.
The model can infinitely zoom in to some surface and depict (predict) what would really be there. Trying to do so via classical rendering introduces many technical challenges.
I work in game physics in the AAA industry, and I have studied and experimented with ML on my own. I'm sceptical that that's going to happen.
Imagine you want to build a model that renders a scene with the same style and quality as rasterisation. The fastest way to project a point onto the screen is to apply a matrix multiplication. If the model needs to keep the same level of spatial consistency as the rasteriser, it has to reproject points in space somehow.
But a model is made of a huge number of matrix multiplications interspersed with non-linear activations. Because of those non-linearities, it can't collapse down to the single matrix multiplication it is trying to imitate; it has to recover the linearity by approximating the transformation with many more operations.
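For concreteness (my own illustration of the point above, not from the comment): the transform a rasteriser applies per vertex is literally one matrix multiply in homogeneous coordinates, followed by a perspective divide; a network with non-linearities can only approximate that with many more operations.

```python
import numpy as np

# Classic perspective projection: one matrix multiply per point.
# Simple pinhole projection matrix with focal length f and square pixels.
f = 1.0
P = np.array([
    [f, 0, 0, 0],
    [0, f, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

point_world = np.array([0.5, -0.2, 2.0, 1.0])   # homogeneous 3D point
clip = P @ point_world                           # a single matmul
x, y = clip[0] / clip[2], clip[1] / clip[2]      # perspective divide
print(x, y)                                      # 0.25 -0.1
```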
Now, I know that transformers can exploit superposition when processing a lot of data. I also know neural networks could come up with all sorts of heuristics and approximations based on distance or other criteria.
However, I've read multiple papers showing that large models have a large number of useless parameters (the last one showed that their model could be reduced to just 4% of the original parameters, but the process they used requires re-training the model from scratch many times in a deterministic way, so it's not practical for large models).
This doesn't mean we might not end up using them anyway for real-time rendering. We could accept the trade-off and give up some coherence for more flexibility.
Or, given enough computational power, a larger model could be coherent enough for the human eye, while its much larger cost would be justified by its flexibility. In a way it's like how analog systems are much faster than digital ones, but we use digital ones anyway because they can be reprogrammed.
With frame prediction and upscaling, we have this trade-off already.
I imagine a future where the "high level" stuff in the environment is pre-defined by a human (with or without assistance from AI), and then AI sort of fills in the blanks on the fly.
So for example, a game designer might tell the AI the floor is made of mud, but won’t tell the AI what it looks like if the player decides to dig a 10 ft hole in the mud, or how difficult it is to dig, or what the mud sounds like when thrown out of the hole, or what a certain NPC might say when thrown down the hole, etc.
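One way to picture that split (a hypothetical sketch; `fill_in_details`, the field names, and the whole interface are made up for illustration, not any real engine API): the designer authors a few authoritative facts, and a generative model is queried for anything left unspecified at runtime.

```python
# Hypothetical illustration of "designer defines the high level, AI fills in the rest".
designer_spec = {
    "material": "mud",
    "diggable": True,
    "hole_depth_limit_ft": 10,
}

def fill_in_details(spec, question):
    """Stand-in for a generative model that answers anything the designer didn't
    author: appearance of a dug hole, digging difficulty, sounds, NPC reactions."""
    return f"[model-generated answer to {question!r}, constrained by {spec['material']}]"

print(fill_in_details(designer_spec, "what does a 10 ft hole in this floor look like?"))
print(fill_in_details(designer_spec, "what does the mud sound like when thrown out of the hole?"))
```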
> At some point it will be computationally cheaper to predict the next pixel than to classically render the scene,
This is already happening to some extent: some games struggle to reach 60 FPS at 4K resolution with maximum graphics settings using traditional rasterization alone, so technologies like DLSS 3 frame generation are used to improve performance.
Pass-through AR that does more than add a few things to the scene or very crude relighting from a scan mesh. Classic methods aren't great at it and tend to feel like you are just sticking some out-of-place objects on top of things. Apple gives a lighting estimate to make objects sit better in the scene, and may already be using some AI for that (I think it's just a cube map or a spherical-harmonic-based thing), but we'll want to do much more than matching lighting.
And made hands 10x worse. Now hands are good, text is good, image is good, so we’ll have to play where’s Waldo all over again trying to find the flaw. It’s going to eventually get to a point where it’s one of those infinite zoom videos where the AI watermark is the size of 1/3rd of a pixel.
What I’d really love to see more of is augmented video. Like, the stormtrooper vlogs. Runway has some good stuff but man is it all expensive.
Someone mentioned physics, which might be an interesting conundrum, because an important characteristic of games is that some part of them is both novel and unrealistic. (They're less fun if they're too real.)
I wouldn't say that the text problem has been fully fixed. It has certainly gotten a lot better, but even gpt-image-1 still fails occasionally when generating text.
This is revolutionary. I mean, we already could see this coming, but now it's here. With limitations, but this is the beginning.
In game engines it's the engineers, the software developers, who make sure triangles are at the perfect location, mapping to the correct pixels. But this here, this is now like a drawing made by a computer, frame by frame, with no triangles computed.
> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264
https://www.theguardian.com/technology/2025/aug/05/google-st...
Gemini Robot launch 4 mo ago:
https://news.ycombinator.com/item?id=43344082
Unbelievable. How is this not a miracle? So we're just stumbling onto breakthroughs?
It's basically what every major AI lab head is saying from the start. It's the peanut gallery that keeps saying they are lying to get funding.
Kind of like how a single neuron doesn't do much, but connect 100 billion of them and well...
So prescient. I definitely think this will be a thing in the near future, on a ~12-18 month time horizon.
What's with this insane desire for anthropomorphism? What do you even MEAN learn in its dreams? Fine-tuning overnight? Just say that!
He seems too enthusiastic to me, such that I suspect Google asked him in particular because they trusted him to write very positively.
Not for video games it isn’t.
I for one would love a video game where you're playing in a psychedelic, dream-like fugue.
so better than Stadia?
Disconcerting that it's daydreaming rather than authoring?
Reminds me of when image AIs weren't able to generate text. It wasn't too long until they fixed it.