modeless · 21 days ago
Consistency over multiple minutes and it runs in real time at 720p? I did not expect world models to be this good yet.

> Genie 3’s consistency is an emergent capability

So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?

Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264

> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).

> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work

> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)

> - Action space is limited

> - It is far from being a real game engine and has a long way to go, but this is a clear glimpse into the future.

Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...

ojosilva · 21 days ago
Gaming is certainly a use case, but I think this is primarily coming as synthetic data generation for training Google's robots in warehouses:

https://www.theguardian.com/technology/2025/aug/05/google-st...

Gemini Robotics launch 4 months ago:

https://news.ycombinator.com/item?id=43344082

onlyrealcuzzo · 20 days ago
This also seems pretty valuable to create CGI cut scenes...
kkukshtel · 21 days ago
I'm similarly surprised at how fast they are progressing. I wrote this piece a few months ago about how I think steering world model output is the next realm of AAA gaming:

https://kylekukshtel.com/diffusion-aaa-gamedev-doom-minecraf...

But even when I wrote that I thought things were still a few years out. I facetiously said that Rockstar would be nerd-sniped on GTA6 by a world model, which sounded crazy a few months ago. But seeing the progress already made since GameNGen and knowing GTA6 is still a year away... maybe it will actually happen.

ewoodrich · 21 days ago
> Rockstar would be nerd-sniped on GTA6 by a world model

I'm having trouble parsing your meaning here.

GTA isn't really a "drive on the street simulator", is it? There is deliberate creative and artistic vision that makes the series so enjoyable to play even decades after release, despite the graphics quality becoming more dated every year by AAA standards.

Are you saying someone would "vibe model" a GTAish clone with modern graphics that would overtake the actual GTA6 in popularity? That seems extremely unlikely to me.

corimaith · 21 days ago
The future of games was MMORPGs and RPG-ization in general as other genres adopted progression systems. But the former two are simply too expensive and risky even today for AAA to develop. Which brings us to another point: the problem with Western AAA is more about high levels of risk aversion, which is what's really feeding the lack of imagination. And that has more to do with the economics of opportunity cost relative to the S&P 500.

Anyways, crafting pretty-looking worlds is one thing, but you still need to fill them in with something worth doing, and that's something we haven't really figured out. That's one of the reasons why the sandbox MMORPG was developed as opposed to "themeparks". The underlying systems, the backend, are the real meat here. At most, world models right now replace 3D artists and animators, but I would not say that is a real bottleneck compared to one's own limitations.

throwmeaway222 · 21 days ago
I'm trying to wrap my head around this since we're still seeing text spit out slowly (I mean slowly as in thousands of tokens a second).

I'm starting to think some of the names behind LLMs/GenAI are cover names for aliens and any actual humans involved have signed an NDA that comes with millions of dollars and a death warrant if disobeyed.

ivape · 21 days ago
> So this just happened from scaling the model

Unbelievable. How is this not a miracle? So we're just stumbling onto breakthroughs?

silveraxe93 · 21 days ago
Is it actually unbelievable?

It's basically what every major AI lab head has been saying from the start. It's the peanut gallery that keeps saying they are lying to get funding.

shreezus · 21 days ago
There are a lot of "interesting" emergent behaviors that happen just as a result of scaling.

Kind of like how a single neuron doesn't do much, but connect 100 billion of them and well...

spaceman_2020 · 21 days ago
becoming really, really hard to refute the Simulation Theory
kfarr · 21 days ago
Bitter lesson strikes again!
nxobject · 21 days ago
_Especially_ given the goal of a world model using a rasters-only frame-by-frame approach. Holy shit.
diwank · 21 days ago
> Future robots may learn in their dreams...

So prescient. I definitely think this will be a thing in the near future, on a ~12-18 month time horizon.

casenmgreen · 21 days ago
I may be wrong, but this seems to make no sense.

A neural net can produce information outside of its original data set, but it is all directly derived from that initial set. There are fundamental information constraints here. You cannot use a neural net to generate wholly new, original, full-quality training data for itself from its existing data set.

You can use a neural net to generate data, and you can train a net on that data, but you'll end up with something which is no good.
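
A toy illustration of why (nothing to do with Genie specifically, just a minimal sketch of the self-training feedback loop): fit a Gaussian to data, sample from the fit, refit on those samples, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data the first model is trained on: wide, diverse.
data = rng.normal(loc=0.0, scale=1.0, size=30)

for gen in range(101):
    mu, sigma = data.mean(), data.std()  # "train" a model (MLE fit)
    if gen % 20 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
    # The next generation trains only on samples drawn from the previous model.
    data = rng.normal(loc=mu, scale=sigma, size=30)
```

Each refit shrinks the variance by (N-1)/N in expectation and adds sampling noise, so sigma decays over the generations and the lost diversity never comes back: the toy version of ending up with something no good.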

dingnuts · 21 days ago
what is a robot dream when there is clearly no consciousness?

What's with this insane desire for anthropomorphism? What do you even MEAN learn in its dreams? Fine-tuning overnight? Just say that!

neom · 21 days ago
I'm invested in a startup that is doing something unrelated to robotics, but they're spending a lot of time in Shenzhen. I keep a very close eye on robotics, and I was talking to their CTO about what he is seeing in China: versions of this are already being implemented.

Aco- · 21 days ago
"Do Androids Dream of Electric Sheep?"
casenmgreen · 21 days ago
The guy who tried it was invited by Google to do so.

He seems to me too enthusiastic, such that I feel Google asked him in particular because they trusted him to write very positively.

alphabetting · 21 days ago
I doubt there was a condition on writing positively. Other people who tested it have said this won't replace engines. https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...
echelon · 21 days ago
I don't know. I wasn't there and I'm excited.

I think this puts Epic Games, Nintendo, and the whole lot into a very tough spot if this tech takes off.

I don't see how Unreal Engine, with its voluminous and labyrinthine tomes of impenetrable legacy C++ code, survives this. Unreal Engine is a mess, gamers are unhappy about it, and it's a PITA to develop with. I certainly hate working with it.

The Innovator's Dilemma is fast approaching the entire gaming industry, and they don't even see it coming, it's happening so fast.

Exciting that building games could become as easy as having the idea itself. I'm imagining something like VRChat or Roblox or Fortnite, but where new things are simply spoken into existence.

It's absolutely terrifying that Google has this much power.

csomar · 21 days ago
Also he is ex-Google DeepMind. Like the worst kind of pick you can make when there are dozens of eligible journalists out there.
forrestthewoods · 21 days ago
> this is a clear glimpse into the future.

Not for video games it isn’t.

dlivingston · 21 days ago
Unless and until state can be stored outside of the model.

I for one would love a video game where you're playing in a psychedelic, dream-like fugue.

eboynyc32 · 21 days ago
Wow. How do we know we're not in Genie 4 right now?
SequoiaHope · 21 days ago
Yeah this is going to be excellent for robotics because it’s good enough to clear the reality gap (visually - physics would be another story).
ProofHouse · 21 days ago
Curious how multiplayer would possibly work, not only logistically but technically and from a gameplay POV.
resters · 21 days ago
Consider the hardware DOOM runs on. 720p would only be a true test of capability if every bit of possible detail was used.
tugn8r · 21 days ago
But that was always going to be the case?

Reality is not composed of words, syntax, and semantics. A human modality is.

Other human modalities are sensory only, no language.

So vision learning, and energy models that capture the energy needed to achieve visual, audio, and physical robotics behavior, are the only real goal.

Software is for those who read the manual with their new NES game. Where are the words inside us?

Statistical physics of energy to make a machine draw the glyphs of language, not opinionated clustering of language, is what will close the keyboard-and-mouse input loop. We're replicating human work habits. Those are real physical behaviors, not just descriptions in words.

ollin · 21 days ago
This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:

1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).

2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute (quick arithmetic check below).

3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of a text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.

[1] https://x.com/demishassabis/status/1940248521111961988

[2] https://deepmind.google/api/blob/website/media/genie_environ...
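
Quick check on the arithmetic in point 2 (the 4x temporal and 16x16 spatial factors are my guesses above, nothing confirmed):

```python
# Token-rate estimate for 720p24 video, assuming (not confirmed) a VAE with
# 4x temporal and 16x16 spatial downscaling.
fps, width, height = 24, 1280, 720
temporal_ds, spatial_ds = 4, 16

tokens_per_latent_frame = (width // spatial_ds) * (height // spatial_ds)  # 80 * 45 = 3600
latent_frames_per_sec = fps // temporal_ds                                # 6
tokens_per_sec = tokens_per_latent_frame * latent_frames_per_sec          # 21,600
print(f"{tokens_per_sec:,} tokens/s, {tokens_per_sec * 60:,} tokens/min") # ~1.3M tokens/min
```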

ollin · 21 days ago
Regarding latency, I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving. This writeup [2] from someone who tried the Genie 3 research preview mentions that "while there is some control lag, I was told that this is due to the infrastructure used to serve the model rather than the model itself" so a lot of this latency may be added by their client/server streaming setup.

[1] https://x.com/holynski_/status/1952756737800651144

[2] https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...

rotexo · 21 days ago
You know that thing in anxiety dreams where you feel very uncoordinated and your attempts to manipulate your surroundings result in unpredictable consequences? Like you try to slam on the brake pedal but your car doesn’t slow down, or you’re trying to get a leash on your dog to lead it out of a dangerous situation and you keep failing to hook it on the collar? Maybe that’s extra latency because your brain is trying to render the environment at the same time as it is acting.
blibble · 21 days ago
> I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving.

so better than Stadia?

addisonj · 21 days ago
Really impressive... but wow this is light on details.

While I don't fully align with the sentiment of other commenters that this is meaningless unless you can go hands on... it is crazy to think of how different this announcement is than a few years ago when this would be accompanied by an actual paper that shared the research.

Instead... we get this thing that has a few aspects of a paper - authors, demos, a bibtex citation(!) - but none of the actual research shared.

I was discussing with a friend that my biggest concern with AI right now is not that it isn't capable of doing things... but that we switched from research/academic mode to full value extraction so fast that we are way out over our skis in terms of what is being promised. In the realm of an exciting new field of academic research, over-promising is pretty low-stakes all things considered... it becomes terrifying when we bet policy and economics on it.

To be clear, I am not against commercialization, but the dissonance of this product announcement made to look like research, written in this way, at the same time that one of the preeminent mathematicians is writing about how our shift in funding of real academic research is having real, serious impact, is... uh... not confidence inspiring for the long term.

Vipitis · 21 days ago
I wish they would share more about how it works. Maybe a research paper for once? We didn't even get a technical report.

From my best guess: it's a video generation model like the ones we already have, but conditioned on inputs (movement direction, view angle). Perhaps they aren't relative inputs but absolute, and there is a bit of state simulation going on? [although some demo videos show physics interactions like bumping against objects - so that might be unlikely, or maybe it's 2D and the up axis is generated??] A rough sketch of that conditioning idea is at the end of this comment.

It's clearly trained on game-engine footage, as I can see screen-space reflection artefacts being learned. They also train on photoscans/splats... some non-realistic elements look significantly lower fidelity too.

some inconsistencies I have noticed in the demo videos:

- wingsuit disocclusions are lower fidelity (maybe initialized by a high-resolution image?)

- garden demo has different "geometry" for each variation; look at the 2nd hose only existing in one version (new "geometry" is made up when first looked at, not beforehand).

- school demo has half a car outside the window? and a suspiciously repeating pattern (infinite loop patterns are common in transformer models that lack parameters, so they can scale this even more! also might be greedy sampling for stability)

- museum scene has an odd reflection in the amethyst box: the rear mammoth doesn't have reflections on the rightmost side of the box before it's shown through the box. The tusk reflection just pops in. This isn't a Fresnel effect.
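
To make the conditioning guess concrete, here is a minimal sketch of an action-conditioned next-frame predictor (all names and shapes are made up for illustration; this is emphatically not how Genie 3 is actually built, and a real system would operate on VAE latents rather than raw pixels):

```python
import torch
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    """Toy autoregressive frame predictor conditioned on per-step controls."""

    def __init__(self, dim=256, action_dim=6, n_heads=8, n_layers=4, ctx=32):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, dim)   # movement direction, view angle, ...
        self.frame_embed = nn.Linear(3 * 64 * 64, dim)   # stand-in for a VAE encoder
        self.pos = nn.Parameter(torch.zeros(ctx, dim))   # learned positions over time
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.decode = nn.Linear(dim, 3 * 64 * 64)        # stand-in for a VAE decoder

    def forward(self, frames, actions):
        # frames:  (B, T, 3*64*64) flattened past frames
        # actions: (B, T, action_dim) control applied at each step
        T = frames.shape[1]
        x = self.frame_embed(frames) + self.action_embed(actions) + self.pos[:T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)  # only look backwards
        return self.decode(self.backbone(x, mask=causal))           # predicted next frames

model = ActionConditionedWorldModel()
frames = torch.randn(1, 8, 3 * 64 * 64)   # 8 past frames
actions = torch.randn(1, 8, 6)            # 8 control vectors
print(model(frames, actions).shape)       # torch.Size([1, 8, 12288])
```

Whether the real inputs are relative or absolute just changes what goes into the action embedding; the interesting part is that the model only ever sees pixels plus controls, with no explicit game state.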

3abiton · 21 days ago
I feel that after the 2017 transformer paper and its impact on the current state of AI (and Google's stock), Google is much more inclined to keep things under its wing for now. Sadly so.
maerF0x0 · 21 days ago
I'm still struggling to imagine a world where predicting the next pixel wins over building a deterministic thing that is then run.

E.g. using AI to generate textures, wireframe models, and motion sequences, which themselves sum up to something that a local graphics card can then render into a scene.

I'm very much not an expert in this space, but to me it seems that if you do that, you can tweak the wireframe model or the texture, move the camera to wherever you want in the scene, etc.

wolttam · 21 days ago
At some point it will be computationally cheaper to predict the next pixel than to classically render the scene, when talking about scenes beyond a certain graphical fidelity.

The model can infinitely zoom in to some surface and depict(/predict) what would really be there. Trying to do so via classical rendering introduces many technical challenges.

m4nu3l · 20 days ago
I work in game physics in the AAA industry, and I have studied and experimented with ML on my own. I'm sceptical that that's going to happen.

Imagine you want to build a model that renders a scene with the same style and quality as rasterisation. The fastest way to project a point onto the screen is to apply a matrix multiplication. If the model needs to keep the same level of spatial consistency as the rasteriser, it has to reproject points in space somehow.

But a model is made of a huge number of matrix multiplications interspersed by non-linear activations. Because of these non-linearities, it can't map a single matrix multiplication to its underlying multiplications. It has to recover the linearity by approximating the transformation with many more operations.
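
Concretely, here is a minimal numpy sketch of the standard OpenGL-style perspective projection a rasteriser does in one shot, which a network would have to approximate through stacks of nonlinear layers:

```python
import numpy as np

def perspective_matrix(fov_y_deg, aspect, near, far):
    """Standard OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0, 0.0,                          0.0],
        [0.0,        f,   0.0,                          0.0],
        [0.0,        0.0, (far + near) / (near - far),  2 * far * near / (near - far)],
        [0.0,        0.0, -1.0,                         0.0],
    ])

P = perspective_matrix(60.0, 16 / 9, 0.1, 100.0)
point = np.array([1.0, 2.0, -5.0, 1.0])  # homogeneous point in view space

clip = P @ point                          # one matrix multiplication...
ndc = clip[:3] / clip[3]                  # ...plus a perspective divide
print(ndc)                                # normalized device coordinates in [-1, 1]
```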

Now, I know that transformers can exploit superposition when processing a lot of data. I also know neural networks could come up with all sorts of heuristics and approximations based on distance or other criteria. However, I've read multiple papers showing that large models have a large number of useless parameters (the last one showed that their model could be reduced to just 4% of the original parameters, but the process they used requires re-training the model from scratch many times in a deterministic way, so it's not practical for large models).

This doesn't mean we might not end up using them anyway for real-time rendering. We could accept the trade-off and give up some coherence for more flexibility. Or, given enough computational power, a larger model could be coherent enough for the human eye, while its much larger cost will be justified by its flexibility. In a way it's like how analog systems are much faster than digital ones, but we use digital ones anyway because they can be reprogrammed.

With frame prediction and upscaling, we have this trade-off already.

brap · 21 days ago
I imagine a future where the "high level" stuff in the environment is predefined by a human (with or without assistance from AI), and then AI sort of fills in the blanks on the fly.

So for example, a game designer might tell the AI the floor is made of mud, but won’t tell the AI what it looks like if the player decides to dig a 10 ft hole in the mud, or how difficult it is to dig, or what the mud sounds like when thrown out of the hole, or what a certain NPC might say when thrown down the hole, etc.
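
Something like this split, very loosely (all names here are hypothetical and the generative call is just a placeholder, not any real API):

```python
from dataclasses import dataclass, field

@dataclass
class MaterialSpec:
    """Authored by the designer: the high-level truth the model must respect."""
    name: str
    dig_difficulty: float  # 0 = trivial to dig, 1 = impossible

@dataclass
class WorldSpec:
    materials: dict = field(default_factory=dict)

def fill_in_detail(world: WorldSpec, material_key: str, query: str) -> str:
    """Placeholder for the on-the-fly generative step: given the authored
    constraints, a world model would synthesize the unauthored detail
    (the look of a 10 ft hole, digging sounds, NPC reactions, ...)."""
    spec = world.materials[material_key]
    return f"[generated: '{query}' constrained by {spec.name}, dig_difficulty={spec.dig_difficulty}]"

world = WorldSpec()
world.materials["floor"] = MaterialSpec(name="mud", dig_difficulty=0.3)
print(fill_in_detail(world, "floor", "what the player sees after digging a 10 ft hole"))
```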

lossolo · 21 days ago
> At some point it will be computationally cheaper to predict the next pixel than to classically render the scene,

This is already happening to some extent: some games struggle to reach 60 FPS at 4K resolution with maximum graphics settings using traditional rasterization alone, so technologies like DLSS 3 frame generation are used to improve performance.

abrookewood · 21 days ago
Can you explain why this is the case? I don't understand why.
Terretta · 21 days ago
> I'm still struggling to imagine a world where predicting the next pixel wins over building a deterministic thing that is then run.

Disconcerting that it's daydreaming rather than authoring?

cma · 21 days ago
Pass-through AR that does more than add a few things to the scene or do very crude relighting from a scan mesh. Classic methods aren't great at it and tend to feel like you are just sticking some out-of-place objects on top of things. Apple gives a lighting estimate to make objects sit better in the scene, and may already be using some AI for that (I think it's just a cube map or a spherical-harmonic-based thing). But we'll want to do much more than matching lighting.
energy123 · 21 days ago
"Wins" in the sense of being useful, or being on the critical path to AGI?
yanis_t · 21 days ago
> Text rendering. Clear and legible text is often only generated when provided in the input world description.

Reminds me of when image AIs weren't able to generate text. It wasn't too long until they fixed it.

reactordev · 21 days ago
And made hands 10x worse. Now hands are good, text is good, image is good, so we'll have to play Where's Waldo all over again trying to find the flaw. It's going to eventually get to a point where it's one of those infinite-zoom videos where the AI watermark is the size of 1/3rd of a pixel.

What I’d really love to see more of is augmented video. Like, the stormtrooper vlogs. Runway has some good stuff but man is it all expensive.

maerF0x0 · 21 days ago
Someone mentioned physics, which might be an interesting conundrum because an important characteristic of games is that some part of them is both novel and unrealistic. (They're less fun if they're too real.)
TheAceOfHearts · 21 days ago
I wouldn't say that the text problem has been fully fixed. It has certainly gotten a lot better, but even gpt-image-1 still fails occasionally when generating text.
yencabulator · 21 days ago
Note that the prompt and the generated chalkboard disagree on whether there's a dash or not.
qwertox · 21 days ago
This is revolutionary. I mean, we already could see this coming, but now it's here. With limitations, but this is the beginning.

In game engines it's the engineers, the software developers, who make sure triangles are at the perfect location, mapping to the correct pixels. But this, this is now like a drawing made by a computer, frame by frame, with no triangles computed.

Oarch · 21 days ago
I don't think I've ever seen a presentation that's had me question reality multiple times before. My mind is suitably blown.