> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work
> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)
> - Action space is limited
> - It is far from being a real game engine and has a long way to go, but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...
I'm similarly surprised at how fast they are progressing. I wrote this piece a few months ago about how I think steering world model output is the next realm of AAA gaming:
https://kylekukshtel.com/diffusion-aaa-gamedev-doom-minecraf...
But even when I wrote that I thought things were still a few years out. I facetiously said that Rockstar would be nerd-sniped on GTA6 by a world model, which sounded crazy a few months ago. But seeing the progress already made since GameNGen and knowing GTA6 is still a year away... maybe it will actually happen.
> Rockstar would be nerd-sniped on GTA6 by a world model
I'm having trouble parsing your meaning here.
GTA isn't really a "drive on the street simulator", is it? There is deliberate creative and artistic vision that makes the series so enjoyable to play even decades after release, despite the graphics quality becoming more dated every year by AAA standards.
Are you saying someone would "vibe model" a GTAish clone with modern graphics that would overtake the actual GTA6 in popularity? That seems extremely unlikely to me.
The future of games was MMORPGs and RPG-ization in general as other genres adopted progression systems. But the former two are simply too expensive and risky even today for AAA to develop. Which brings us to another point: the problem with Western AAA is more about high levels of risk aversion, which is what's really feeding the lack of imagination. And that's more to do with the economics of opportunity cost versus the S&P 500.
Anyways, crafting pretty-looking worlds is one thing, but you still need to fill them with something worth doing, and that's something we haven't really figured out. That's one of the reasons the sandbox MMORPG was developed, as opposed to "themeparks". The underlying systems, the backend, are the real meat here. At most, world models right now replace 3D artists and animators, and I would not say that is the real bottleneck relative to one's own limitations.
I'm trying to wrap my head around this, since we're still seeing text spit out slowly (I mean slowly as in thousands of tokens a second).
I'm starting to think some of the names behind LLMs/GenAI are cover names for aliens and any actual humans involved have signed an NDA that comes with millions of dollars and a death warrant if disobeyed.
A neural net can produce information outside of its original data set, but it is all directly derived from that initial set. There are fundamental information constraints here. You cannot use a neural net to generate wholly new, original, full-quality training data for itself from its existing data set.
You can use a neural net to generate data, and you can train a net on that data, but you'll end up with something that is no good.
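A toy illustration of that degradation (my own sketch, not from the comment, and a deliberately simplified stand-in for what people call "model collapse"): fit a Gaussian to data, sample a new training set from the fit, refit, and repeat. With small samples the distribution steadily narrows.

```python
import numpy as np

# Toy "train on your own outputs" loop: each generation fits a Gaussian to the
# previous generation's samples, then samples a fresh data set from that fit.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)   # the original "real" data

for generation in range(201):
    mu, sigma = data.mean(), data.std()
    if generation % 50 == 0:
        print(f"gen {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
    data = rng.normal(loc=mu, scale=sigma, size=50)   # next gen's training data
```

The spread drifts toward zero over generations; a crude analogue of the quality loss the comment is pointing at.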
I'm invested in a startup that is doing something unrelated in robotics, but they're spending a lot of time in Shenzhen. I keep a very close eye on robotics and was talking to their CTO about what he is seeing in China: versions of this are already being implemented.
I think this puts Epic Games, Nintendo, and the whole lot into a very tough spot if this tech takes off.
I don't see how Unreal Engine, with its voluminous and labyrinthine tomes of impenetrable legacy C++ code, survives this. Unreal Engine is a mess, gamers are unhappy about it, and it's a PITA to develop with. I certainly hate working with it.
The Innovator's Dilemma is fast approaching the entire gaming industry, and they don't even see it coming, it's happening so fast.
Exciting that building games could become as easy as having the idea itself. I'm imagining something like VRChat or Roblox or Fortnite, but where new things are simply spoken into existence.
It's absolutely terrifying that Google has this much power.
Reality is not composed of words, syntax, and semantics. A human modality is.
The other human modalities are sensory only, no language.
So vision learning, and energy models that capture the energy needed to achieve a visual, audio, or physical robotics behavior, are the only real goal.
Software is for those who read the manual with their new NES game. Where are the words inside us?
Statistical physics of energy, to make the machine draw the glyphs of language rather than do opinionated clustering of language, is what will close the keyboard-and-mouse input loop. We're essentially replicating human work habits. Those are real physical behaviors, not just descriptions in words.
This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:
1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).
2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute (a quick check of this arithmetic is sketched after point 3).
3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of a text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
[1] https://x.com/demishassabis/status/1940248521111961988
[2] https://deepmind.google/api/blob/website/media/genie_environ...
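The arithmetic in point 2, spelled out (the 24 fps, 1280x720, 4x temporal and 16x16 spatial downscaling figures are the commenter's guesses, not confirmed details of Genie 3):

```python
# Back-of-the-envelope token rate for a latent video model, using the guessed
# downscaling factors above (not confirmed for Genie 3).
fps = 24
width, height = 1280, 720
temporal_downscale = 4    # guessed from textures "jumping" every 4 frames
spatial_downscale = 16    # guessed from 16x16 blocking during fast motion

tokens_per_latent_frame = (width // spatial_downscale) * (height // spatial_downscale)
latent_frames_per_second = fps / temporal_downscale
tokens_per_second = tokens_per_latent_frame * latent_frames_per_second

print(tokens_per_latent_frame)          # 80 * 45 = 3600
print(tokens_per_second)                # 3600 * 6 = 21600.0
print(tokens_per_second * 60 / 1e6)     # ~1.3 million tokens per minute
```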
Regarding latency, I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving. This writeup [2] from someone who tried the Genie 3 research preview mentions that "while there is some control lag, I was told that this is due to the infrastructure used to serve the model rather than the model itself" so a lot of this latency may be added by their client/server streaming setup.
[1] https://x.com/holynski_/status/1952756737800651144
[2] https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...
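For completeness, the latency estimate is just the frame count divided by the frame rate (the 33-frame count is the commenter's own count from the linked video, not something confirmed elsewhere):

```python
# Keypress-to-photon latency from counted video frames between the on-screen
# key lighting up and the camera starting to move.
frames_between_events = 33   # commenter's count from the linked gameplay video
video_fps = 30
print(frames_between_events / video_fps)   # 1.1 seconds
```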
You know that thing in anxiety dreams where you feel very uncoordinated and your attempts to manipulate your surroundings result in unpredictable consequences? Like you try to slam on the brake pedal but your car doesn’t slow down, or you’re trying to get a leash on your dog to lead it out of a dangerous situation and you keep failing to hook it on the collar? Maybe that’s extra latency because your brain is trying to render the environment at the same time as it is acting.
> I found a live video of gameplay here [1] and it looks like closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving.
Really impressive... but wow this is light on details.
While I don't fully align with the sentiment of other commenters that this is meaningless unless you can go hands-on... it is crazy to think how different this announcement is from a few years ago, when this would have been accompanied by an actual paper that shared the research.
Instead... we get this thing that has a few aspects of a paper - authors, demos, a bibtex citation(!) - but none of the actual research shared.
I was discussing with a friend that my biggest concern with AI right now is not that it isn't capable of doing things... but that we switched from research/academic mode to full value extraction so fast that we are way out over our skis in terms of what is being promised. In the realm of an exciting new field of academic research, overpromising is pretty low-stakes, all things considered... it becomes terrifying when we bet policy and economics on it.
To be clear, I am not against commercialization, but the dissonance of a product announcement dressed up to look like research, at the same time that one of the preeminent mathematicians is writing about how our shift in funding of real academic research is having real, serious impact, is... uh... not confidence-inspiring for the long term.
I wish they would share more about how it works. Maybe a research paper for once? We didn't even get a technical report.
My best guess: it's a video generation model like the ones we already had, but they condition on inputs (movement direction, view angle). Perhaps the inputs aren't relative but absolute, and there is a bit of state simulation going on? [Although some demo videos show physics interactions like bumping against objects, so that might be unlikely; or maybe it's 2D and the up axis is generated??]
It's clearly trained on game-engine footage, as I can see screen-space reflection artefacts being learned. They also train on photoscans/splats... some non-realistic elements look significantly lower fidelity too.
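As a rough sketch of what "conditioning a video model on control inputs" could look like (purely illustrative; the architecture, names, and shapes here are my own assumptions, not anything DeepMind has described), each step would take the past latent frames plus an embedded action and predict the next latent frame:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Illustrative only: predict the next latent frame given past latents
    and a discrete control action (e.g. move/turn keys)."""
    def __init__(self, latent_dim=256, num_actions=8):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, latent_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, past_latents, action_ids):
        # past_latents: (batch, time, latent_dim); action_ids: (batch,)
        a = self.action_embed(action_ids).unsqueeze(1)   # (batch, 1, latent_dim)
        x = torch.cat([past_latents, a], dim=1)          # append the action as a token
        h = self.backbone(x)
        return self.head(h[:, -1])                       # next latent frame

model = ActionConditionedPredictor()
next_latent = model(torch.randn(1, 16, 256), torch.tensor([3]))
print(next_latent.shape)  # torch.Size([1, 256])
```

Whether Genie 3 conditions on discrete key presses, continuous camera deltas, or something else entirely is exactly the kind of detail the announcement doesn't say.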
Some inconsistencies I have noticed in the demo videos:
- wingsuit disocclusions are lower fidelity (maybe initialized by a high-resolution image?)
- the garden demo has different "geometry" for each variation; look at the 2nd hose only existing in one version (new "geometry" is made up when first looked at, not beforehand)
- the school demo has half a car outside the window? and a suspiciously repeating pattern (infinite-loop patterns are common in transformer models that lack parameters, so they can scale this even more! also might be greedy sampling for stability)
- the museum scene has an odd reflection in the amethyst box: the rear mammoth doesn't have reflections on the rightmost side of the box before it's shown through the box. The tusk reflection just pops in. This isn't a Fresnel effect.
I feel that after the 2017 transformer paper, and its impact on the current state of AI and on Google's stock, Google is much more hesitant to share, keeping things under its wing for now. Sadly so.
I'm still struggling to imagine a world where predicting the next pixel wins over building a deterministic thing that is then run.
E.g.: using AI to generate textures, wireframe models, and motion sequences, which themselves sum up to something that a local graphics card can then render into a scene.
I'm very much not an expert in this space, but to me it seems that if you do that, you can tweak the wireframe model or the texture, move the camera to wherever you want in the scene, etc.
At some point it will be computationally cheaper to predict the next pixel than to classically render the scene, when talking about scenes beyond a certain graphical fidelity.
The model can infinitely zoom in to some surface and depict (predict) what would really be there. Trying to do so via classical rendering introduces many technical challenges.
I work in game physics in the AAA industry, and I have studied and experimented with ML on my own. I'm sceptical that that's going to happen.
Imagine you want to build a model that renders a scene with the same style and quality as rasterisation. The fastest way to project a point onto the screen is to apply a matrix multiplication. If the model needs to keep the same level of spatial consistency as the rasteriser, it has to reproject points in space somehow.
But a model is made of a huge number of matrix multiplications interspersed with non-linear activations. Because of those non-linearities, it can't collapse down to the single matrix multiplication it is trying to imitate; it has to recover the linearity by approximating the transformation with many more operations.
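For concreteness (my own illustration of the point above, not from the comment): the transform a rasteriser applies per vertex is literally one matrix multiply in homogeneous coordinates, followed by a perspective divide; a network with non-linearities can only approximate that with many more operations.

```python
import numpy as np

# Classic perspective projection: one matrix multiply per point.
# Simple pinhole projection matrix with focal length f and square pixels.
f = 1.0
P = np.array([
    [f, 0, 0, 0],
    [0, f, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

point_world = np.array([0.5, -0.2, 2.0, 1.0])   # homogeneous 3D point
clip = P @ point_world                           # a single matmul
x, y = clip[0] / clip[2], clip[1] / clip[2]      # perspective divide
print(x, y)                                      # 0.25 -0.1
```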
Now, I know that transformers can exploit superposition when processing a lot of data. I also know neural networks could come up with all sorts of heuristics and approximations based on distance or other criteria.
However, I've read multiple papers showing that large models have a large number of useless parameters (the last one showed that their model could be reduced to just 4% of the original parameters, but the process they used requires re-training the model from scratch many times in a deterministic way, so it's not practical for large models).
This doesn't mean we might not end up using them anyway for real-time rendering. We could accept the trade-off and give up some coherence for more flexibility.
Or, given enough computational power, a larger model could be coherent enough for the human eye, while its much larger cost would be justified by its flexibility. In a way it's like how analog systems are much faster than digital ones, but we use digital ones anyway because they can be reprogrammed.
With frame prediction and upscaling, we have this trade-off already.
I imagine a future where the "high level" stuff in the environment is pre-defined by a human (with or without assistance from AI), and then AI sort of fills in the blanks on the fly.
So for example, a game designer might tell the AI the floor is made of mud, but won’t tell the AI what it looks like if the player decides to dig a 10 ft hole in the mud, or how difficult it is to dig, or what the mud sounds like when thrown out of the hole, or what a certain NPC might say when thrown down the hole, etc.
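One way to picture that split (a hypothetical sketch; `fill_in_details`, the field names, and the whole interface are made up for illustration, not any real engine API): the designer authors a few authoritative facts, and a generative model is queried for anything left unspecified at runtime.

```python
# Hypothetical illustration of "designer defines the high level, AI fills in the rest".
designer_spec = {
    "material": "mud",
    "diggable": True,
    "hole_depth_limit_ft": 10,
}

def fill_in_details(spec, question):
    """Stand-in for a generative model that answers anything the designer didn't
    author: appearance of a dug hole, digging difficulty, sounds, NPC reactions."""
    return f"[model-generated answer to {question!r}, constrained by {spec['material']}]"

print(fill_in_details(designer_spec, "what does a 10 ft hole in this floor look like?"))
print(fill_in_details(designer_spec, "what does the mud sound like when thrown out of the hole?"))
```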
> At some point it will be computationally cheaper to predict the next pixel than to classically render the scene,
This is already happening to some extent: some games struggle to reach 60 FPS at 4K resolution with maximum graphics settings using traditional rasterization alone, so technologies like DLSS 3 frame generation are used to improve performance.
Pass-through AR that does more than add a few things to the scene or very crude relighting from a scan mesh. Classic methods aren't great at it and tend to feel like you are just sticking some out-of-place objects on top of things. Apple gives a lighting estimate to make objects sit better in the scene, and may already be using some AI for that (I think it's just a cube map or a spherical-harmonic-based thing), but we'll want to do much more than matching lighting.
And made hands 10x worse. Now hands are good, text is good, image is good, so we’ll have to play where’s Waldo all over again trying to find the flaw. It’s going to eventually get to a point where it’s one of those infinite zoom videos where the AI watermark is the size of 1/3rd of a pixel.
What I’d really love to see more of is augmented video. Like, the stormtrooper vlogs. Runway has some good stuff but man is it all expensive.
Someone mentioned physics, which might be an interesting conundrum, because an important characteristic of games is that some part of them is both novel and unrealistic. (They're less fun if they're too real.)
I wouldn't say that the text problem has been fully fixed. It has certainly gotten a lot better, but even gpt-image-1 still fails occasionally when generating text.
This is revolutionary. I mean, we already could see this coming, but now it's here. With limitations, but this is the beginning.
In game engines it's the engineers, the software developers, who make sure triangles are at the perfect location, mapping to the correct pixels. But this here, this is now like a drawing made by a computer, frame by frame, with no triangles computed.
> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264
https://www.theguardian.com/technology/2025/aug/05/google-st...
Gemini Robot launch 4 mo ago:
https://news.ycombinator.com/item?id=43344082
Unbelievable. How is this not a miracle? So we're just stumbling onto breakthroughs?
It's basically what every major AI lab head is saying from the start. It's the peanut gallery that keeps saying they are lying to get funding.
Kind of like how a single neuron doesn't do much, but connect 100 billion of them and well...
So prescient. I definitely think this will be a thing in the near future, on a ~12-18 month time horizon.
What's with this insane desire for anthropomorphism? What do you even MEAN learn in its dreams? Fine-tuning overnight? Just say that!
He seems too enthusiastic to me, such that I suspect Google asked him in particular because they trusted him to write very positively.
Not for video games it isn’t.
I for one would love a video game where you're playing in a psychedelic, dream-like fugue.
so better than Stadia?
Disconcerting that it's daydreaming rather than authoring?
Reminds me of when image AIs weren't able to generate text. It wasn't too long until they fixed it.