So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4 as the core of the diffusion model, which is a nice reminder that open models are useful even to giant cloud monopolies.
The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing Doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to getting long-range stable ‘rendering’ out of the model.
That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.
Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.
> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected
To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).
This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.
Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].
I saw a longer video of this that Ethan Mollick posted, and in that one the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments, but you only get a few seconds of each.
If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.
What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?
A dialogue between two parties with different functionality so to speak.
Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.
Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.
So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without outside reference, everything that falls out of short-term memory (few frames here) gets reimagined.
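A crude sketch of that side-channel idea (nothing from the paper; shapes and names are made up): upscale a minimap and bolt it onto the frame as an extra conditioning channel, so the model always has a stable outside reference that never falls out of its short frame window.

```python
import numpy as np

def with_minimap(frame, minimap):
    """Attach a minimap as an extra conditioning channel.

    The model would then see a stable, low-res map of the level
    alongside the RGB frame. Shapes are illustrative only.
    """
    h, w, _ = frame.shape
    # nearest-neighbour upscale of the minimap to the frame size
    ys = (np.arange(h) * minimap.shape[0]) // h
    xs = (np.arange(w) * minimap.shape[1]) // w
    plane = minimap[ys][:, xs]
    # RGB frame + 1 minimap plane -> 4-channel conditioning input
    return np.concatenate([frame, plane[..., None]], axis=-1)

frame = np.zeros((240, 320, 3))   # a Doom-resolution frame
minimap = np.ones((32, 32))       # hypothetical occupancy minimap
x = with_minimap(frame, minimap)
assert x.shape == (240, 320, 4)
```

Whether a diffusion model would actually use that channel to stay consistent is an open question, but it is a cheap way to inject long-term memory without changing the architecture.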
I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.
Just want to clarify a couple possible misconceptions:
The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.
Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.
De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.
In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
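The noise-augmentation trick can be sketched in a few lines (names and magnitudes here are illustrative, not the paper's code): during training, corrupt the context frames with Gaussian noise at a random level and hand the model that level as extra conditioning, so at inference it can treat its own imperfect outputs as "noisy context" and stay on track.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_context(frames, max_sigma=0.7):
    """Add Gaussian noise at a random level to the context frames.

    Returns the noisy frames and a bucketed noise level, which the
    model would receive (via an embedding) as extra conditioning so
    it learns to see through the corruption.
    """
    sigma = rng.uniform(0.0, max_sigma)
    noisy = frames + sigma * rng.standard_normal(frames.shape)
    level = int(round(sigma * 10))  # discretized for an embedding lookup
    return noisy, level

frames = np.zeros((4, 64, 64, 3))  # 4 context frames, illustrative size
noisy, level = corrupt_context(frames)
assert noisy.shape == frames.shape
assert 0 <= level <= 7
```

The effect is that small autoregressive errors at inference time look, to the model, like the training-time corruption it already knows how to discount.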
The concept is that you train a diffusion model by feeding it all the possible frames seen in the game.
The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.
Now you show it N frames as input and ask it "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.
But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from the past dataset.
The drift you mentioned is actually a clear (but sad) proof that the model is no good at inventing new frames, and can only spit out an answer from the past dataset.
It's a bit like if you trained Stable Diffusion on Simpsons episodes, and it output the next frame of an existing episode that was in the training set, but a few frames later went wild and buggy.
But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".
I would call it the world's least efficient video compression.
What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?
> But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".
It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.
So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.
No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.
It's a memory of a video looped to controls: frame 1 is "I wonder how it would look if the player pressed D instead of W", then frame 2 is based on frame 1, etc., and a couple of frames in, it's already not remembering but imagining the gameplay on the fly. It's not prerecorded; it responds to inputs during generation. That's what makes it a game engine.
They could down-convert the entire model to utilize only a subset of the matrix components from Stable Diffusion. This approach might improve internet bandwidth efficiency, assuming consumers in the future have powerful enough computers.
> Google here uses SD 1.4 as the core of the diffusion model, which is a nice reminder that open models are useful even to giant cloud monopolies.
A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.
Even Google only has one Gemini (in a few versions).
Nicely summarised. Another important thing that clearly stands out (not to undermine the efforts and work gone into this) is that we are now seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.
Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.
Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.
As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.
If that is the case, what does aphantasia tell us?
Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally - I have pretty severe aphantasia (I can conjure millisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone - but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn / built etc. However I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g. I can't get lost in a new city). So there clearly is a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.
What’s the aphantasia link? I’ve got aphantasia. I’m convinced, though, that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also, I still love games like Pictionary and can somehow draw things onto paper that I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.
We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast: Counterintuitive Ideas About How the Brain Works [2], where she explains among other ideas how constant prediction is the most efficient way of running a brain as opposed to reaction. I never get tired of listening to her, she's such a great science communicator.
Interesting talk about the brain, but the stuff she says about free will is not a very good argument. Basically it is the sort of argument the ancient Greeks made, which brings the discussion to a point where you can take it in either direction.
It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.
Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are roughly two orders of magnitude faster, which would indicate that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.
There's been a lot of work in optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.
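A back-of-envelope check shows why step count is the lever that matters (the per-step cost below is an assumed illustrative number, not a measurement): frame latency scales roughly linearly with the number of denoising steps, so cutting a classic 50-step schedule down to a handful is what moves a diffusion model into real-time territory.

```python
# Assumed cost of one UNet forward pass at low resolution; illustrative.
UNET_EVAL_MS = 12

def fps(denoise_steps, eval_ms=UNET_EVAL_MS):
    """Approximate frames per second if each frame costs
    denoise_steps sequential UNet evaluations."""
    return 1000.0 / (denoise_steps * eval_ms)

assert fps(50) < 2    # classic multi-step sampling: a slideshow
assert fps(4) > 20    # few-step sampling: a playable frame rate
```

This ignores VAE decode and conditioning overhead, but the qualitative point stands: few-step samplers (SD Turbo, LCM, and the like) are what make 20 fps plausible.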
Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.
Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.
Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.
This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.
Video also effectively carries more resolution: as you move through the world, the way the pixels change encodes extra detail. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixel change.
After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.
It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.
There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.
We can't assess the quality of gameplay ourselves of course (since the model wasn't released), but one author said "It's playable, the videos on our project page are actual game play." (https://x.com/shlomifruchter/status/1828850796840268009) and the video on top of https://gamengen.github.io/ starts out with "these are real-time recordings of people playing the game". Based on those claims, it seems likely that they did get a playable system in front of humans by the end of the project (though perhaps not by the time the draft was uploaded to arXiv).
I also thought this, but refer back to the paper, not the abstract:
> A is the set of key presses and mouse movements…
> …to condition on actions, we simply learn an embedding A_emb for each action
So, it’s clear that in this model the diffusion process is conditioned by embedding A that is derived from user actions rather than words.
Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.
So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.
So yes, the users are playing it.
However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.
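The conditioning setup described above can be sketched roughly like this (all shapes and dimensions here are made up for illustration; only the structure, an action-embedding table replacing SD's text embedding plus past-frame latents concatenated with the noise latents, follows the paper's description):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 8   # e.g. move/turn/shoot; illustrative count
EMB_DIM = 16      # illustrative embedding width
action_emb = rng.standard_normal((NUM_ACTIONS, EMB_DIM))  # "A_emb", learned

def build_model_inputs(past_latents, noise_latents, actions):
    """Assemble the two conditioning streams:
    - past-frame latents concatenated with the noise latents
      along the channel axis,
    - recent actions looked up in the action-embedding table
      (standing in for SD's text-token embeddings)."""
    latents = np.concatenate([noise_latents, past_latents], axis=0)
    cond = action_emb[np.array(actions)]  # (num_actions_in_context, EMB_DIM)
    return latents, cond

past = rng.standard_normal((4 * 4, 40, 30))   # 4 frames x 4 latent channels
noise = rng.standard_normal((4, 40, 30))      # latents being denoised
latents, cond = build_model_inputs(past, noise, [0, 3, 3, 1])
assert latents.shape == (20, 40, 30)
assert cond.shape == (4, 16)
```

In the real model the UNet then attends over `cond` via cross-attention while denoising `latents`, but the key point stands: the "prompt" is the player's recent inputs, not text.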
The agent never interacts with the simulator during training or evaluation. There is no user, there is only an agent which trained to play the real game and which produced the sequences of game frames and actions that were used to train the simulator and to provide ground truth sequences of game experience for evaluation. Their evaluation metrics are all based on running short simulations in the diffusion model which are initiated with some number of conditioning frames taken from the real game engine. Statements in the paper like: "GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware." are wildly misleading.
I wonder if they could somehow feed in a trained Gaussian splats model to this to get better images?
Since the splats are specifically designed for rendering it seems like it would be an efficient way for the image model to learn the geometry without having to encode it on the image model itself.
The paper should definitely be more clear on this point, but there's a sentence in section 5.2.3 that makes me think that this was playable and played: "When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better." It may be a failure of imagination, but I can't think of another reasonable way of interpreting "playing with the model manually".
If the generative model/simulator can run at 20FPS, then obviously in principle a human could play the game in simulation at 20 FPS. However, they do no evaluation of human play in the paper. My guess is that they limited human evals to watching short clips of play in the real engine vs the simulator (which conditions on some number of initial frames from the engine when starting each clip...) since the actual "playability" is not great.
>I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly
Well, you're wrong, as stated in the first video and by the authors themselves. Maybe next time check more carefully instead of writing comments in such an authoritative tone about things you don't actually know.
I think someone is playing it, but it has a reduced set of inputs and they're playing it in a very specific way (slowly, avoiding looking back to places they've been) so as not to show off the flaws in the system.
The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.
I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".
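That "no internal state" picture is just a plain autoregressive loop; a minimal sketch (all names here are illustrative stand-ins, not the paper's code):

```python
def rollout(simulate_frame, init_frames, actions, context_len=4):
    """Autoregressive generation: each new frame is produced from a
    short window of previous frames plus the current input, then
    appended to the window. There is no other state; anything that
    scrolls out of the window is simply forgotten."""
    context = list(init_frames)
    generated = []
    for action in actions:
        frame = simulate_frame(context[-context_len:], action)
        context.append(frame)
        generated.append(frame)
    return generated

# stand-in "model": tags each output with its context size and action
dummy = lambda ctx, a: (len(ctx), a)
frames = rollout(dummy, ["f0", "f1", "f2", "f3"], ["W", "W", "D"])
assert len(frames) == 3
assert frames[-1] == (4, "D")
```

Everything the simulation "remembers" lives either in that sliding window or, implicitly, in the model weights, which is exactly why backtracking exposes inconsistencies.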
As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.
Ehhh okay, I'm not as convinced as I was earlier. Sorry for misleading. There's been a lot of back-and-forth.
I would've really liked to see a section of the paper explicitly call out that they used humans in real time. There's a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay where those agents submitted user inputs to affect the gameplay and they captured those inputs in their model. This made it a bit murky as to whether humans ever actually got involved.
This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"
led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.
I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:
"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."
and it's like.. okay.. if you have a section in results on human evaluation, and your goal is to have humans play, then why are you talking just about humans reviewing video rather than giving some sort of feedback on the human gameplay experience - even if it's not especially positive?
Still, in the Discussion section, it mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it more clear that humans gave input which went outside the bounds of the automatic agents. It doesn't seem like this would occur if it were agents simulating more input.
Ultimately, I think that the paper itself could've been more clear in this regard, but clearly the publishing website tries to be very explicit by saying upfront - "Real-time recordings of people playing the game DOOM" and it's pretty hard to argue against that.
Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.
It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...
I knew it was too good to be true, but it seems like real-time video generation can get good enough that it feels like a truly interactive video/game.
Imagine if text2game were possible. There would be some sort of network generating each frame from an image generated by text, with some underlying 3D physics simulation to keep all the multiplayer screens synced.
This paper does not seem to demonstrate that possibility; rather, it uses some clever wording to make you think people were playing in real time. We can't even generate more than 5-10 seconds of video without it hallucinating. Something this persistent would require an extreme amount of gameplay video training. It can be done, but the video shown by this paper is not true to its words.
The quest to run doom on everything continues.
Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?
I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.
>Technically speaking, isn't this the greatest possible anti-Doom
When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.
> 860M UNet and CLIP ViT-L/14 (540M)
Checkpoint size: 4.27 GB (7.7 GB with full EMA)
Running on a TPU-v5e:
Peak compute per chip (bf16): 197 TFLOPs
Peak compute per chip (int8): 393 TFLOPs
HBM2 capacity and bandwidth: 16 GB, 819 GBps
Interchip interconnect BW: 1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
Those are valid points, but irrelevant for the context of this research.
Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.
The fact that you can model a sequence of frames with such contextual awareness, without explicitly having to encode it, is the real breakthrough here, both from a pure gaming standpoint and for simulation in general.
I suppose it also doesn't really matter what kinds of resources the game originally requires. The diffusion model isn't going to require twice as much memory just because the game does. Presumably you wouldn't even necessarily need to be able to render the original game in real time; I would imagine the basic technique would work even if you used a state-of-the-art, Hollywood-quality offline renderer to render each input frame, and that the performance of the diffusion model would be similar?
>you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.
OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste
1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".
2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)
3) an interesting point I was wondering about, to compare the current state of things (I'll give you this one, it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯\_(ツ)_/¯)
4) the point that you can rip a game
I'm really not sure what you're contesting to because I said several things.
> it lacks basic things like pre-computing, storing, etc.
It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.
>What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute
That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
You're jumping ahead there, and I'm not convinced you could ever do this (unless your model is already a great physics engine). The paper itself feeds the controls into the network. But a flight sim would be harder, because you'd also need to feed in air conditions. I just don't see how you could do this from video alone, let alone just video from the cockpit. Humans could not do this. There's just not enough information.
Plus, presumably, either training it on pilot inputs (and being able to map those to joystick inputs and mouse clicks) or having the user have an identical fake cockpit to play in and a camera to pick up their movements.
And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.
The possibilities seem to go far beyond gaming (given enough computational resources).
You could feed it videos of the usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by sensors, though the action space would be much larger).
Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.
Wouldn't it make more sense to train on Microsoft Flight Simulator the same way they did DOOM? Though I'm not sure what the point is if the game already exists.
It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.
Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.
Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.
Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourself to have more fun.
The skepticism and criticism in this thread is aimed at the hype around AI; people saying "this is so amazing" imply that in some near future you'll be able to create any video game experience you can imagine by just replacing all the software with AI models rendering the whole game.
When in reality this is the least efficient and reliable form of Doom yet created, using literally millions of times the computation used by the first x86 PCs that were able to render and play doom in real-time.
Yes it's less efficient and reliable than regular Doom, but it's not just limited to Doom. You could have it simulate a very graphically advanced game that barely runs on current hardware and it would run at the exact same speed as Doom.
So true. The hustle culture is a spreading disease that has replaced the fun-maker culture of the 80s/90s.
It's unavoidable though. The cost of living being increasingly expensive, and the romanticization of entrepreneurs as if they were rock stars, leads toward this hustle mindset.
Today this exercise feels pointless. However, I remember the days when there were articles written about the possibility of "internet radio". Instead of good old broadcast waves through the ether, with simply thousands of radios tuning in, some server would send a massive number of packets over kilometers of copper to thousands of endpoints. Ad absurdum, the endpoints would even send ack packets upstream to the poor server to keep connections alive. It seemed like a huge waste of computing power, wire, and energy.
And here we are, binging netflix movies over such copper wires.
I'm not saying games will be replaced by diffusion models dreaming up next images based on user input, but a variation of that might end up in a form of interactive art creation or a new form of entertainment.
> This is a stepping stone for generating entire novel games.
I don't see how.
This game "engine" is purely mapping [pixels, input] -> new pixels. It has no notion of game state (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again), not to mention that it requires the game to already exist in order to train it.
I suppose, in theory, you could train the network to include game state in the input and output, or potentially even handle game state outside the network entirely and just make it one of the inputs, but the output would be incredibly noisy and nigh unplayable.
And like I said, all of it requires the game to already exist in order to train the network.
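One way to picture that hybrid (purely a sketch; the paper does nothing like this, and all names are hypothetical) is to keep the state machine classical and treat the network as a conditioned renderer:

```python
def hybrid_step(render_frame, update_state, state, frames, action):
    """Hybrid sketch: authoritative game state (health, ammo, which
    enemies are dead) lives outside the network and is updated with
    ordinary game logic; the network only renders, conditioned on
    recent frames, the action, and the state."""
    state = update_state(state, action)
    frame = render_frame(frames[-4:], action, state)
    return state, frame

# toy stand-ins for the two components
update = lambda s, a: {**s, "ammo": s["ammo"] - (a == "shoot")}
render = lambda ctx, a, s: ("frame", s["ammo"])

state = {"ammo": 50}
state, frame = hybrid_step(render, update, state, ["f"] * 4, "shoot")
assert state["ammo"] == 49
assert frame == ("frame", 49)
```

This sidesteps the "enemy comes back to life" problem by construction, at the cost of needing the game logic to exist anyway, which is exactly the tension the comment above points at.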
Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component to propel your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car, nor a road to drive on, to do its job.
The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.
This seems more a critique of the particular model (as in the resulting trained diffusion model), not of diffusion models in general. It's also a bit misstated: this doesn't require a working car on the road to do its job (present tense); it required one to train it to do its job (past tense). And it's not particularly clear why a game engine built from concepts gained from how another worked should cease to be a game engine. For diffusion models in general, rather than the specifically trained example here, I don't see why one would assume the approach can't also work outside the particular "test tracks" it was trained on, just as a typical diffusion model works on more than generating the exact images it was trained on (it can interpolate and apply individual concepts to create a novel output).
my point is something else: a game engine is something which can be separated from a game and put to use somewhere else. this is basically the definition of „engine“. the above is not an engine but a game without any engine at all, and therefore should not be called an „engine“.
In a way this is a "simulated game engine", trained from actual game engine data. But I would argue a working simulated game engine becomes a game engine of its own, as it is then able to "propel the game", as you say. The way it achieves this becomes irrelevant: in one case the content was crafted by humans, in the other it mimics existing game content; the player really doesn't care!
> An engine would also work offroad.
Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places. I'd even say extrapolation capabilities of such a model could be better than a traditional game engine, as it can make things up as it goes, while if you accidentally cross a wall in a typical game engine the screen goes blank.
The game Doom is more than a game engine, isn't it? I'd be okay with calling the above a „simulated game“ or a „game“. My point is: let's not conflate it with the idea of a „game engine“, which is a construct of intellectual concepts put together to create a simulation of „things happening in time“ and derive output (audio and visual). the engine is fed with input and data (levels and other assets) and then drives (EDIT) a „game“.
training the model with a final game will never give you an engine. maybe a „simulated game“ or even a „game“, but certainly not an „engine“. the latter would mean the model is capable of deriving and extracting the technical and intellectual concepts and applying them elsewhere.
> Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places.
They easily could have demonstrated this by seeding the model with images of Doom maps which weren't in the training set, but they chose not to. I'm sure they tried it and the results just weren't good, probably morphing the map into one of the ones it was trained on at the first opportunity.
There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!
Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.
IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.
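The "agents generating infinite training data" loop is conceptually simple. Here is a minimal sketch in plain Python; the paper actually trains a PPO agent in a ViZDoom environment, and `env`, `policy`, and every name below are hypothetical stand-ins, not the paper's code:

```python
def collect_trajectories(env, policy, episodes, max_steps=1000):
    """Let an agent play and log (frame, action) pairs, which become
    training data for the generative model. The agent's job is only
    to visit enough of the state space; coverage gaps (the 'nooks and
    corners' the paper mentions) become blind spots in the simulator."""
    data = []
    for _ in range(episodes):
        frame = env.reset()
        for _ in range(max_steps):
            action = policy(frame)
            next_frame, done = env.step(action)
            data.append((frame, action))   # record pre-step frame + action
            frame = next_frame
            if done:
                break
    return data
```

The open-world concern above maps directly onto this loop: if `policy` never reaches a state, no `(frame, action)` pair for it ever enters `data`.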
With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.
At which point, you effectively would be interpolating in latent space through the source code to actually "render" the game. You'd have an entire latent space computer, with an engine, assets, textures, a software renderer.
With a sufficiently powerful computer, one could imagine interpolating in this latent space between, say, Factorio and TF2 (2 of my favorites), and tweaking that latent space to your liking by conditioning it on any number of gameplay aspects.
This future comes very quickly for subsets of the pipeline, like the very end stage of rendering -- DLSS is already in production, for example. Maybe Nvidia's revenue wraps back to gaming once again, as we all become bolted into a neural metaverse.
> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.
Neural nets are not guaranteed to converge to anything even remotely optimal, so no, that isn't how it works. Also, even though neural nets can approximate any function, they usually can't do it in a time- or space-efficient manner, resulting in much larger programs than the human-written code.
The Holographic Principle is the idea that our universe is a projection of a higher dimensional space, which sounds an awful lot like the total simulation of an interactive environment, encoded in the parameter space of a neural network.
The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?
Similarly, you could run a very very simple game engine, that outputs little more than a low resolution wireframe, and upscale it. Put all of the effort into game mechanics and none into visual quality.
I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.
If it can be trained on (many) existing games, then it might work similarly to how you don't need to describe every possible detail of a generated image in order to get something that looks like what you're asking for (and looks like a plausible image for the underspecified parts).
Video games are gonna be wild in the near future. You could have one person talking to a model producing something that's on par with a AAA title from today. Imagine the 2d sidescroller boom on Steam but with immersive photorealistic 3d games with hyper-realistic physics (water flow, fire that spreads, tornados) and full deformability and buildability because the model is pretrained with real world videos. Your game is just a "style" that tweaks some priors on look, settings, and story.
Sorry, no offence, but you sound like those EA execs who wear expensive suits and have never played a single video game in their lives. There’s a great documentary on how Half-Life was made. Gabe Newell was interviewed by someone asking “why did you do this and that, it’s not realistic”, to which he answered “because it’s more fun this way; if you want realism, just go outside”.
Most games are conditioned on text, it's just that we call it "source code" :).
(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)
Anyway, a fun idea that worked! Love those.
[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...
If I studied the longer one more closely, I'm sure I'd spot inconsistencies, but it seemed able to recall the presence or absence of destroyed items, dead monsters, etc. on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.
What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?
A dialogue between two parties with different functionality so to speak.
(Non technical person here - just fantasizing)
If, on the backend, you could record the level layouts in memory, you could have exploration teams that try to find new areas to explore.
or do we believe it's an inherent limitation in the approach?
The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause and effect. It just renders one frame at a time (after all, it’s a text-to-image model, not a text-to-video one). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.
Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.
De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.
In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
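The noise-augmentation idea can be sketched roughly as follows. This is a simplified NumPy illustration: the real method corrupts latents and feeds a learned embedding of the noise level to the denoiser, and none of these names come from the paper:

```python
import numpy as np

def noise_augment_context(context_frames, rng, max_sigma=0.7):
    """Corrupt the conditioning frames with Gaussian noise of a random
    strength, and return that strength so the model can be conditioned
    on it. Training against corrupted context teaches the denoiser to
    'clean up' its own accumulated autoregressive errors at inference
    time instead of amplifying them into drift."""
    sigma = rng.uniform(0.0, max_sigma)
    noised = [f + rng.normal(0.0, sigma, size=f.shape) for f in context_frames]
    return noised, sigma
```

At inference, the model is simply told a small fixed noise level, so it treats its own slightly-off previous outputs as something to correct rather than to trust.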
The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.
Now you show it N frames as input and ask it to "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.
But it is not frame N+1 from a mysterious intelligence; it's simply frame N+1 given back from the past dataset.
The drift you mentioned is actually a clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.
It's a bit like training stable diffusion on Simpsons episodes: it outputs the next frame of an existing episode that was in the training set, but a few frames later goes wild and buggy.
I would call it the world's least efficient video compression.
What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?
It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.
So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.
A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.
Even Google only has one Gemini (in a few versions).
Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
You didn't say open _what_ models. Was that intentional?
Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.
If that is the case, what does aphantasia tell us?
[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...
[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett
[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s
Yup, see https://en.wikipedia.org/wiki/Predictive_coding
It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)
It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.
Something doesn't add up, though, in my opinion. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are roughly two orders of magnitude faster, which would indicate that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.
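A hedged back-of-envelope may close some of the gap: real-time diffusion setups cut the sampling step count drastically and render at low resolution, so per-frame denoising work can shrink by well over an order of magnitude versus a stock SD pipeline. All numbers below are illustrative assumptions, not figures measured from the paper:

```python
# Typical offline SD sampling vs. an aggressively truncated real-time setup.
full_sd_steps = 50          # common DDIM step count for quality images
realtime_steps = 4          # assumed few-step sampling to hit 20 FPS
pixels_full = 512 * 512     # standard SD 1.x output resolution
pixels_game = 320 * 240     # the low-res game frames

work_ratio = (full_sd_steps / realtime_steps) * (pixels_full / pixels_game)
print(f"~{work_ratio:.0f}x less denoising work per frame")  # ~43x
```

Roughly 40x less work turns "seconds per image" into "tens of milliseconds per frame" before any hardware difference is considered, so TPU-vs-GPU need not account for the whole speedup.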
This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.
It’s just the stochastic parrot argument again.
This is an incredibly complex hypothesis that doesn't really seem justified by the evidence
It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.
There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.
> A is the set of key presses and mouse movements…
> …to condition on actions, we simply learn an embedding A_emb for each action
So, it’s clear that in this model the diffusion process is conditioned on an embedding derived from user actions rather than words.
Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.
So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.
So yes, the users are playing it.
However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.
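A toy version of that action-conditioning step might look like the following. Sizes and names are invented for illustration; the real A_emb is a learned table trained jointly with the denoiser:

```python
import numpy as np

class ActionEmbedding:
    """Lookup table standing in for the paper's A_emb: each discrete
    action (key press / mouse-movement bucket) gets a vector that is
    fed to the denoiser where a text embedding would normally go."""

    def __init__(self, num_actions, dim, seed=0):
        rng = np.random.default_rng(seed)
        # In training these rows would be learned; here they are random.
        self.table = rng.normal(size=(num_actions, dim))

    def __call__(self, action_ids):
        # Map a sequence of action ids to their conditioning vectors.
        return self.table[np.asarray(action_ids)]
```

The point is simply that "prompt" is a very thin abstraction: swap the text encoder's output for rows of this table and the same conditioning machinery drives the model with gameplay inputs.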
Since the splats are specifically designed for rendering, it seems like they would be an efficient way for the image model to learn the geometry without having to encode it in the image model itself.
https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"
> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.
The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.
Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.
Well, you're wrong, as specified in the first video and by the authors themselves. Maybe next time check more carefully instead of writing comments in such an authoritative tone about things you don't actually know.
The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.
I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".
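That stateless, sliding-window view of the system can be sketched as a short loop. All names are hypothetical; `model` stands in for the conditioned diffusion step:

```python
from collections import deque

import numpy as np

def run_neural_game(model, first_frames, get_input, steps, context_len=4):
    """Autoregressive loop with no hidden game state: only the last
    `context_len` frames plus the current input condition each new
    frame, so anything off-screen longer than the window is forgotten."""
    ctx = deque(first_frames, maxlen=context_len)
    frames_out = []
    for _ in range(steps):
        frame = model(list(ctx), get_input())
        ctx.append(frame)        # oldest context frame falls out here
        frames_out.append(frame)
    return frames_out
```

The `deque` makes the comment's point concrete: there is no level geometry or enemy state anywhere in the loop, just whatever survives inside the window (and whatever regularities the weights memorized).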
As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.
I would've really liked to see a section of the paper explicitly call out that they used humans in real time. There's a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay where those agents submitted user inputs to affect the gameplay and they captured those inputs in their model. This made it a bit murky as to whether humans ever actually got involved.
This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"
led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.
I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:
"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."
and it's like.. okay.. if you have a section in results on human evaluation, and your goal is to have humans play, then why are you talking just about humans reviewing video rather than giving some sort of feedback on the human gameplay experience - even if it's not especially positive?
Still, in the Discussion section, it mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it more clear that humans gave input which went outside the bounds of the automatic agents. It doesn't seem like this would occur if it were agents simulating more input.
Ultimately, I think that the paper itself could've been more clear in this regard, but clearly the publishing website tries to be very explicit by saying upfront - "Real-time recordings of people playing the game DOOM" and it's pretty hard to argue against that.
Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.
Imagine if text2game were possible. There would be some sort of network generating each frame from an image generated by text, with some underlying 3D physics simulation to keep all the multiplayer screens synced.
This paper does not seem to demonstrate that possibility; rather, it is worded cleverly to make you think people were playing in real time. We can't even generate more than 5-10 seconds of video without it hallucinating; something this persistent would require an extreme amount of gameplay video training. It can be done, but the video shown by this paper is not true to its words.
When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.
Isn't that possible by setting arbitrarily high goals for ray-cast rendering?
Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.
The next step of anti-Doom would be a model generating the model, generating the Doom output.
What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google's ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
- https://cloud.google.com/tpu/docs/v5e
- https://github.com/Farama-Foundation/ViZDoom
- https://zdoom.org/index
Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all of that can either be done at the margin of this discovery, OR will naturally improve over time, OR will become less important as a blocker.
The fact that you can model a sequence of frames with such contextual awareness, without explicitly having to encode it, is the real breakthrough here, both from a pure gaming standpoint and for simulation in general.
OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste
1) the model has enough memory to store not only all the game assets and the engine but even hundreds of "plays".
2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)
3) an interesting point I was wondering about, to compare the current state of things (I mean, I'll give you this one, but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯\_(ツ)_/¯)
4) the point that you can rip a game
I'm really not sure what you're contesting to because I said several things.
It does? Last I checked, neural nets store information. I guess I need to return my PhD, because there's a UNet in SD 1.4 and it contains a decoder. That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
I guess that's the occasion to remind everyone that ML is splendid at interpolating, but as for extrapolating, maybe don't keep your hopes too high.
Namely, to have a "perfect flight sim" using GoPros, you'd need to record hundreds of stalls and crashes.
And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.
You can feed it with videos of usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by sensors, though the action space would be much larger).
Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.
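Sketching that idea as a minimal Gym-like environment (reset/step only; the reward and termination logic are placeholders, and nothing here is an existing API):

```python
class NeuralSimEnv:
    """Minimal Gym-style wrapper around a hypothetical learned world
    model, to show how a neural 'game engine' could serve as an RL
    training environment. `world_model(frame, action)` stands in for
    one step of the generative simulator."""

    def __init__(self, world_model, initial_frame):
        self.model = world_model
        self.initial_frame = initial_frame
        self.frame = initial_frame

    def reset(self):
        self.frame = self.initial_frame
        return self.frame

    def step(self, action):
        self.frame = self.model(self.frame, action)
        reward = 0.0   # a real setup would decode reward from the frames
        done = False   # ...and detect episode termination likewise
        return self.frame, reward, done, {}
```

The hard parts this sketch hides are exactly the reward and termination lines: a pixel-only simulator gives you observations for free but no ground-truth game state to score against.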
Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.
Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.
Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourself to have more fun.
When in reality this is the least efficient and least reliable form of Doom yet created, using literally millions of times the computation of the first x86 PCs that were able to render and play Doom in real time.
But it's a funny party trick, sure.
It's unavoidable though. The rising cost of living and the romanticization of entrepreneurs as rock stars push people toward this hustle mindset.
And here we are, binging netflix movies over such copper wires.
God I love that they chose DOOM.
https://news.ycombinator.com/item?id=41377398
Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.
By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking
Not really. This is a reproduction of the first level of Doom. Nothing original is being created.