- Open Source Diamond WM that you can run on consumer hardware [1]
- Google's Genie 2 (way better than this) [2]
- Oasis [3]
[1] https://diamond-wm.github.io/
[2] https://deepmind.google/discover/blog/genie-2-a-large-scale-...
[3] https://oasis.decart.ai/welcome
There are a lot of papers and demos in this space. They have the same artifacts.
From our perspective, what separates our work is two things:
1. Our model can be experienced by anyone today, in real-time at 30 FPS.
2. Our data domain is the real world, meaning the model learns life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.
Additionally, I'm curious what exactly the difference is between the new mode of storytelling you’re describing and something like a CRPG or visual novel - is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
> Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models?
I absolutely think there are going to be super cool startups that accelerate film and game dev as it exists today, inside existing 3D engines. Those workflows could be made much faster with generative models.
That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that these experiences might not be as weird and wacky as they could be if constrained by the heuristics and limitations of existing 3D engines. This is our focus, and why the model is video-in and video-out.
Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
> Additionally, I'm curious what exactly the difference is between the new mode of storytelling you’re describing and something like a CRPG or visual novel
To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
> is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
Yes, exactly. The model just learns better this way (instead of breaking it down into discrete components) and I think the end experience will be weirder and more wonderful for it.
We talk about this challenge in our blog post here (https://odyssey.world/introducing-interactive-video). There's specifics in there on how we improved coherence for this production model, and our work to improve this further with our next-gen model. I'm really proud of our work here!
> Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
In second place would absolutely be model optimization to hit real-time. That's a gnarly problem, where you're delicately balancing model intelligence, resolution, and frame-rate.
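To make that balancing act concrete, here's a rough back-of-envelope frame budget. Every number is an illustrative assumption on my part rather than a measured figure from our stack; the point is just how few denoising steps fit inside a real-time frame.

```python
# Rough frame-budget arithmetic for a real-time diffusion world model.
# All numbers below are illustrative assumptions, not measured figures.

TARGET_FPS = 30
FRAME_BUDGET_MS = 1000 / TARGET_FPS        # ~33 ms available per frame

denoise_step_ms = 4.0   # assumed cost of one denoising step at this resolution
decode_ms = 5.0         # assumed cost of decoding latents to pixels
overhead_ms = 3.0       # assumed cost of action encoding, context update, I/O

max_steps = int((FRAME_BUDGET_MS - decode_ms - overhead_ms) // denoise_step_ms)
print(f"{FRAME_BUDGET_MS:.1f} ms per frame -> at most {max_steps} denoising steps")
```

Under assumptions like these you only get a handful of sampling steps per frame, so any gain in model size or resolution has to be paid for somewhere else in the budget.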
I got to the graffiti world and there were some stairs right next to me, so I started going up them. It felt like I was walking forward and the stairs were pushing under me until I just got stuck. So I turned to go back down, and halfway around everything morphed and I ended up back at the ground level where I originally was. I was teleported. That's why I feel like something is cheating here. If we had mode collapse, I'm not sure how we'd be able to completely recover the entire environment. Not unless the model is building mini worlds with boundaries. It was like the out-of-bounds teleportation you get in some games, but way more fever-dream-like. That's not what we want from these systems: we don't want to just build a giant, poorly compressed video game, we want continuous generation. If you have mode collapse and recover, it should recover to somewhere new, not where you've been. At least, this is what makes me highly suspicious.
To clarify: this is a diffusion model trained on lots of video, learning realistic pixels and actions. The model takes in the prior video frame and a user action (e.g. move forward), then generates a new video frame that reflects the intended action. This loop runs every ~40ms, so it's real-time.
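Schematically, that loop looks something like the sketch below. The object and method names (`world_model.generate_next_frame`, `controller.read_action`, and so on) are hypothetical placeholders rather than our actual API; the point is just that the model is conditioned on its own recent output plus the latest user action, once per frame.

```python
import time
from collections import deque

FRAME_INTERVAL_S = 0.040  # ~40 ms per frame, i.e. roughly real-time

def run_interactive_loop(world_model, controller, display, context_len=16):
    """Hypothetical sketch of an action-conditioned autoregressive video loop."""
    context = deque(maxlen=context_len)   # rolling window of recent frames

    while True:
        start = time.monotonic()

        action = controller.read_action()            # e.g. "move_forward"
        frame = world_model.generate_next_frame(     # diffusion sampling step(s)
            context=list(context), action=action
        )
        display.show(frame)

        # The generated frame is fed back into the model's context;
        # this feedback is what makes long rollouts prone to drift.
        context.append(frame)

        # Hold the loop to the target frame interval.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, FRAME_INTERVAL_S - elapsed))
```

Because each generated frame goes back into the context, small errors can compound over a long rollout, which is exactly the stability challenge described above.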
The reason you're seeing similar worlds with this production model is that one of the greatest challenges of world models is maintaining coherence of video over long time periods, especially with diverse pixels (i.e. not a single game). So, to increase reliability for this research preview—meaning multiple minutes of coherent video—we post-trained this model on video from a smaller set of places with dense coverage. With this, we lose generality, but increase coherence.
We share a lot more about this in our blog post here (https://odyssey.world/introducing-interactive-video), and share outputs from a more generalized model.
> One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics.
> To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.
> To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.
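If it helps, the pre-train/post-train split described in that excerpt has the same rough shape as any two-stage recipe. The sketch below is generic: the corpus names, the `train_step` function, and the learning-rate scaling are my own assumptions for illustration, not our actual training stack.

```python
def train_world_model(model, broad_video_corpus, dense_coverage_corpus, train_step):
    """Generic two-stage sketch: broad pre-training, then narrow post-training.

    `train_step` and both corpora are hypothetical placeholders.
    """
    # Stage 1: pre-train on a broad distribution of real-world video so the
    # model learns general pixels, dynamics, and actions.
    for batch in broad_video_corpus:
        train_step(model, batch)

    # Stage 2: post-train on video from a smaller set of places with dense
    # coverage. This narrows the distribution: some generality is lost, but
    # long autoregressive rollouts stay coherent for much longer.
    for batch in dense_coverage_corpus:
        train_step(model, batch, lr_scale=0.1)  # assumed lower fine-tuning LR
```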
Let me know any questions. Happy to go deeper!
We think it's a glimpse of a totally new medium of entertainment, where models imagine compelling experiences in real-time and stream them to any screen.
Once you've taken the research preview for a whirl, you can learn a lot more about our technical work behind this here (https://odyssey.world/introducing-interactive-video).