- Open Source Diamond WM that you can run on consumer hardware [1]
- Google's Genie 2 (way better than this) [2]
- Oasis [3]
[1] https://diamond-wm.github.io/
[2] https://deepmind.google/discover/blog/genie-2-a-large-scale-...
[3] https://oasis.decart.ai/welcome
There are a lot of papers and demos in this space. They have the same artifacts.
From our perspective, what separates our work is two things:
1. Our model can be experienced by anyone today, in real-time at 30 FPS.
2. Our data domain is the real world, meaning the model learns life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.
Additionally, I'm curious what exactly the difference is between the new mode of storytelling you’re describing and something like a CRPG or visual novel - is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
> Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models?
I absolutely think there are going to be super cool startups that accelerate film and game dev as it exists today, inside existing 3D engines. Those workflows could be made much faster with generative models.
That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that these experiences might not be as weird and wacky as they could be if constrained by the heuristics and limitations of existing 3D engines. This is our focus, and why the model is video-in and video-out.
Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
> Additionally, I'm curious what exactly the difference is between the new mode of storytelling you’re describing and something like a CRPG or visual novel
To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
> is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
Yes, exactly. The model just learns better this way (instead of breaking it down into discrete components) and I think the end experience will be weirder and more wonderful for it.
We talk about this challenge in our blog post here (https://odyssey.world/introducing-interactive-video). There's specifics in there on how we improved coherence for this production model, and our work to improve this further with our next-gen model. I'm really proud of our work here!
> Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
In second place would absolutely be model optimization to hit real-time. That's a gnarly problem, where you're delicately balancing model intelligence, resolution, and frame-rate.
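To make that balancing act concrete, here's a rough back-of-envelope frame budget. Every number is an illustrative assumption on my part rather than a measured figure from our stack; the point is just how few denoising steps fit inside a real-time frame.

```python
# Rough frame-budget arithmetic for a real-time diffusion world model.
# All numbers below are illustrative assumptions, not measured figures.

TARGET_FPS = 30
FRAME_BUDGET_MS = 1000 / TARGET_FPS        # ~33 ms available per frame

denoise_step_ms = 4.0   # assumed cost of one denoising step at this resolution
decode_ms = 5.0         # assumed cost of decoding latents to pixels
overhead_ms = 3.0       # assumed cost of action encoding, context update, I/O

max_steps = int((FRAME_BUDGET_MS - decode_ms - overhead_ms) // denoise_step_ms)
print(f"{FRAME_BUDGET_MS:.1f} ms per frame -> at most {max_steps} denoising steps")
```

Under assumptions like these you only get a handful of sampling steps per frame, so any gain in model size or resolution has to be paid for somewhere else in the budget.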
I got to the graffiti world and there were some stairs right next to me, so I started going up them. It felt like I was walking forward and the stairs were pushing under me until I just got stuck. So I turned to go back down, and halfway around everything morphed and I ended up back at the ground level where I originally was. I was teleported. That's why I feel like something is cheating here. If we had mode collapse, I'm not sure how we'd be able to completely recover the entire environment. Not unless the model is building mini worlds with boundaries. It was like the out-of-bounds teleportation you get in some games, but way more fever-dream-like. That's not what we want from these systems: we don't want to just build a giant, poorly compressed video game, we want continuous generation. If you have mode collapse and recover, it should recover to somewhere new, not where you've been. At least, this is what makes me highly suspicious.
To clarify: this is a diffusion model trained on lots of video, learning realistic pixels and actions. The model takes in the prior video frame and a user action (e.g. move forward), then generates a new video frame that reflects the intended action. This loop runs every ~40ms, so it's real-time.
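Schematically, that loop looks something like the sketch below. The object and method names (`world_model.generate_next_frame`, `controller.read_action`, and so on) are hypothetical placeholders rather than our actual API; the point is just that the model is conditioned on its own recent output plus the latest user action, once per frame.

```python
import time
from collections import deque

FRAME_INTERVAL_S = 0.040  # ~40 ms per frame, i.e. roughly real-time

def run_interactive_loop(world_model, controller, display, context_len=16):
    """Hypothetical sketch of an action-conditioned autoregressive video loop."""
    context = deque(maxlen=context_len)   # rolling window of recent frames

    while True:
        start = time.monotonic()

        action = controller.read_action()            # e.g. "move_forward"
        frame = world_model.generate_next_frame(     # diffusion sampling step(s)
            context=list(context), action=action
        )
        display.show(frame)

        # The generated frame is fed back into the model's context;
        # this feedback is what makes long rollouts prone to drift.
        context.append(frame)

        # Hold the loop to the target frame interval.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, FRAME_INTERVAL_S - elapsed))
```

Because each generated frame goes back into the context, small errors can compound over a long rollout, which is exactly the stability challenge described above.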
The reason you're seeing similar worlds with this production model is that one of the greatest challenges of world models is maintaining coherence of video over long time periods, especially with diverse pixels (i.e. not a single game). So, to increase reliability for this research preview—meaning multiple minutes of coherent video—we post-trained this model on video from a smaller set of places with dense coverage. With this, we lose generality, but increase coherence.
We share a lot more about this in our blog post here (https://odyssey.world/introducing-interactive-video), and share outputs from a more generalized model.
> One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics.
> To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.
> To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.
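If it helps, the pre-train/post-train split described in that excerpt has the same rough shape as any two-stage recipe. The sketch below is generic: the corpus names, the `train_step` function, and the learning-rate scaling are my own assumptions for illustration, not our actual training stack.

```python
def train_world_model(model, broad_video_corpus, dense_coverage_corpus, train_step):
    """Generic two-stage sketch: broad pre-training, then narrow post-training.

    `train_step` and both corpora are hypothetical placeholders.
    """
    # Stage 1: pre-train on a broad distribution of real-world video so the
    # model learns general pixels, dynamics, and actions.
    for batch in broad_video_corpus:
        train_step(model, batch)

    # Stage 2: post-train on video from a smaller set of places with dense
    # coverage. This narrows the distribution: some generality is lost, but
    # long autoregressive rollouts stay coherent for much longer.
    for batch in dense_coverage_corpus:
        train_step(model, batch, lr_scale=0.1)  # assumed lower fine-tuning LR
```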
Let me know any questions. Happy to go deeper!
We think it's a glimpse of a totally new medium of entertainment, where models imagine compelling experiences in real-time and stream them to any screen.
Once you've taken the research preview for a whirl, you can learn a lot more about our technical work behind this here (https://odyssey.world/introducing-interactive-video).