"If you were dreaming in Minecraft" is the impression that I get. It feels very much like a dream with the lack of object permanence. Also interesting is light level. If you stare at something dark for a while or go "underwater" and get to the point where the screen is black, it's difficult to get back to anything but a black screen. (I didn't manage it in my one playthrough)
I don't see how you design and ship a game like this. You can't design a game by setting model weights directly. I do see how you might clone a game, eventually without all the missing things like object permanence and other long-term state. But the inference engine is probably more expensive to run than the game engine it (somewhat) emulates.
What is this tech useful for? Genuine question from a long-time AI person.
Yep! Which is why a key point for our next models is to get to a state where you can "code" a new world using "prompting". I agree that these tools become insanely useful only once there is a very good way for creators to "develop" new worlds/games on top of these systems, and users can then interact with those worlds.
At the end of the day, it should provide the same "API" as a game engine does: creators develop worlds, users interact with those worlds. The nice thing is that if AI can actually fill this role, then it would be:
1. Potentially much easier to create worlds/games (you could just "talk" to the AI -- "add a flying pink elephant here")
2. Users could interact with a world that could change to fit each game session -- this is truly infinite worlds
Last point: are we there yet? Ofc not! Oasis v1 is a first POC. Wait just a bit more for v2 ;)
Obviously this tool is not going to generate a "ship"pable game for you. AI is a long way off from that. As for "design", I don't find it very hard to see how incredibly useful being able to rapidly prototype a game would be, even if it requires massive GPU usage. And papers like these are only stepping stones to getting there.
I found the visual artifacts annoying. I wonder if anyone has trained models on pre-rasterization game engine output like meshes/materials, camera state, or even just the raw OpenGL calls. An AI that generates inputs to an actual renderer/engine would solve the visual-fidelity problem.
> I don't see how you design and ship a game like this. You can't design a game by setting model weights directly. I do see how you might clone a game
Easy. You do the same thing that we did with AI images, except with video game world models.
I.e., you combine multiple of them, taking bits and pieces of each game's "world model"; put together, that's almost like creating an entirely new game.
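For what it's worth, one concrete reading of "the same thing that we did with AI images" is checkpoint merging, the way people blend fine-tuned image-model checkpoints. A minimal sketch of that idea, assuming two world-model checkpoints with identical architectures (the function and file names are made up for illustration, not anything Oasis ships):

```python
import torch

def merge_world_models(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints that share an architecture.

    alpha = 0.0 keeps model A, alpha = 1.0 keeps model B; values in between
    blend the two learned "world models".
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {
        name: torch.lerp(state_a[name].float(), state_b[name].float(), alpha)
        for name in state_a
    }

# e.g. blend a Minecraft-style checkpoint with one trained on another game:
# merged = merge_world_models(torch.load("minecraft.pt"), torch.load("racer.pt"), alpha=0.3)
```

Whether a naive weight blend actually yields a coherent hybrid game is an open question; with image models people also reach for adapters or per-layer merge weights rather than a single global alpha.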
> eventually without all the missing things like object permanence and other long-term state.
Well, just add in those other things with a much smaller set of variables. You are already sending the whole previous frame, plus the user input, into the model; I see no reason why you couldn't send in a simplified game state as well.
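Concretely, that just means widening the conditioning: instead of (previous frames, user input), the model would also get a compact, explicitly tracked game state. A rough sketch of what that extra input could look like; all the field names and the encoding are hypothetical, not the actual Oasis interface:

```python
from dataclasses import dataclass
import torch

@dataclass
class StepConditioning:
    """Everything the next-frame model would be conditioned on for one step."""
    prev_frames: torch.Tensor      # (T, C, H, W) recent generated frames
    actions: torch.Tensor          # (T, A) encoded keyboard/mouse input per frame
    # the "simplified game state" suggested above, tracked outside the model:
    player_pos: tuple = (0.0, 0.0, 0.0)   # world coordinates
    time_of_day: float = 0.0              # 0..1, so night doesn't drift into day
    inventory_size: int = 0

def to_model_inputs(cond: StepConditioning) -> dict:
    """Flatten the explicit state into a small vector the transformer can attend to."""
    state_vec = torch.tensor([*cond.player_pos, cond.time_of_day, float(cond.inventory_size)])
    return {"frames": cond.prev_frames, "actions": cond.actions, "state": state_vec}
```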
I think it's an interesting tech demo. You're right that as-is it's not useful. Here are some long-term things I could imagine:
1. Scale it up so that it has a longer context length than a single frame. If it could observe the last million frames, for example, that would allow significantly more temporal consistency.
2. RAG-style approaches. Generate a simple room-by-room level map (basically just empty bounding boxes), and allow the model to read the map as part of its input rather than simply looking at frames. And when your character is in a bounding box the model has generated before, give N frames of that earlier generation as context for the current frame generation (perhaps specifically the frames with the same camera direction, or the closest to it). That would probably result in near-perfect temporal consistency even over very long generations and timeframes, assuming the frame context length was long enough. (A toy sketch of this retrieval idea follows the list.)
3. Train on larger numbers of games, and text modalities, so that you can describe a desired game and get something similar to your description (instead of needing to train on a zillion Minecraft runs just to get... Minecraft.)
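To make point 2 concrete, here is a toy sketch of the retrieval side: cache generated frames per bounding box, keyed by camera yaw, and pull the closest matches back in as extra context whenever the player re-enters that box. Everything here (the class, its fields, keying on yaw) is illustrative, not from the Oasis work:

```python
from collections import defaultdict

class FrameCache:
    """Per-bounding-box cache of previously generated frames."""

    def __init__(self, n_context: int = 8):
        self.n_context = n_context
        self.by_box = defaultdict(list)   # box_id -> [(camera_yaw_degrees, frame), ...]

    def add(self, box_id: int, yaw: float, frame) -> None:
        self.by_box[box_id].append((yaw, frame))

    def retrieve(self, box_id: int, yaw: float) -> list:
        """Return up to n_context cached frames from this box, closest in camera direction."""
        def angular_dist(a: float, b: float) -> float:
            d = abs(a - b) % 360.0
            return min(d, 360.0 - d)

        candidates = self.by_box.get(box_id, [])
        ranked = sorted(candidates, key=lambda item: angular_dist(item[0], yaw))
        return [frame for _, frame in ranked[: self.n_context]]

# During generation: prepend cache.retrieve(current_box, camera_yaw) to the model's
# frame context so a revisited room stays consistent with what was drawn before.
```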
That being said, I think in the near term it'll be much more fruitful to generate game assets and put them in a traditional game engine (or generate assets and have an LLM generate code to place them in an engine) rather than trying to go end-to-end from keyboard+mouse input to video frames with nothing structured in between.
Eventually the end-to-end model will probably win unless scaling limits get hit, as per the Bitter Lesson [1], but that's a long eventually, and TBH at that scale there really may just be fundamental scaling issues compared to assets+code approaches.
It's still pretty cool though! And it seems useful from a research perspective to show what can already be done at the current scale. There's also no guarantee the scaling limits will exist such that this is impossible; betting against scaling LLMs during the GPT-2 era would've been a bad bet. Games in particular are very nice to train against, since you can just run the game on a computer and get near-infinite ground-truth data synthetically. You could probably also do some very clever things to get better-than-an-engine results by training on real-life video with faked keyboard/mouse inputs, such that eventually you'd be better in terms of both graphics and physics than a game engine could ever hope to be.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
I ctrl-F'ed the webpage and saw zero occurrences of "Minecraft". Why? This isn't a video game; it's a poor copy of a real video game you didn't even bother to name, let alone credit.
It seems like an interesting attempt to get around legal issues.
They can't say "Minecraft" because that's a Microsoft trademark, but they can use Minecraft images as training data, because everyone (including Microsoft) is using proprietary data to train diffusion models. There's the issue that the outputs obviously resemble Minecraft, but Microsoft has its own problems with Bing and DALL-E generating images that obviously resemble trademarked things (despite guardrails).
Avoiding legal and ethical issues. This stuff was made by a bunch of real people, and people still get their names in movie and game credits even if they got a paycheck for the work. Microsoft shamelessly vacuuming up proprietary content didn't change the norms of how people get credited in these mediums. It's sad to see how thoughtlessly so many people using generative AI dismiss the models' source material as "data", while the models (and therefore their creators) almost always get prominently credited for putting in a tiny fraction of the effort. The dubious ethical defense against crediting source works (that models learn about media the same way humans do and adapt it to suit their purposes) is obliterated when a model is trained on one work in order to reproduce that work. Equating this to generating an image on Midjourney is a blatant example of a common practice: people want to get credit for other people's work, but when it's time to take responsibility the way a human artist would have to, "The machine did it! It's not my responsibility!"
There is one mention of Minecraft in the second paragraph of the Architecture section, "...We train on a subset of open-source Minecraft video data collected by OpenAI[9]." I can't say whether this was added after your comment.
Super cool, and really nice to see the continuous rapid progress of these models! I have to wonder how long-term state (building a base and coming back later) as well as potentially guided state (e.g. game rules that are enforced in traditional code, or multiplayer, or loading saved games, etc) will work.
It's probably not by just extending the context window or making the model larger, though that will of course help, because fundamentally external state and memory/simulation are two different things (right?).
Either way, it seems natural that these models will soon be used for goal-oriented imagination of a task. E.g., imagine a computer agent that needs to find a particular image on a computer: it would continuously imagine the path between what it currently sees and its desired state, and, unlike this model (which takes user input), it would imagine that too. In some ways, to the best of my understanding, this already happens with some robot control networks, except without pixels.
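That "continuously imagine the path" idea is essentially planning inside a learned world model; the simplest version is random-shooting model-predictive control. A hedged sketch under that assumption; every name here is a placeholder, not an existing API:

```python
import torch

def plan_with_world_model(world_model, encode, obs, goal_img, action_space,
                          horizon: int = 8, n_candidates: int = 64):
    """Sample candidate action sequences, roll each one forward inside the learned
    world model, score the imagined final frame by similarity to the goal image,
    and execute only the first action of the best sequence (then re-plan)."""
    best_score, best_first_action = -float("inf"), None
    for _ in range(n_candidates):
        actions = [action_space.sample() for _ in range(horizon)]
        imagined = obs
        for a in actions:                       # imagination happens in the model, not on the real screen
            imagined = world_model(imagined, a)
        score = torch.cosine_similarity(encode(imagined), encode(goal_img), dim=-1)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action
```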
There's not even the slightest hint of state in this demo: if you hold "turn left" for a full rotation you don't end up where you started. After a few rotations the details disappear and you're left in the middle of a blank ocean. There's no way this tech will ever make a playable version of Mario, let alone Minecraft.
There's plenty of evidence of state, just a very short-term memory. Examples:
- The inventory bar is mostly consistent throughout a play session
- State transitions in response to key presses
- Block breakage over time is mostly consistent
- Toggling doors / hatches works as expected
- Jumping progresses with correct physics
Turning around and seeing approximately the same thing you saw a minute ago is probably just a matter of extending a context window, but that will inherently have limits at the scale of an entire world, even if we somehow make context windows compress redundant data extremely well (which would greatly help LLM transformers too). And I guess what I'm mostly wondering about is how you would synchronize this state with a ground truth so that it can be shared between different instances of the agent, or with other non-agent entities.
And again, I think it's important to remember games are just great for training this type of technology, but it's probably more useful in non-game fields such as computer automation, robot control, etc.
>There's no way this tech will ever make a playable version of Mario
Wait a few months: if someone is willing to use their 4090 to train the model, the technology is already here. If you can play a level of Doom, then Mario should be even easier.
It's not a video game; it's a fast Minecraft screenshot simulator where the prompt for each frame is the state of the input plus the previous frames, with something resembling coherence.
So basically they trained a model on Minecraft. This is not generalist at all. It's not like the game comes from a prompt; it probably comes from a bunch of fine-tuning and giant datasets of Minecraft play.
Would love to see some work like this but with world/games coming from a prompt.
The waiting line is too long, so I gave up. Can anyone tell me: are the pixels themselves generated by the model, or does it just generate the environment, which is then rendered by "classical" means?
If it were to generate an environment rendered by classical means, it would most likely have object permanence instead of regenerating something new after briefly looking away: https://oasis-model.github.io/3_second_memory.webp
Every pixel is generated! User actions go in, pixels come out -- and there is only a transformer in the middle :)
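To spell that out, "user actions go in, pixels come out" amounts to an autoregressive next-frame loop, roughly like the sketch below; the `model` call and action encoding are simplified placeholders, not the real interface:

```python
import torch

@torch.no_grad()
def play(model, first_frame: torch.Tensor, read_user_action, n_steps: int = 600):
    """Autoregressive rollout: each new frame is generated from recent frames plus
    the player's current keyboard/mouse input -- no game engine, and no world state
    beyond what fits in the model's frame context."""
    frames = [first_frame]
    for _ in range(n_steps):
        action = read_user_action()            # e.g. encoded key presses + mouse delta
        context = torch.stack(frames[-16:])    # short context window -> short memory
        next_frame = model(context, action)    # every output pixel is generated
        frames.append(next_frame)
    return frames
```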
Why is this interesting? Today, not too interesting (Oasis v1 is just a POC). In the future (and by future I literally mean a few months away -- wait for future versions of Oasis coming out soon), imagine that every single pixel you see is generated, including the pixels you see as you're reading this message. Why is that interesting? It's a new interface for communication between humans and machines. It's like why LLMs are interesting for chat: they give humans and machines a way to interact that humans are already used to (chat). Here, computers will be able to see the world as we do and show things back to us in a way we are used to. TLDR: imagine telling your computer "create a pink elephant" and just seeing it pop up in a game you're playing.
Very odd sensation indeed.
> At the end of the day, it should provide the same "API" as a game engine does: creators develop worlds, users interact with those worlds.
I don't see how this does that, or is a step towards that. Help me see it?
> they can use Minecraft images as training data, because everyone (including Microsoft) is using proprietary data to train diffusion models
When a scientific work uses some work and does not credit it, it is academic dishonesty.
Sure, they could have trained the model on a different dataset. No matter which source was used, it should be cited.