"If you were dreaming in Minecraft" is the impression that I get. It feels very much like a dream with the lack of object permanence. Also interesting is light level. If you stare at something dark for a while or go "underwater" and get to the point where the screen is black, it's difficult to get back to anything but a black screen. (I didn't manage it in my one playthrough)
I don't see how you design and ship a game like this. You can't design a game by setting model weights directly. I do see how you might clone a game, eventually without all the missing things like object permanence and other long-term state. But the inference engine is probably more expensive to run than the game engine it (somewhat) emulates.
What is this tech useful for? Genuine question from a long-time AI person.
Yep! Which is why a key point for our next models is to get to a state where you can "code" a new world using "prompting". I agree that these tools become insanely useful only once there is a very good way for creators to "develop" new worlds/games on top of these systems, and users can then interact with those worlds.
At the end of the day, it should provide the same "API" as a game engine does: creators develop worlds, users interact with those worlds. The nice thing is that if AI can actually fill this role, then it would be:
1. Potentially much easier to create worlds/games (you could just "talk" to the AI -- "add a flying pink elephant here")
2. Users could interact with a world that could change to fit each game session -- this is truly infinite worlds
Last point: are we there yet? Ofc not! Oasis v1 is a first POC. Wait just a bit more for v2 ;)
Obviously this tool is not going to generate a "ship"pable game for you. AI is a long way off from that. As for "design", I don't find it very hard to see how incredibly useful being able to rapidly prototype a game would be, even if it requires massive GPU usage. And papers like these are only stepping stones to getting there.
I found the visual artifacts annoying. I wonder if anyone has trained models on pre-rasterization game engine output like meshes/materials, camera state, or even just the raw OpenGL calls. An AI that generates inputs to an actual renderer/engine would solve the visual-fidelity problem.
> I don't see how you design and ship a game like this. You can't design a game by setting model weights directly. I do see how you might clone a game
Easy. You do the same thing that we did with AI images, except with video game world models.
I.e., you combine multiple of them, taking bits and pieces of each game's "world model"; put together, that's almost like creating an entirely new game.
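For what it's worth, one concrete reading of "the same thing that we did with AI images" is checkpoint merging, the way people blend fine-tuned image-model checkpoints. A minimal sketch of that idea, assuming two world-model checkpoints with identical architectures (the function and file names are made up for illustration, not anything Oasis ships):

```python
import torch

def merge_world_models(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints that share an architecture.

    alpha = 0.0 keeps model A, alpha = 1.0 keeps model B; values in between
    blend the two learned "world models".
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {
        name: torch.lerp(state_a[name].float(), state_b[name].float(), alpha)
        for name in state_a
    }

# e.g. blend a Minecraft-style checkpoint with one trained on another game:
# merged = merge_world_models(torch.load("minecraft.pt"), torch.load("racer.pt"), alpha=0.3)
```

Whether a naive weight blend actually yields a coherent hybrid game is an open question; with image models people also reach for adapters or per-layer merge weights rather than a single global alpha.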
> eventually without all the missing things like object permanence and other long-term state.
Well, just add in those other things with a much smaller set of variables. You are already sending the whole previous frame, plus the user input, into the model; I see no reason why you couldn't send in a simplified game state as well.
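Concretely, that just means widening the conditioning: instead of (previous frames, user input), the model would also get a compact, explicitly tracked game state. A rough sketch of what that extra input could look like; all the field names and the encoding are hypothetical, not the actual Oasis interface:

```python
from dataclasses import dataclass
import torch

@dataclass
class StepConditioning:
    """Everything the next-frame model would be conditioned on for one step."""
    prev_frames: torch.Tensor      # (T, C, H, W) recent generated frames
    actions: torch.Tensor          # (T, A) encoded keyboard/mouse input per frame
    # the "simplified game state" suggested above, tracked outside the model:
    player_pos: tuple = (0.0, 0.0, 0.0)   # world coordinates
    time_of_day: float = 0.0              # 0..1, so night doesn't drift into day
    inventory_size: int = 0

def to_model_inputs(cond: StepConditioning) -> dict:
    """Flatten the explicit state into a small vector the transformer can attend to."""
    state_vec = torch.tensor([*cond.player_pos, cond.time_of_day, float(cond.inventory_size)])
    return {"frames": cond.prev_frames, "actions": cond.actions, "state": state_vec}
```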
I think it's an interesting tech demo. You're right that as-is it's not useful. Here are some long-term things I could imagine:
1. Scale it up so that it has a longer context length than a single frame. If it could observe the last million frames, for example, that would allow significantly more temporal consistency.
2. RAG-style approaches. Generate a simple room-by-room level map (basically just empty bounding boxes), and allow the model to read the map as part of its input rather than simply looking at frames. And when your character is in a bounding box the model has generated before, give N frames of that earlier generation as context for the current frame generation (perhaps specifically the frames with the same camera direction, or the closest to it). That would probably result in near-perfect temporal consistency even over very long generations and timeframes, assuming the frame context length was long enough. (A toy sketch of this retrieval idea follows the list.)
3. Train on larger numbers of games, and text modalities, so that you can describe a desired game and get something similar to your description (instead of needing to train on a zillion Minecraft runs just to get... Minecraft.)
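To make point 2 concrete, here is a toy sketch of the retrieval side: cache generated frames per bounding box, keyed by camera yaw, and pull the closest matches back in as extra context whenever the player re-enters that box. Everything here (the class, its fields, keying on yaw) is illustrative, not from the Oasis work:

```python
from collections import defaultdict

class FrameCache:
    """Per-bounding-box cache of previously generated frames."""

    def __init__(self, n_context: int = 8):
        self.n_context = n_context
        self.by_box = defaultdict(list)   # box_id -> [(camera_yaw_degrees, frame), ...]

    def add(self, box_id: int, yaw: float, frame) -> None:
        self.by_box[box_id].append((yaw, frame))

    def retrieve(self, box_id: int, yaw: float) -> list:
        """Return up to n_context cached frames from this box, closest in camera direction."""
        def angular_dist(a: float, b: float) -> float:
            d = abs(a - b) % 360.0
            return min(d, 360.0 - d)

        candidates = self.by_box.get(box_id, [])
        ranked = sorted(candidates, key=lambda item: angular_dist(item[0], yaw))
        return [frame for _, frame in ranked[: self.n_context]]

# During generation: prepend cache.retrieve(current_box, camera_yaw) to the model's
# frame context so a revisited room stays consistent with what was drawn before.
```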
That being said, I think in the near term it'll be much more fruitful to generate game assets and put them in a traditional game engine (or generate assets and have an LLM generate code to place them in an engine) rather than trying to go end-to-end from keyboard+mouse input to video frames with nothing structured in between.
Eventually the end-to-end model will probably win unless scaling limits get hit, as per the Bitter Lesson [1], but that's a long eventually, and TBH at that scale there really may just be fundamental scaling issues compared to assets+code approaches.
It's still pretty cool though! And it seems useful from a research perspective to show what can already be done at the current scale. There's also no guarantee the scaling limits will exist such that this is impossible; betting against scaling LLMs during the GPT-2 era would've been a bad bet. Games in particular are very nice to train against, since you can just run the game on a computer and get near-infinite ground-truth data synthetically. You could probably also do some very clever things to get better-than-an-engine results by training on real-life video with faked keyboard/mouse inputs, such that eventually you'd be better in terms of both graphics and physics than a game engine could ever hope to be.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
I ctrl-F'ed the webpage and saw zero occurrences of "Minecraft". Why? This isn't a video game; it's a poor copy of a real video game you didn't even bother to name, let alone credit.
It seems like an interesting attempt to get around legal issues.
They can't say "Minecraft" because that's a Microsoft trademark, but they can use Minecraft images as training data, because everyone (including Microsoft) is using proprietary data to train diffusion models. There's the issue that the outputs obviously resemble Minecraft, but Microsoft has its own problems with Bing and DALL-E generating images that obviously resemble trademarked things (despite guardrails).
Avoiding legal and ethical issues. This stuff was made by a bunch of real people, and people still get their names in movie and game credits even if they got a paycheck for the work. Microsoft shamelessly vacuuming up proprietary content didn't change the norms of how people get credited in these mediums. It's sad to see how thoughtlessly so many people using generative AI dismiss the models' source material as "data", while the models (and therefore their creators) almost always get prominently credited for putting in a tiny fraction of the effort. The dubious ethical defense against crediting source works (that models learn about media the same way humans do and adapt it to suit their purposes) is obliterated when a model is trained on one work in order to reproduce that work. Equating this to generating an image on Midjourney is a blatant example of a common practice: people want to get credit for other people's work, but when it's time to take responsibility the way a human artist would have to, "The machine did it! It's not my responsibility!"
There is one mention of Minecraft in the second paragraph of the Architecture section, "...We train on a subset of open-source Minecraft video data collected by OpenAI[9]." I can't say whether this was added after your comment.
Super cool, and really nice to see the continuous rapid progress of these models! I have to wonder how long-term state (building a base and coming back later) as well as potentially guided state (e.g. game rules that are enforced in traditional code, or multiplayer, or loading saved games, etc) will work.
It's probably not by just extending the context window or making the model larger, though that will of course help, because fundamentally external state and memory/simulation are two different things (right?).
Either way, it seems natural that these models will soon be used for goal-oriented imagination of a task. E.g., imagine a computer agent that needs to find a particular image on a computer: it would continuously imagine the path between what it currently sees and its desired state, and, unlike this model (which takes user input), it would imagine that too. In some ways, to the best of my understanding, this already happens with some robot control networks, except without pixels.
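That "continuously imagine the path" idea is essentially planning inside a learned world model; the simplest version is random-shooting model-predictive control. A hedged sketch under that assumption; every name here is a placeholder, not an existing API:

```python
import torch

def plan_with_world_model(world_model, encode, obs, goal_img, action_space,
                          horizon: int = 8, n_candidates: int = 64):
    """Sample candidate action sequences, roll each one forward inside the learned
    world model, score the imagined final frame by similarity to the goal image,
    and execute only the first action of the best sequence (then re-plan)."""
    best_score, best_first_action = -float("inf"), None
    for _ in range(n_candidates):
        actions = [action_space.sample() for _ in range(horizon)]
        imagined = obs
        for a in actions:                       # imagination happens in the model, not on the real screen
            imagined = world_model(imagined, a)
        score = torch.cosine_similarity(encode(imagined), encode(goal_img), dim=-1)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action
```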
There's not even the slightest hint of state in this demo: if you hold "turn left" for a full rotation you don't end up where you started. After a few rotations the details disappear and you're left in the middle of a blank ocean. There's no way this tech will ever make a playable version of Mario, let alone Minecraft.
There's plenty of evidence of state, just a very short-term memory. Examples:
- The inventory bar is mostly consistent throughout a play session
- State transitions in response to key presses
- Block breakage over time is mostly consistent
- Toggling doors / hatches works as expected
- Jumping progresses with correct physics
Turning around and seeing approximately the same thing you saw a minute ago is probably just a matter of extending a context window, but that will inherently have limits at the scale of an entire world, even if we somehow make context windows compress redundant data extremely well (which would greatly help LLM transformers too). And I guess what I'm mostly wondering about is how you would synchronize this state with a ground truth so that it can be shared between different instances of the agent, or with other non-agent entities.
And again, I think it's important to remember games are just great for training this type of technology, but it's probably more useful in non-game fields such as computer automation, robot control, etc.
>There's no way this tech will ever make a playable version of Mario
Wait a few months: if someone is willing to use their 4090 to train the model, the technology is already here. If you can play a level of Doom, then Mario should be even easier.
It's not a video game; it's a fast Minecraft screenshot simulator where the prompt for each frame is the state of the input plus the previous frames, with something resembling coherence.
So basically they trained a model on Minecraft. This is not generalist at all. It's not like the game comes from a prompt; it probably comes from a bunch of fine-tuning and giant datasets of Minecraft play.
Would love to see some work like this but with world/games coming from a prompt.
The waiting line is too long, so I gave up. Can anyone tell me: are the pixels themselves generated by the model, or does it just generate the environment, which is then rendered by "classical" means?
If it were to generate an environment rendered by classical means, it would most likely have object permanence instead of regenerating something new after briefly looking away: https://oasis-model.github.io/3_second_memory.webp
Every pixel is generated! User actions go in, pixels come out -- and there is only a transformer in the middle :)
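To spell that out, "user actions go in, pixels come out" amounts to an autoregressive next-frame loop, roughly like the sketch below; the `model` call and action encoding are simplified placeholders, not the real interface:

```python
import torch

@torch.no_grad()
def play(model, first_frame: torch.Tensor, read_user_action, n_steps: int = 600):
    """Autoregressive rollout: each new frame is generated from recent frames plus
    the player's current keyboard/mouse input -- no game engine, and no world state
    beyond what fits in the model's frame context."""
    frames = [first_frame]
    for _ in range(n_steps):
        action = read_user_action()            # e.g. encoded key presses + mouse delta
        context = torch.stack(frames[-16:])    # short context window -> short memory
        next_frame = model(context, action)    # every output pixel is generated
        frames.append(next_frame)
    return frames
```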
Why is this interesting? Today, not too interesting (Oasis v1 is just a POC). In the future (and by future I literally mean a few months away -- wait for future versions of Oasis coming out soon), imagine that every single pixel you see is generated, including the pixels you see as you're reading this message. Why is that interesting? It's a new interface for communication between humans and machines. It's like why LLMs are interesting for chat: they give humans and machines a way to interact that humans are already used to (chat). Here, computers will be able to see the world as we do and show things back to us in a way we are used to. TLDR: imagine telling your computer "create a pink elephant" and just seeing it pop up in a game you're playing.
Very odd sensation indeed.
> At the end of the day, it should provide the same "API" as a game engine does: creators develop worlds, users interact with those worlds.
I don't see how this does that, or is a step towards that. Help me see it?
> they can use Minecraft images as training data, because everyone (including Microsoft) is using proprietary data to train diffusion models
When a scientific work uses some work and does not credit it, it is academic dishonesty.
Sure, they could have trained the model on a different dataset. No matter which source was used, it should be cited.