Marble: A Multimodal World Model

I understand that DeepMind is working on this too: https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...

I wonder how their approaches and results compare?

echelon · a month ago

Genie delivers on-the-fly generated video that responds to user inputs in real time.

Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.

Marble seems useful for lots of use cases - 3D design, online games, etc. You pay the GPU cost to render once, then you can reuse it.

Genie seems revolutionary but expensive af to render and deliver to end users. You never stop paying boatloads of H100 costs (probably several H100s or TPU equivalents per user session) per second.

You could make a VRChat type game with Marble.

You could make a VRChat game with Genie, but only the billionaires could afford to play it.

To be clear, Genie does some remarkably cool things. You can prompt it, "T-Rex tap dancing by" and it'll appear animated in the world. I don't think any other system can do this. But the cost is enormous and it's why we don't have a playable demo.

When the cost of GPU compute comes down, I'm sure we'll all be steaming a Google Stadia like experience of "games" rendered on the fly. Multiplayer, with Hollywood grade visuals. Like playing real time Lord of the Rings or something wild.

Interestingly, there is a model like Google Genie that is open source and available to run on your local Nvidia desktop GPU. It's called DiamondWM [1], and it's a world model trained on FPS gameplay footage. It generates a 10 fps 160x160 image you can play through. Maybe we'll develop better models and faster techniques and the dream of local world models can one day be realized.

[1] https://diamond-wm.github.io/

corimaith · a month ago

Graphics have long reached diminishing returns in gameplay, people aren't going to playing VRChat tomorrow for the same reasons today.

AI can speed up asset development, but that is not a major bottleneck for video games, what matters is the creative game design and backend systems, which existing on the interaction between players and systems is just about as hard as any management role, if not harder.

jtfrench · a month ago

From what I can tell, you can actually export a mesh in (paid) Marble, whereas I haven't seen mesh exports offered in Genie 3 yet (could be wrong though).

Isn't this a Gaussian Splat model?

I work in AI and, to this day, I don't know what they mean by “world” in “world model”.

godelski · a month ago

  > I work in AI and, to this day, I don't know what they mean by “world” in “world model”.

I have a PhD in ML and a B.S. in physics. What people in ML call a "world model" seems incredibly strange to me. With my physics hat on, a "world model" is pretty clear. It is "a physics." Mind you, there is not one physics, there are competing models and we're just at a point of time that models have converged up to quantum and gravity.

But "a physics" can be a model that describes any world, not just the one we live in. For ML models, this should be based on the data they're processing. Ideally we'd want this to be similar to our own, but if it is modeling a "world" where pi=3, then that's still "a physics".

The key points here are that a physics is a counterfactual description of the environment. You have to have language to formalize relationships between objects. In standard physics (and most of science) we use math[0], though we use several languages (different algebras, different groups, different branches, etc) of math to describe different phenomena. But the point is that an equation is designed to be the maximum compression of that description. I don't really care if you use numbers or symbols, what matters is if you have counterfactual, testable, consistent, and concise descriptions of "the world".

Oddly enough, there are a lot of physicists and former physicists that work in ML but it is fairly uncommon for them to be working on "world modeling." I can tell you from my own experience talking to people who research world models that they respond to my concerns as "we just care if it works" as if that is also not my primary concern. Who the fuck isn't concerned with that? Philosophers?

[0] It can be easy to read "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" out of context as we're >60 years past where math has been the lingua Franca of science. But math is a language, we invented it, and it should be surprising that this language is so powerful that we can work out the motion of the stars from a piece of paper. Math is really the closest thing we have to a magical language https://web.archive.org/web/20210212111540/http://www.dartmo...

robotresearcher · a month ago

Broadly 'world' means 'the domain I'm interested in'. In current use in the DNN context 'world' tends to be physical space at a scale relevant to humans or robots (eg. autonomous vehicles). So when someone says 'world model' you have to ask 'what kind of world, and how is it represented?'.

We don't need to agree on one very specific meaning, which is good, because we would fail.

padolsey · a month ago

Yeh I still don't think there's a fixed definition of what a world model is or in what modality it will emerge. I'm unconvinced it will emerge as a satisfying 3d game-like first-person walkthrough.

voodooEntity · a month ago

Ye but there wont be, same as with "agi" and "ai" depends on whom you are asking *shrug

ProofHouse · a month ago

I think absolutely it will in a year

butifnot0701 · a month ago

but it sounds cool

keyle · a month ago

I'm floored. Incredible work.

also check out their interactive examples on the webapp. It's a bit more rough around the edges but shows real user input/output. Arguably such examples could be pushed further to better quality output.

e.g. https://marble.worldlabs.ai/world/b75af78a-b040-4415-9f42-6d...

e.g. https://marble.worldlabs.ai/world/cbd8d6fb-4511-4d2c-a941-f4...

wongarsu · a month ago

Unsurprisingly the results are by far the best in the area shown in the image in the prompt, and quickly deteriorate beyond it, or more than a couple meters behind the camera.

It's worlds better than just doing gaussian splats from images, but given how much the quality is influenced by images the limit to four images with text prompt or eight images without prompt is quite limiting. That's plenty to describe a chair, but almost nothing to describe a home or a space station. I hope they can extend those limits in future updates

moi2388 · a month ago

Very cool tech, extremely annoying voice over.

I like that they distinguish between the collider mesh (lower poly) and the detailed mesh (higher poly).

As a game developer I'm looking for:

• Export low-poly triangle mesh (ideally OBJ or FBX format — something fairly generic, nothing too fancy) • Export texture map • Export normals • Bonus: export the scene as "de-structured" objects (e.g. instead of a giant world mesh with everything baked into it, separate exports for foreground and background objects to make it more game engine-ready.

Gaussian splats are awesome, but not critical for my current renderers. Cool to have though.

ehnto · a month ago

Aren't the gausian splats the output here? Or are these worlds fully meshed and textured assets?

From my understanding, admittedly quite a shallow look so far, the model generates gaussian splats then from that could implement the collider.

I guess from the splat and the colliders you could generate actual assets that could be interactable/animated/have physics etc. Unsure, exciting space though! I just don't know how I would properly use this in a game, the examples are all quite on-rails and seem to avoid interacting too much with stuff in the environment.

lewispollard · a month ago

The page shows, near the bottom, how the main output is gaussian splats, but it can also generate triangular meshes (visual mesh + collider).

However, to my eye, the triangular meshes shown look pretty low quality compared to the splat: compare the triangulated books on the shelves, and the wooden chair by the door, as well as weird hole-like defects in the blanket by the fireplace.

It's also not clear if it's generating one mesh for the entire world, it looks like it is - that would make interactability and optimisation more difficult (no frustrum culling etc, though you could feasibly chop the mesh up into smaller pieces I suppose).

chaosprint · a month ago

Feifei is a great researcher. But to be honest, the progress her company has made in "world modeling" seems to deviate somewhat from what she has advertised, which is a bit disappointing. As this article (https://entropytown.com/articles/2025-11-13-world-model-lecu...) summarizes, she is mainly working on 3DGS applications. The problem is that, despite the substantial funding, this demo video clearly avoids the essentials; the camera movement is merely a panning motion. It's safe to assume that adding even a one-second extra second to each shot would drastically reduce the quality. It offers almost no improvement over the earliest 3DGS demo, let alone the addition of any characters.

Tadpole9181 · a month ago

I'm confused, the article talks about static generation. It creates a gaussian splat or models, which are rendered by an engine. This isn't a real-time model or a normal video generator like Sora, or am I misreading?

Is Marble's definition of a "world model" the same as Yann LeCun's definition of a world model? And is that the same as Genie's definition of a world model?

dandersch · a month ago

Pretty sure it's used as a marketing term here. They train on images that you generate/give it, but the output of that training is not a model, it's a static 3d scene made up out of gaussian splats. You are not running inference on a model when traversing one of those scenes, you are just rendering the splats.

v9v · a month ago

At the very least it differs greatly from "world model" as understood in earlier robotics and AI research, wherein it referred to a model describing all the details of the world outside the system relevant to the problem at hand.

dr_dshiv · a month ago

Very different, it would seem. Then again, it’s never been clear to me why LeCun believes that LLM architectures don’t inherently produce world models in the course of training.

aaroninsf · a month ago

Nor I.

IMO LLM more or less literally cannot do what they do without a world model, not least because much of what language is, is a protocol for making assertions about that model, testing the degree to which it is shared, and seeking to alter the model one carries of one's interlocutor's model.

To the "parrot people" I suggest, there is no more optimized mechanism for the inner layers of a network to approach than one which most parsimoniously models the world, so as to correctly emit tokens reflective of that.

msteffen · a month ago

pankajdoharey · a month ago

This is not a world model, this ise at best the reimplementation of the the NVIDIA prior art around NeRF / 3D Gaussian Splatting and monocular depth, wrapped in a nice product and workflow. What they’re actually shipping is an offline asset generator: you feed it text, images, or video, it runs depth/structure estimation and neural 3D reconstruction, and you get a static splat/mesh world you can then render or simulate in a real engine. That’s useful and impressive engineering, but it’s very different from a proper “world model” in the RL/embodied‑AI sense. Here there’s no online dynamics, no agent loop, and no interactive rollouts; it’s closer to a high‑end NeRF/GS pipeline plus tooling than to something like Google’s Genie/2/3, which actually couples generative rendering with action‑conditioned temporal evolution. Calling this a “world model” feels more like marketing language than a meaningful technical distinction. Infact my definition of a world model is more closer to what Demis has hinted in his discussions, that video gen models like veo are able to intuit they physics from just video trainingdata suggest that there is an underlying manifold in reality that is essentially computable and thus is being simulated by these models. Building such a model would essentially mean building a physics engine of some kind that predicts this manifold.

dvrp · a month ago

anilgulecha · a month ago

This is going to take movie making to another level because now we can: 1. Generate a full scene 2. Generate a character with specific movements.

Combine these 2, and we can have moving cameras as needed (long takes). This is going to make storytelling very expressive.

Incredible times! Here's a bet: We'll have a AI superstar (A-list level) in the next 12 months.

An a-list level actor superstar within 12 months?

I’m willing to take that bet. Name any amount you’re willing to lose.

Before you agree: movies take more than 1 year to make and get published, and it takes more than 1 movie to make somebody an a-lister

Fair warning, when I last put up a bet in AI video arena, I won! https://www.linkedin.com/posts/anilgulecha_kitsune-activity-...

Same terms - gentlemen's agreement. The loser owes the winner a meal whenever they meet :). For a HN visitor to blore, I'll happy to host a meal anyway :)

qwertytyyuu · a month ago

the last few % that make a a-list actor an a-list actor is the hardest part, i would bet you that its going to take longer than 12 months

specproc · a month ago

Hard disagree. CG in films is awful when done cheaply, and this all looks like really cheap CG.

turnsout · a month ago

Yeah, it's not quite there yet, but think of this as Stable Diffusion 1, or DALL-E 1/2. It's hard to imagine this not being a part of the VFX workflow within 5 years.

hitarpetar · a month ago

> This is going to make storytelling very expressive.

finally! we should come up for a term for this new tech... maybe computer generated imagery?