Video generation models as world simulators

I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_.

Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on well each prediction models the actual outcome and I think you're _really_ close to AGI.

You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one.

It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?

Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates a image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.

margorczynski · 2 years ago

You're talking about an agent with a world model used for planning. Actually generating realistic images is not really needed as the world model operates in its own compressed abstraction.

Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...

shreezus · 2 years ago

V-Jepa is actually super impressive. I have nothing but respect for Yann LeCun & his team, they really have been on a rampage lately.

Buttons840 · 2 years ago

> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on well each prediction models the actual outcome and I think you're _really_ close to AGI.

In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q learning, feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've know this since... the 70's maybe? I don't know how far Q-learning goes back.

I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.

I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough that game creators are able to implement it in a practical way. Yes, maybe, maybe, a team with experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the video predict the best moves before the AI in the game is able to predict the best moves?

corimaith · 2 years ago

Tbh I don't think an AI for Civ would that impressive, my experience is that most of time you can get away with making locally optimal decisions I.e growing your economy and fighting weaker states. The problem with current civ AI is that their economies can be often structured nonsensically, but optimized economies is usually just the matter of stacking bonuses together into specialized production zones, which can be solved via conventional algorithms.

bigyikes · 2 years ago

I’ve been dying for someone to make a Civilization AI.

It might not be too crazy of an idea - would love to see a model fine-tuned on sequences of moves.

The biggest limitation of video game AI currently is not theory, but hardware. Once home compute doubles a few more times, we’ll all be running GPT-4 locally and a competent Civilization AI starts to look realistic.

YeGoblynQueenne · 2 years ago

>> Usually I see some signs of learning, but it fails to come up with anything spectacular.

And even if it succeeds, it fails again as soon as you change the environment because RL doesn't generalise. At all. It's kind of shocking to be honest.

https://robertkirk.github.io/2022/01/17/generalisation-in-re...

LarsDu88 · 2 years ago

What I find interesting is that b/c we have so much video data, we have this thing that can project the future in 2d pixel space.

Projecting into the future in 3d world space is actually what the endgame for robotics is and I imagine depending on how complex that 3d world model is, a working model for projecting into 3d space could be waaaaaay smaller.

It's just that the equivalent data is not as easily available on the internet :)

uoaei · 2 years ago

That's what estimation and simulation is for. Obviously that's not what's happening in TFA but it's perfectly plausible today.

Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because obviously it can't and anyone with any experience training such models would recognize instantly the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model borne of human cognitive faculties and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent to the inherent dynamics at play.

pzo · 2 years ago

Depth estimation improved a lot as well e.g. with Depth-Anything [0]. But those are mostly relative depth instead of metric. Also when even converted to metric they still seems have a lot of pointclouds at the edges that have to be pruned - visible in this blog [1]. Looks like those models trained on Lidar or Stereo depthmaps that has this limitations. I think we don't have enough clean training data for 3d unless we maybe train on synthetic data (then we can have plenty, generate realistic scene in Unreal Engine 5 and train on rendered 2d frames)

[0] https://github.com/LiheYoung/Depth-Anything

[1] https://medium.com/@patriciogv/the-state-of-the-art-of-depth...

samus · 2 years ago

There are also models that are trained to generate 3D models from a picture. Use it on videos, and also train it on output generated by video games.

mgoetzke · 2 years ago

imagine it going a few dimensions further, what will happen when i tell this person 'this'. how will this affect the social graph and my world state :)

YeGoblynQueenne · 2 years ago

>> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on well each prediction models the actual outcome and I think you're _really_ close to AGI.

As another comment points out that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1] though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So, either 1 out of 2 right, or both wrong, one way or another.

For me, I've been in AI long enough to remember many such breakthroughs that would lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes many more than an engineering breakthrough to get to AGI, otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.

But, enough with my cynicism. You think that idea can work? Try it out. In a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2] and try to implement your idea, in-vitro, as it were. See how well it works. You could write a paper about it.

____________________

[1] https://openreview.net/pdf?id=BZ5a1r-kVsf

[2] Obviously don't start with Nethack itself because that's damn hard for "AI".

nopinsight · 2 years ago

I totally agree that a system like Sora is needed. By itself, it’s insufficient. With a multimodal model that can reason properly, then we get AGI or rather ASI (artificial super intelligence) due to many advantages over humans such as context length, access to additional sensory modalities (infrared, electroreception, etc), much broader expertise, huge bandwidth, etc.

future successor to Sora + likely successor to GPT-4 = ASI

See my other comment here: https://news.ycombinator.com/item?id=39391971

semi-extrinsic · 2 years ago

I call bullshit.

A key element of anything that can be classified as "general intelligence" is developing internally consistent and self-contained agency, and then being able to act on that. Today we have absolutely no idea of how to do this in AI. Even the tiniest of worms and insects demonstrate capabilities several orders of magnitude beyond what our largest AIs can.

We are about as close to AGI as James Watt was to nuclear fusion.

jiggawatts · 2 years ago

Adding to this: Sora was most likely trained on video that's more like what you'd normally see on YouTube or in a clip art or media licensing company collection. Basically, video designed to look good as a part of a film or similar production.

So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc... all much like what you'd expect to see in an edited film.

Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".

Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that record everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system as well as the transcribed audio as the text prompt.

Ta-da.

You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.

Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.

verticalscaler · 2 years ago

A 3d model with object permanency is definitely a step in the right direction of something or other but for clarity let us dial back down the level of graphical detail.

A Pacman bot is not AGI. Might get it to eat all the dots correctly where as before if something scrolled off the screen it'd forget about it and glitch out - but you didn't fan any flames of consciousness into existence as of yet.

Solvency · 2 years ago

Is a human that manages to eat all the skittles and walk without falling into deadly holes AGI? Why?

coffeebeqn · 2 years ago

The flip side of video or image gen is always video or image identification. If video gets really good then an AI can have quite an accurate visual view into the world in real time

zmgsabst · 2 years ago

That’s how we think:

Imagine where you want to be (eg, “I scored a goal!”) from where you are now, visualize how you’ll get there (eg, a trick and then a shot), then do that.

adi_pradhan · 2 years ago

Thanks for adding the specific case. I think with testing these sort of limited domain applications make sense.

It'll be much harder for more open ended world problems where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example a glass smashing into the floor. The model doesn't simulate that causally afaik

therein · 2 years ago

> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting

There was that article a few months ago about how basically that's what the cerebellum does.

mdorazio · 2 years ago

FWIW, you've basically described at a high level exactly what autonomous driving systems have been doing for several years. I don't think anyone would say that Waymo's cars are really close to AGI.

neom · 2 years ago

Figure out how to incorporate a quantum computer as a prediction engine in this idea, and you've got quite the robot on your hands. :)

(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)

liamYC · 2 years ago

This comment is brilliant. Thank you. I’m so excited now to build a bot that uses predictive video. I wonder what the most simple prototype would be? Surely one that has a simple validation loop that can say hey, this predicted video became true. Perhaps a 2D infinite scrolling video game?

anirudhv27 · 2 years ago

Imagine having real-time transfer of characteristics within your world in a VR/mixed reality setup. Automatically generating new views within the environment you are currently in could create pretty interesting experiences.

metabagel · 2 years ago

This sounds like it has military applications, not that I’m excited at the prospect.

pyinstallwoes · 2 years ago

So basically a brain in a vat, reality as we experience it, our thoughts as prompts.

deadbabe · 2 years ago

Imagine putting on some AR goggles

staring at a painting in a Museum

Then immediately jumping into an entire VR world based off the painting generated by an AI rendering it out on the fly

jimmySixDOF · 2 years ago

BlockadeLabs has been doing a 3D text to skybox and not exactly runtime at the moment but I have seen it work in a headset and it definitely feels like the future.

blueprint · 2 years ago

how would you define AGI?

aurareturn · 2 years ago

Sounds like simulation theory is closer and closer to being proven.

littlestymaar · 2 years ago

Our ability to build somewhat convincing simulations of thing has never been a proof of living in a simulation…

fnordpiglet · 2 years ago

Except there is always an original at the root. There’s no way to prove that’s not us.

I like that this one shows some "fails", and not just the top of the top results:

For example, the surfer is surfing in the air at the end:

https://cdn.openai.com/tmp/s/prompting_7.mp4

Or this "breaking" glass that does not break, but spills liquid in some weird way:

https://cdn.openai.com/tmp/s/discussion_0.mp4

Or the way this person walks:

https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...

Or wherever this map is coming from:

https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...

chkaloon · 2 years ago

I've also noticed on some of the featured videos that there are some perspective/parallax errors. The human subjects in some are either oversized compared to background people, or they end up on horizontal planes that don't line up properly. It's actually a bit vertigo-inducing! It is still very remarkable

danans · 2 years ago

> Or wherever this map is coming from:

> https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...

Notice also that that a roughly 6 seconds there is a third hand putting the map away.

mr_toad · 2 years ago

> For example, the surfer is surfing in the air at the end

Maybe it’s been watching snowboarding videos and doesn’t quite understand the difference.

SiempreViernes · 2 years ago

> Or the way this person walks:

> https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...

Also, why does she have a umbrella sticking out from her lower back?

zenon · 2 years ago

I suppose the lady usually has an umbrella in this kind of situation, so it felt the umbrella should be included in some way: https://youtu.be/492tGcBP568

sega_sai · 2 years ago

That is creepy...

SiempreViernes · 2 years ago

Take a look at these adorable kangaroos to relax:

https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-bl...

https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-...

hackerlight · 2 years ago

Where do you find the last two?

SushiHippie · 2 years ago

Part of this website changes after the video finished and switches to the next video. There is no way to control it. These are both "X wearing Y taking a pleasant stroll in Z during W"

coffeebeqn · 2 years ago

The hyper realistic and plausible movement of the glass breaking makes this bizarrely fascinating. And it doesn’t give me the feeling of disgust the motion in the more primitive AI models did

Deleted Comment