empath-nirvana · 2 years ago
I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_?

Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.

You can probably already imagine different ways to wire the output to text generation and to controlling its own motions, have it predict the outcomes of actions it could plausibly take itself, and choose the best one.

It doesn't actually have to generate realistic, mistake-free, or high-definition imagery to be used in that way. How realistic is our own imagination of the world?

Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
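
Roughly, the loop I have in mind looks like the toy sketch below. Everything here is made up for illustration (a 1-D "room" instead of video frames, a hand-written world model instead of a learned one); it just shows the imagine -> act -> error-correct cycle:

```python
# Toy sketch of the imagine/act/correct loop described above. The "world model"
# and environment are stand-ins, not anything Sora-like.
import random

GOAL = 9  # the imagined "cleaned room" state

def step(state, action):
    """Ground-truth environment: move along a line, with an occasional slip."""
    slip = random.choice([0, 0, 1])
    return max(0, min(10, state + action + slip))

def predict(state, action, bias):
    """The robot's internal world model: same dynamics plus a learned bias term."""
    return max(0, min(10, state + action + bias))

state, bias = 0, 0.0
for t in range(40):
    # "Imagine" each candidate action's outcome and pick the one whose
    # predicted future is closest to the goal.
    action = min([-1, 0, 1], key=lambda a: abs(predict(state, a, bias) - GOAL))
    predicted = predict(state, action, bias)

    state = step(state, action)        # act in the real world
    bias += 0.1 * (state - predicted)  # error-correct the model from the gap

    if state == GOAL:
        print(f"reached the goal at step {t}")
        break
```

Swap the integers for video frames and the hand-written predictor for a Sora-like generator and you get the cleaning-robot loop above, several times a second.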

margorczynski · 2 years ago
You're talking about an agent with a world model used for planning. Actually generating realistic images is not really needed as the world model operates in its own compressed abstraction.

Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...

shreezus · 2 years ago
V-Jepa is actually super impressive. I have nothing but respect for Yann LeCun & his team, they really have been on a rampage lately.
Buttons840 · 2 years ago
> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.

In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of those reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since... the 70's maybe? I don't know how far Q-learning goes back.
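
For anyone who hasn't seen it, the whole of (tabular) Q-learning fits in a few lines; the chain world below is a made-up toy, and a deep-RL version would just swap the table for a network:

```python
# Minimal tabular Q-learning on a toy chain world: predict future reward,
# act, then nudge the prediction toward what actually happened.
import random

N_STATES, ACTIONS = 6, [-1, +1]               # walk a chain; reward at the right end
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        if random.random() < eps or Q[s][0] == Q[s][1]:
            a = random.randrange(2)           # explore (or break ties randomly)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2 = max(0, min(N_STATES - 1, s + ACTIONS[a]))
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Core update: move the reward prediction toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

for s, (q_left, q_right) in enumerate(Q):
    print(f"state {s}: left={q_left:.2f} right={q_right:.2f}")
```

The update really is that simple, which is part of why it keeps feeling like it should just work.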

I like to do experiments with reinforcement learning and it's always exciting to think, "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem" -- and the thing is, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.

I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough for game creators to actually implement. Yes, maybe, maybe, a team of experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the video model predict the best moves before the AI in the game is able to?

corimaith · 2 years ago
Tbh I don't think an AI for Civ would be that impressive; my experience is that most of the time you can get away with making locally optimal decisions, i.e. growing your economy and fighting weaker states. The problem with the current Civ AI is that its economies are often structured nonsensically, but optimizing an economy is usually just a matter of stacking bonuses together into specialized production zones, which can be solved via conventional algorithms.
bigyikes · 2 years ago
I’ve been dying for someone to make a Civilization AI.

It might not be too crazy of an idea - would love to see a model fine-tuned on sequences of moves.

The biggest limitation of video game AI currently is not theory, but hardware. Once home compute doubles a few more times, we'll all be running GPT-4-class models locally, and a competent Civilization AI starts to look realistic.

YeGoblynQueenne · 2 years ago
>> Usually I see some signs of learning, but it fails to come up with anything spectacular.

And even if it succeeds, it fails again as soon as you change the environment because RL doesn't generalise. At all. It's kind of shocking to be honest.

https://robertkirk.github.io/2022/01/17/generalisation-in-re...

LarsDu88 · 2 years ago
What I find interesting is that because we have so much video data, we have this thing that can project the future in 2D pixel space.

Projecting into the future in 3D world space is actually the endgame for robotics, and depending on how complex that 3D world model has to be, I imagine a working model for projecting into 3D space could be waaaaaay smaller.

It's just that the equivalent data is not as easily available on the internet :)

uoaei · 2 years ago
That's what estimation and simulation are for. Obviously that's not what's happening in TFA, but it's perfectly plausible today.

Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because obviously it isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model born of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.
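
A toy version of that parametrization argument (nothing Sora-specific, just a made-up ballistic example): a model whose parameters match the underlying dynamics extrapolates well from a few noisy points, while an over-parameterized fit in "raw" coordinates falls apart outside the training window:

```python
# Fit noisy constant-acceleration data two ways and compare extrapolation:
# a quadratic (the "natural" parametrization) vs. a high-degree polynomial.
import numpy as np

rng = np.random.default_rng(0)
t_train = np.linspace(0, 1, 15)
y_train = 2.0 + 5.0 * t_train - 0.5 * 9.81 * t_train**2 + rng.normal(0, 0.05, t_train.size)

physics_fit = np.polyfit(t_train, y_train, 2)   # matches the true model class
flexible_fit = np.polyfit(t_train, y_train, 9)  # "just add parameters"

t_test = np.linspace(1, 2, 5)                   # extrapolation region
y_true = 2.0 + 5.0 * t_test - 0.5 * 9.81 * t_test**2
err_physics = np.abs(np.polyval(physics_fit, t_test) - y_true).max()
err_flexible = np.abs(np.polyval(flexible_fit, t_test) - y_true).max()
print(f"max extrapolation error: quadratic {err_physics:.2f}, degree-9 {err_flexible:.2f}")
```

Pixel space is, of course, a far more extreme mismatch than a degree-9 polynomial.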

pzo · 2 years ago
Depth estimation has improved a lot as well, e.g. with Depth-Anything [0]. But those models mostly produce relative rather than metric depth. Even when converted to metric, they still seem to produce a lot of stray point-cloud points at object edges that have to be pruned (visible in this blog post [1]; a rough sketch of that pruning step follows the links below). It looks like those models were trained on lidar or stereo depth maps, which have this limitation. I don't think we have enough clean training data for 3D unless we train on synthetic data (then we can have plenty: generate realistic scenes in Unreal Engine 5 and train on the rendered 2D frames).

[0] https://github.com/LiheYoung/Depth-Anything

[1] https://medium.com/@patriciogv/the-state-of-the-art-of-depth...
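
The edge pruning mentioned above usually amounts to dropping points where the depth map changes too sharply between neighbouring pixels ("flying pixels"). A minimal sketch, with a synthetic depth map standing in for real model output and an arbitrary threshold:

```python
# Prune depth-edge points before building a point cloud.
import numpy as np

# Synthetic relative-depth map with a hard foreground/background boundary;
# a real one would come from a monocular depth model.
depth = np.ones((64, 64), dtype=np.float32)
depth[:, 32:] = 3.0

# Large depth gradients mark the smeared points at object edges.
gy, gx = np.gradient(depth)
edge_strength = np.hypot(gx, gy)
keep = edge_strength < 0.5            # threshold is arbitrary and scene-dependent

ys, xs = np.nonzero(keep)
points = np.stack([xs, ys, depth[ys, xs]], axis=1)   # crude pixel-space point cloud
print(f"kept {points.shape[0]} of {depth.size} pixels")
```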

samus · 2 years ago
There are also models that are trained to generate 3D models from a picture. Use them on videos, and also train them on output generated by video games.
mgoetzke · 2 years ago
Imagine it going a few dimensions further: what will happen when I tell this person 'this'? How will it affect the social graph and my world state? :)
YeGoblynQueenne · 2 years ago
>> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.

As another comment points out, that's Yann LeCun's idea of "Objective-Driven AI", introduced in [1], though not named that in the paper (LeCun has used the name in talks and slides). LeCun has also said that this won't be achieved with generative models. So, either one out of two right, or both wrong, one way or another.

For me, I've been in AI long enough to remember many such breakthroughs that were going to lead to AGI -- from Deep Blue (actually) to CNNs, to deep RL, to LLMs just now, etc. Either all of those were not the breakthroughs people thought at the time, or it takes much more than an engineering breakthrough to get to AGI; otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.

But, enough with my cynicism. You think that idea can work? Try it out. In a simplified environment. Take some stupid grid world, a simplification of a text-based game like NetHack [2], and try to implement your idea, in vitro, as it were. See how well it works. You could write a paper about it.

____________________

[1] https://openreview.net/pdf?id=BZ5a1r-kVsf

[2] Obviously don't start with NetHack itself because that's damn hard for "AI".

nopinsight · 2 years ago
I totally agree that a system like Sora is needed. By itself, it's insufficient. Combined with a multimodal model that can reason properly, we get AGI, or rather ASI (artificial superintelligence), due to many advantages over humans such as context length, access to additional sensory modalities (infrared, electroreception, etc.), much broader expertise, huge bandwidth, etc.

future successor to Sora + likely successor to GPT-4 = ASI

See my other comment here: https://news.ycombinator.com/item?id=39391971

semi-extrinsic · 2 years ago
I call bullshit.

A key element of anything that can be classified as "general intelligence" is developing internally consistent and self-contained agency, and then being able to act on it. Today we have absolutely no idea how to do this in AI. Even the tiniest of worms and insects demonstrate capabilities several orders of magnitude beyond what our largest AIs can manage.

We are about as close to AGI as James Watt was to nuclear fusion.

jiggawatts · 2 years ago
Adding to this: Sora was most likely trained on video that's more like what you'd normally see on YouTube or in a stock footage / media licensing company's collection. Basically, video designed to look good as part of a film or similar production.

So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc... all much like what you'd expect to see in an edited film.

Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".

Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that records everything inside the work area. Run that in a bunch of warehouses of varying sizes, contents, and forklifts, and then pump all of it through this architecture to train it. Include the instructions given to the staff from the ERP system, as well as the transcribed audio, as the text prompt.

Ta-da.

You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.

Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.

verticalscaler · 2 years ago
A 3D model with object permanence is definitely a step in the right direction of something or other, but for clarity let us dial the level of graphical detail back down.

A Pac-Man bot is not AGI. You might get it to eat all the dots correctly, whereas before, if something scrolled off the screen, it would forget about it and glitch out -- but you haven't fanned any flames of consciousness into existence just yet.

Solvency · 2 years ago
Is a human that manages to eat all the skittles and walk without falling into deadly holes AGI? Why?
coffeebeqn · 2 years ago
The flip side of video or image generation is always video or image identification. If video gets really good, then an AI can have a quite accurate visual view of the world in real time.
zmgsabst · 2 years ago
That’s how we think:

Imagine where you want to be (e.g., "I scored a goal!") from where you are now, visualize how you'll get there (e.g., a trick and then a shot), then do that.

adi_pradhan · 2 years ago
Thanks for adding the specific case. I think, with testing, these sorts of limited-domain applications make sense.

It'll be much harder for more open-ended real-world problems, where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example, a glass smashing on the floor: the model doesn't simulate that causally, AFAIK.

therein · 2 years ago
> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting

There was that article a few months ago about how basically that's what the cerebellum does.

mdorazio · 2 years ago
FWIW, you've basically described at a high level exactly what autonomous driving systems have been doing for several years. I don't think anyone would say that Waymo's cars are really close to AGI.
neom · 2 years ago
Figure out how to incorporate a quantum computer as a prediction engine in this idea, and you've got quite the robot on your hands. :)

(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)

liamYC · 2 years ago
This comment is brilliant. Thank you. I'm so excited now to build a bot that uses predictive video. I wonder what the simplest prototype would be? Surely one with a simple validation loop that can say: hey, this predicted video came true. Perhaps a 2D infinite-scrolling video game?
anirudhv27 · 2 years ago
Imagine having real-time transfer of characteristics within your world in a VR/mixed reality setup. Automatically generating new views within the environment you are currently in could create pretty interesting experiences.
metabagel · 2 years ago
This sounds like it has military applications, not that I’m excited at the prospect.
pyinstallwoes · 2 years ago
So basically a brain in a vat, reality as we experience it, our thoughts as prompts.
deadbabe · 2 years ago
Imagine putting on some AR goggles,

staring at a painting in a museum,

then immediately jumping into an entire VR world based on the painting, generated by an AI rendering it out on the fly.

jimmySixDOF · 2 years ago
BlockadeLabs has been doing 3D text-to-skybox generation. It's not exactly real-time at the moment, but I have seen it work in a headset and it definitely feels like the future.
blueprint · 2 years ago
how would you define AGI?
aurareturn · 2 years ago
Sounds like simulation theory is closer and closer to being proven.
littlestymaar · 2 years ago
Our ability to build somewhat convincing simulations of things has never been proof that we live in a simulation…
fnordpiglet · 2 years ago
Except there is always an original at the root. There’s no way to prove that’s not us.
SushiHippie · 2 years ago
I like that this one shows some "fails", and not just the top of the top results:

For example, the surfer is surfing in the air at the end:

https://cdn.openai.com/tmp/s/prompting_7.mp4

Or this "breaking" glass that does not break, but spills liquid in some weird way:

https://cdn.openai.com/tmp/s/discussion_0.mp4

Or the way this person walks:

https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...

Or wherever this map is coming from:

https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...

chkaloon · 2 years ago
I've also noticed on some of the featured videos that there are some perspective/parallax errors. The human subjects in some are either oversized compared to background people, or they end up on horizontal planes that don't line up properly. It's actually a bit vertigo-inducing! It is still very remarkable, though.
danans · 2 years ago
> Or wherever this map is coming from:

> https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...

Notice also that at roughly 6 seconds there is a third hand putting the map away.

mr_toad · 2 years ago
> For example, the surfer is surfing in the air at the end

Maybe it’s been watching snowboarding videos and doesn’t quite understand the difference.

SiempreViernes · 2 years ago
> Or the way this person walks:

> https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...

Also, why does she have an umbrella sticking out from her lower back?

zenon · 2 years ago
I suppose the lady usually has an umbrella in this kind of situation, so the model felt an umbrella should be included in some way: https://youtu.be/492tGcBP568
sega_sai · 2 years ago
That is creepy...
hackerlight · 2 years ago
Where do you find the last two?
SushiHippie · 2 years ago
Part of the website changes after each video finishes and switches to the next video. There is no way to control it. These are both "X wearing Y taking a pleasant stroll in Z during W".
coffeebeqn · 2 years ago
The hyper-realistic and plausible movement of the glass breaking makes this bizarrely fascinating. And it doesn't give me the feeling of disgust that the motion in the more primitive AI models did.

modeless · 2 years ago
> Other interactions, like eating food, do not always yield correct changes in object state

So this is why they haven't shown Will Smith eating spaghetti.

> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world

This is exciting for robotics. But an even closer application would be filling holes in Gaussian splatting scenes. If you want to make a 3D walkthrough of a space, you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. It seems like a model this capable could easily produce plausible reconstructions of hidden corners, close-up detail, or other things that would just be holes or blurry patches in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.

SiempreViernes · 2 years ago
Do note that "reconstruction" is not the right word; the proper characterisation of that sort of imputation is "artist's impression": good for situations where the precise details don't matter. Though of course, if the details don't matter, maybe blurry is fine.
YeGoblynQueenne · 2 years ago
Well, yeah, if the details don't matter then you don't need "highly-capable simulators of the physical and digital world". And if the details do matter, then good luck having a good enough simulation of the real world that you can invoke in real time on any kind of mobile hardware.

nopinsight · 2 years ago
AlphaGo and AlphaZero were able to achieve superhuman performance due to the availability of perfect simulators for the game of Go. There is no such simulator for the real world we live in (although pure LLMs sort of learn a rough, abstract representation of the world as perceived by humans). Sora is an attempt to build such a simulator using deep learning.

  “Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.)

Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.

Nathanba · 2 years ago
Really interesting how this goes against my intuition. I would have imagined that it's infinitely easier to analyze a camera stream of the real world, generate a polygonal representation of what you see (like you would for a video game), and then make AI decisions over that geometry. Instead, the way AI is going, they'd rather skip all that and work directly on pixel data. Understanding of 3D geometry, perspective, and physics is expected to emerge naturally from the training data.
rasmusfaber · 2 years ago
Another instance of the bitter lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
stravant · 2 years ago
> then generate a polygonal representation of what you see

It's really not that surprising since, to be honest, meshes suck.

They're pretty general graphs but to actually work nicely they have to have really specific topological characteristics. Half of the work you do with meshes is repeatedly coaxing them back into a sane topology after editing them.

roenxi · 2 years ago
There is a perfect simulator of the real world available. It can be recorded with a camera! Once the researchers have a bit of time to get their bearings and figure out how to train an order of magnitude faster we'll get there.
throwaway290 · 2 years ago
That's still not a simulation if a camera recording shows only what we can see.
guybedo · 2 years ago
I think it's Yann LeCun who has stated a few times that video is a better way to train large models, as it's more information-dense.

The results really are impressive. Being able to generate such high-quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world: object interactions, 3D composition, etc.

Although image generation already requires the model to know a lot about the world, I think there's a really huge gap with video generation, where the model needs to "know" 3D, object movements, and interactions.

iliane5 · 2 years ago
Watching an entirely generated video of someone painting is crazy.

I can't wait to play with this but I can't even imagine how expensive it must be. They're training in full resolution and can generate up to a minute of video.

Seeing how bad video generation was, I expected it would take a few more years to get to this, but it seems like this is another case of "Add data & compute"(TM), where transformers prove once again they'll learn everything and be great at it.

data-ottawa · 2 years ago
I know the main post has been getting a lot of reaction, but this page absolutely blew me away. The results are striking.

The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.

lairv · 2 years ago
I find it wild that this model has no explicit 3D prior, yet learns to generate videos with such 3D consistency that you can directly train a 3D representation (NeRF-like) from them: https://twitter.com/BenMildenhall/status/1758224827788468722
Nihilartikel · 2 years ago
I was similarly astonished at this adaptation of Stable Diffusion to make HDR spherical environment maps from existing images: https://diffusionlight.github.io/

The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image to reflect what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.

emadm · 2 years ago
Yeah, we were surprised by that; video models are great 3D priors, and image models are great video model priors.
larschdk · 2 years ago
You aren't looking carefully enough. I find so many inconsistencies in these examples: perspectives that are completely wrong when the camera rotates, windows that shift perspective, patios that are suddenly deep or shallow, shadows that appear and disappear as the camera shifts. In other examples: paths, objects, and people suddenly appearing or disappearing out of nowhere; a stone turning into a person; a horse that suddenly has a second head, then becomes a separate horse with only two legs.

It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than I was by Google's DeepDream, which is 10 years old.

lairv · 2 years ago
You can literally run 3D algorithms like NeRF or COLMAP on those videos (check the tweet I linked); it's not just my opinion: the videos are sufficiently 3D-consistent that you can extract 3D geometry from them.

Sure, it's not perfect, but that was not the case for previous video generation models.

beezlebroxxxxxx · 2 years ago
Yeah, it seems to have a hard time processing lens distortion in particular, which gives it a very weird quality. It's actually bending things, or trying to fill in the gaps, instead of distorting the image in the "correct" way.
crooked-v · 2 years ago
That leaves me wondering if it'd be possible to get some variant of the model to directly output 3D meshes and camera animation instead of an image.
nodja · 2 years ago
This is also true for 2D diffusion models [1]. I suppose they need to understand how 3D works for things like lighting, shadows, object occlusion, etc.

[1] https://dreamfusion3d.github.io/

TOMDM · 2 years ago
I wonder how much it'd improve if trained on stereo image data.
QuadmasterXLII · 2 years ago
A moving camera is just stereo.