dang · 2 years ago
Related ongoing thread: Video generation models as world simulators - https://news.ycombinator.com/item?id=39391458 - Feb 2024 (43 comments)

Also (since it's been a while): there are over 2000 comments in the current thread. To read them all, you need to click the More links at the bottom of the page, or go directly like this:

https://news.ycombinator.com/item?id=39386156&p=2

https://news.ycombinator.com/item?id=39386156&p=3

https://news.ycombinator.com/item?id=39386156&p=4 [etc.]

crazygringo · 2 years ago
This is insane. But I'm impressed most of all by the quality of motion. I've quite simply never seen convincing computer-generated motion before. Just look at the way the woolly mammoths connect with the ground and how their lumbering mass feels real.

Motion-capture works fine because that's real motion, but every time people try to animate humans and animals, even in big-budget CGI movies, it's always ultimately obviously fake. There are so many subtle things that happen in terms of acceleration and deceleration of all of the different parts of an organism, that no animator ever gets it 100% right. No animation algorithm gets it to a point where it's believable, just where it's "less bad".

But these videos seem to make the motion entirely believable for both people and animals. Which is wild.

And then of course, not to mention that these are entirely believable 3D spaces, with seemingly full object permanence. As opposed to other efforts I've seen which are basically briefly animating a 2D scene to make it seem vaguely 3D.

patall · 2 years ago
I disagree, just look at the legs of the woman in the first video. First she seems to be limping, then the legs rotate. The mammoths are totally uncanny for me, as they're both running and walking at the same time.

Don't get me wrong, it is impressive. But I think many people will be very uncomfortable with such motion very quickly. Same story as the fingers before.

netcan · 2 years ago
> I think many people will be very uncomfortable with such motion very quickly

So... I think OP's point stands. (impressive, surpasses human/algorithmic animation thus far).

You're also right. There are "tells." But, a tell isn't a tell until we've seen it a few times.

Jaron Lanier makes a point about novel technology. The first gramophone users thought it sounded identical to a live orchestra. When very early films depicted a train coming towards the camera, people fell out of their chairs... blurry black and white, at a super slow frame rate, projected on a bedsheet.

Early 3D animation was mindblowing in the 90s. Now it seems like a marionette show. Well... I suppose there was a time when marionette shows were not campy. They probably looked magical.

It seems we need some experience before we internalize the tells and it starts to look fake. My own eye for CG images seems to be improving faster than the quality. We're all learning to recognize GPT-generated text. I'm sure these motion captures will look more fake to us soon.

That said... the fact that we're having this discussion proves that what we have here is "novel." We're looking at a breakthrough in motion/animation.

Also, I'm not sure "real" is necessary. For games or film what we need is rich and believable, not real.

justworkout · 2 years ago
I think a lot of these issues could be "solved" by lowering the resolution, using a low quality compression algorithm, and trimming clips down to under 10 seconds.

And by solved, I mean they'll create convincing clips that'll be hard for people to dismiss unless they're really looking closely. I think it's only a matter of time until fake video clips lead to real life outrage and violence. This tech is going to be militarized before we know it.

lukan · 2 years ago
Yeah, it looks good at first glance. Also, the fingers are still weird. And I suppose for every somewhat working vid, there were dozens of garbage ones. At least that was my experience with image generation.

I don't believe movie makers will be out of business any time soon. They will have to incorporate it, though. So far this can make convincing background scenery.

sinuhe69 · 2 years ago
Yeah. I think people nowadays are in a kind of AI euphoria and they take every advancement in AI for more than what it really is. The realization of their limitations will set in once people have been working long enough on the stuff. The capacities of the newfangled AIs are impressive. But even more impressive are their mimicry capabilities.
dugite-code · 2 years ago
And further down the page:

"The camera follows behind a white vintage SUV with a black roof": The letters clearly wobble inconsistently.

"A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast": The woman in the white dress in the bottom left suddenly splits into multiple people like she was a single cell microbe multiplying.

Hoasi · 2 years ago
> I disagree, just look at the legs of the woman in the first video.

The people behind her all walk at the same pace and seem to be floating. The moving reflections, on the other hand, are impressive make-believe.

matt_s · 2 years ago
Yep. If you look at the detail you can find obvious things wrong, and these are limited to 60s in length with zero audio, so I doubt full motion picture movies are going to be replaced anytime soon. B-roll background video or AI-generated backgrounds for a green screen, sure.

I would expect any subscription to use this service when it comes out to be very expensive. At some point I have to imagine the GPU/CPU horsepower needed will outweigh the monetary costs that could be recovered. Storage costs too. It's much easier to tinker with generating text or static images in that regard.

Of note: NVDA's quarterly results come out next week.

anigbrowl · 2 years ago
> Same story as the fingers before.

This is weird to me considering how much better this is than the SOTA still images 2 years ago. Even though there's weirdo artefacts in several of their example videos (indeed including migrating fingers), that stuff will be super easy to clean up, just as it is now for stills. And it's not going to stop improving.

bamboozled · 2 years ago
Agreed and these are the cherry picked examples of course.
samstave · 2 years ago
> just look at the legs of the woman

Denise Richards' hard, sharp knees in '97

--

this infant tech is already insanely good... just wait, and rather try to focus on "what should I be betting on 5 years from now?"

I suggest 'invisibility cloaks' (ghosts in machines?)

jstummbillig · 2 years ago
> But I think many people will be very uncomfortable with such motion very quickly.

Given the momentum in this space, I think you will have to get very uncomfortable super quickly about any of the shortcomings of any particular model.

josemanuel · 2 years ago
At second 15 of the woman video, the legs switch sides!! Definitely there are some glitches :)
4b11b4 · 2 years ago
The left and right side of her face are almost... a different person.
gerash · 2 years ago
When others create text-to-video systems (e.g. Lumiere from Google) they publish the research (e.g. https://arxiv.org/pdf/2401.12945.pdf). OpenAI is all about commercialization. I don't like their attitude.
comex · 2 years ago
Google is hardly a good actor here. They just announced Gemini 1.5 along with a "technical report" [1] whose entire description of the model architecture is: "Gemini 1.5 Pro is a sparse mixture-of-expert (MoE) Transformer-based model". Followed by a list of papers that it "builds on", followed by a definition of MoE. I suppose that's more than OpenAI gave in their GPT-4 technical report. But not by much!

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...

jstummbillig · 2 years ago
Not to be overly cute, but if the cutting edge research you do is maybe changing the world fundamentally, forever, guarding that tech should be really, really, really far up your list of priorities and everyone else should be really happy about your priorities.

And that should probably take precedence over the semantics of your moniker, every single time (even if hn continues to be super sour about it)

y_gy · 2 years ago
Ironic, isn't it! OpenAI started out "open," publishing research, and now "ClosedAI" would be a much better name.
neya · 2 years ago
When has OpenAI - for a company named "Open" AI - ever released any of their stuff as anything open?
disillusioned · 2 years ago
More like ClosedAI, amirite?
mtillman · 2 years ago
OAI requires a real mobile phone number to sign up and is therefore an adtech company.
Sohcahtoa82 · 2 years ago
> Motion-capture works fine because that's real motion

Except in games where they mo-cap at a frame rate lower than what it will be rendered at and just interpolate between mo-cap samples, which turns snappy movements into smooth movements, and the motion ends up in the uncanny valley.

It's especially noticeable when a character is talking and makes a "P" sound. In a "P", your lips basically "pop" open. But if the motion is smoothed out, it gives the lips the look of making an "mm" sound. The lips of someone saying "post" looks like "most".

At 30 fps, it's unnoticeable. At 144 fps, it's jarring once you see it and can't unsee it.
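
For anyone curious, a toy sketch (in Python, with made-up numbers) of the effect described here: naive linear interpolation between sparse mo-cap keyframes rounds off a snappy motion.

    # Lip opening sampled at 30 Hz around a "P" sound: closed, closed, popped open.
    keys = [0.0, 0.0, 1.0]

    def lerp(a: float, b: float, t: float) -> float:
        return a + (b - a) * t

    # Rendering at 4x the capture rate inserts in-between frames that never existed,
    # so the lips drift open gradually instead of popping.
    rendered = []
    for i in range(len(keys) - 1):
        for step in range(4):
            rendered.append(lerp(keys[i], keys[i + 1], step / 4))
    rendered.append(keys[-1])
    print(rendered)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0]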

omega3 · 2 years ago
Out of all the examples, the woolly mammoths one actually feels the most like CGI to me; the other ones are much more believable than this one.
mtlmtlmtlmtl · 2 years ago
Possibly because there are no videos or even photos of live wooly mammoths, but loads and loads of CG recreations in various documentaries.
mikeInAlaska · 2 years ago
I saw that the cat in the bed grows an extra limb...
windowshopping · 2 years ago
Huh, strong disagree. I've seen realistic CGI motion many times and I don't consider this to feel realistic at all.
bamboozled · 2 years ago
I’m a bit thrown off by the fact that the mammoths are steaming. Is that normal for mammoths?
throw310822 · 2 years ago
Good question :)
colordrops · 2 years ago
You might just be subject to confirmation bias here. Perhaps there were scenes and entities you didn't realize were CGI due to high quality animation, and thus didn't account for them in your assessment.
lastdong · 2 years ago
Regarding CGI, I think it has become so good that you don't know it's CGI. Look at the dog in Guardians of the Galaxy 3. There's a whole series on YouTube called "no cgi is really just invisible cgi" that I recommend watching.

And as with CGI, models like Sora will get better until you can't tell it apart from reality. It's not there yet, but it's an immense, astonishing breakthrough.

kitd · 2 years ago
Maybe it's my anthropocentric brain, but the animals move realistically while the people still look quite off.

It's still an unbelievable achievement though. I love the paper seahorse whose tail is made (realistically) using the paper folds.

samstave · 2 years ago
Serious: Can one just pipe in an SRT (subtitle file) and then tell it to compare its version to the mp4, and then be able to command it to zoom, enhance, edit, and basically use it to remould content? I think this sounds great!
geor9e · 2 years ago
It's possible that, through sheer volume of training, the neural network essentially has a 3D engine going on, or at least picked up enough of the rules of light and shape and physics to look the same as Unreal or Unity.
samsullivan · 2 years ago
It would have to in order to produce these outputs. Our brains have crazy physics engines, though; F1 drivers can simulate an entire race in their heads.
djmips · 2 years ago
I'm not sure I feel the same way about the mammoths - and as someone who grew up in a snowy area, the billowing snow makes no sense to me. If the snow were powder, maybe, but that's not what's depicted on the ground.
isthispermanent · 2 years ago
Pixar is computer generated motion, no?
viewtransform · 2 years ago
Main Pixar characters are all computer animated by humans. Physics effects like water, hair, clothing, smoke and background crowds use computer physics simulation, but there are handles allowing an animator to direct the motion as per the director's wishes.
minimaxir · 2 years ago
With extreme amounts of man-hours to do so.
kaba0 · 2 years ago
> I've quite simply never seen convincing computer-generated motion before

I’m fairly sure you have seen it many times; it was just so convincing that you didn't realize it was CGI. It's a fundamentally biased way to sample it, as you won't see examples of well-executed stuff.

globular-toast · 2 years ago
Nah, this still has the problem with connecting surfaces that never seems to look right in any CGI. It's actually interesting that it doesn't look right here either, considering they are completely different techniques.
swamp40 · 2 years ago
It's been trained on videos exclusively. Then GPT-4 interprets your prompt for it.
belter · 2 years ago
Just set up a family password last week... Now it seems every member of the family will have to become their own certificate authority and carry an MFA device.

"Worried About AI Voice Clone Scams? Create a Family Password" - https://www.eff.org/deeplinks/2024/01/worried-about-ai-voice...

unsigner · 2 years ago
Don't think of them as "computer-generated" any more than your phone's heavily processed pictures are "computer-generated", or JWST's false color, IR-to-visible pictures are "computer-generated".

This article makes a convincing argument: https://studio.ribbonfarm.com/p/a-camera-not-an-engine

lynguist · 2 years ago
That is such a gem of an article that looks at AI with a new lens I haven’t encountered before:

- AI sees and doesn’t generate

- It is dual to economics, which pretends to describe but actually generates

sebastiennight · 2 years ago
I think the implications go much further than just the image/video considerations.

This model shows a very good (albeit not perfect) understanding of the physics of objects and relationships between them. The announcement mentions this several times.

The OpenAI blog post lists "Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care." as one of the "failed" cases. But this (and "Reflections in the window of a train traveling through the Tokyo suburbs.") seem to me to be 2 of the most important examples.

- In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

- In the chair one, OpenAI says the model failed to model the physics of the object (which hints that it did try to, which is not how the early diffusion models worked; they just tried to generate "plausible" images). And we can see one of the archeologists basically chasing the chair down to grab it, which does correctly model the interaction with a floating object.

I think we shouldn't underestimate how crucial that is to the building of a general model that has a strong model of the world. Not just a "theory of mind", but a literal understanding of "what will happen next", independently of "what would a human say would happen next" (which is what the usual text-based models seem to do).

This is going to be much more important, IMO, than the video aspect.

bamboozled · 2 years ago
Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects that other objects cannot pass through?

Maybe I'm missing the big picture here, but the above, and all the weird spatial errors like the miniaturization of people, make me think you're wrong.

Clearly the model is an achievement and doing something interesting to produce these videos, and they are pretty cool, but understanding physics seems like quite a stretch?

I also don't really get the excitement about the girl on the train in Tokyo:

> In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo

I don't know a lot about how this model works personally, but I'm guessing that in the training data the vast majority of videos of people riding trains in Tokyo featured Asian people. Assuming this model works on statistics like all of the other models I've seen recently from OpenAI, why is it interesting that the girl in the reflection was Asian? Did you not expect that?

csomar · 2 years ago
> Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects that other objects cannot pass through?

This just hit me, but humans do not have a good understanding of physics; or maybe most humans have no understanding of physics. We just observe and recognize whether it's familiar or not.

That being the case, AI will need to be way more powerful than a human mind. Maybe orders of magnitude more "neural networks" than a human brain has.

pera · 2 years ago
I agree; to me the clearest example is how the rocks in the sea vanish/transform after the wave: the generated frames are hyperreal for sure, but the represented space looks as consistent as a dream.
pests · 2 years ago
They could test this by trying to generate the same image but set in New York, etc. I bet it would still be Asian.
barfingclouds · 2 years ago
Give it a year
livshitz · 2 years ago
The answer could be in between. Who said diffusion models are limited to 2D pixel generations?
RhysU · 2 years ago
> very good... understanding of the physics of objects and relationships between them

I am always torn here. A real physics engine has a better "understanding" but I suspect that word applies to neither Sora nor a physics engine: https://www.wikipedia.org/wiki/Chinese_room

An understanding of physics would entail asking this generative network to invert gravity, change the density or energy output of something, or atypically reduce a coefficient of friction partway through a video. Perhaps Sora can handle these, but I suspect it is mimicking the usual world rather than understanding physics in any strong sense.

None of which is to say their accomplishment isn't impressive. Only that "understand" merits particularly careful use these days.

mewpmewp2 · 2 years ago
Question is - how much do you need to understand something in order to mimic it?

The Chinese Room, however, seems to point to some sort of prewritten if-else algorithm: someone following scripted algorithmic procedures might not understand the content. But obviously this simplification is not the case with LLMs or this video generation, since that kind of algorithmic scripting requires pre-written scripts.

The Chinese Room seems to refer more to cases like "if someone tells me 'xyz', then respond with 'abc'" - of course then you don't understand what xyz or abc mean - but it's not referring to neural networks training on a ton of material to build this model representation of things.

seydor · 2 years ago
Facebook released something in that direction today https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...
sebastiennight · 2 years ago
Wow this is a huge announcement too, I can't believe this hasn't made the front page yet.
gspetr · 2 years ago
This seems to be completely in line with the previous "AI is good when it's not news" type of work:

Non-news: Dog bites a man.

News: Man bites a dog.

Non-news: "People riding Tokyo train" - completely ordinary, tons of similar content.

News: "Archaeologists dust off a plastic chair" - bizarre, (virtually) no similar content exists.

sva_ · 2 years ago
I found the one about the people in Lagos pretty funny. The camera does about a 360-degree spin in total; in the beginning there are markets, then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

> A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.

> https://cdn.openai.com/sora/videos/lagos.mp4

bamboozled · 2 years ago
Also, the woman in red next to the people is very tiny, the market stall is also a mini market stall, and the table is made out of a bike.

For everyone that's carrying on about this thing understanding physics and having a model of the world... it's an odd world.

po · 2 years ago
In the video of the girl walking down the Tokyo city street, she's wearing a leather jacket. After the closeup on her face they pull back and the leather jacket has hilariously large lapels that weren't there before.
vingt_regards · 2 years ago
There are also perspective issues: the relative sizes of the foreground (the people sitting at the café) and the background (the market) are incoherent. Same with the "snowy Tokyo with cherry blossoms" video.
lostemptations5 · 2 years ago
Though I'm not sure of your point here -- outside of America, in Asia and Africa, these sorts of markets mixed in with skyscrapers are perfectly normal. There is nothing unusual about it.
PoignardAzur · 2 years ago
Yeah, some of the continuity errors in that one feel horrifying.
cruffle_duffle · 2 years ago
> then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

Ah but you see that is artistic liberty. The director wanted it shot that way.

XCSme · 2 years ago
It doesn't understand physics.

It just computes next frame based on current one and what it learned before, it's a plausible continuation.

In the same way, ChatGPT struggles with math without code interpreter, Sora won't have accurate physics without a physics engine and rendering 3d objects.

Now it's just a "what is the next frame of this 2D image" model plus some textual context.

yberreby · 2 years ago
> It just computes next frame based on current one and what it learned before, it's a plausible continuation.

...

> Now it's just a "what is the next frame of this 2D image" model plus some textual context.

This is incorrect. Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once.

[1]: https://openai.com/research/video-generation-models-as-world...

famouswaffles · 2 years ago
GPT-4 doesn't "struggle with math". It does fine. Most humans aren't any better.

Sora is not autoregressive anyway, but there's nothing "just" about next frame/token prediction.

timdiggerm · 2 years ago
> In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

How is this any more accurate than saying that the model has mostly seen Asian people in footage of Tokyo, and thus it is most likely to generate Asian-features for a video labelled "Tokyo"? Similarly, how many videos looking out a train window do you think it's seen where there was not a reflection of a person in the window when it's dark?

shostack · 2 years ago
I'm hoping to see progress towards consistent characters, objects, scenes, etc. So much of what I'd want to do creatively hinges on needing persistent characters who don't change appearance/clothing/accessories from usage to usage. Or creating a "set" for a scene to take place in repeatedly.

I know with Stable Diffusion there are things like LoRA and ControlNet, but they are clunky. We still seem to have a long way to go towards scene and story composition.

Once we do, it will be a game changer for redefining how we think about things like movies and television when you can effectively have them created on demand.

treesciencebot · 2 years ago
This is leaps and bounds beyond anything out there, including both public models like SVD 1.1 and Pika Labs' / Runway's models. Incredible.
drdaeman · 2 years ago
Let's not get ahead of ourselves. Those are specifically crafted, hand-picked good videos, where there wasn't any requirement but "write a generic prompt and pick something that looks good." Which is very different from the actual process, where you have a very specific idea and want the machine to make it happen.

DALL-E presentation also looked cool and everyone was stoked about it. Now that we know of its limitations and oddities? YMMV, but I'd say not so much - Stable Diffusion is still the go-to solution. I strongly suspect the same thing with Sora.

treesciencebot · 2 years ago
The examples are most certainly cherry-picked. But the problem is there are 50 of them. And even if you gave me 24 hours of full access to SVD1.1/Pika/Runway (anything out there that I can use), I wouldn't be able to get 5 examples that match these in quality (~temporal consistency/motions/prompt following) and, more importantly, in length. Maybe I am overly optimistic, but this seems too good.
htrp · 2 years ago
https://twitter.com/sama/status/1758200420344955288

They're literally taking requests and doing them in 15 minutes.

999900000999 · 2 years ago
The year is 2030.

Sarah is a video sorter, this was her life. She graduated top of her class in film, and all she could find was the monotonous job of selecting videos that looked just real enough.

Until one day, she couldn't believe it. It was her. A video of her in that very moment, sorting. She went to pause the video, but stopped when her doppelganger did the same.

dragonwriter · 2 years ago
> Stable Diffusion is still the go-to solution. I strongly suspect the same thing with Sora.

Sure, for people who want detailed control with AI-generated video, workflows built around SD + AnimateDiff, Stable Video Diffusion, MotionDiff, etc., are still going to beat Sora for the immediate future, and OpenAI's approach structurally isn't as friendly to developing a broad ecosystem adding power on top of the base models.

OTOH, the basic simple prompt-to-video capacity of Sora now is good enough for some uses, and where detailed control is not essential that space is going to keep expanding -- one question is how much their plans for safety checking (which they state will apply both to the prompt and every frame of output) will cripple this versus alternatives, and how much the regulatory environment will or won't make it possible to compete with that.

famouswaffles · 2 years ago
It doesn't matter if they're cherrypicked when you can't match this quality with SD or Pika regardless of how much time you had.

And I still prefer DALL-E 3 to SD.

sebzim4500 · 2 years ago
In the past the examples tweeted by OpenAI have been fairly representative of the actual capabilities of the model. i.e. maybe they do two or three generations and pick the best, but they aren't spending a huge amount of effort cherry-picking.
ChildOfChaos · 2 years ago
Stable Diffusion is not the go-to solution; it's still behind Midjourney and DALL-E.
educaysean · 2 years ago
Would love to see handpicked videos from competitors that can hold their own against what SORA is capable of
barfingclouds · 2 years ago
Look at Sam Altman's Twitter, where he made videos on demand from what people prompted him.
schleck8 · 2 years ago
Wrong, this is the first time I've seen an astronaut with a knit cap.
blibble · 2 years ago
they're not fantastic either if you pay close attention

there are mini-people in the 2060s market and in the cat one an extra paw comes out of nowhere

throwaway4233 · 2 years ago
While Sora might be able to generate short 60-90 second videos, how well it would scale with a larger prompt or a longer video remains to be seen. And the general approach of having the model do 90% of the work for you and then editing what is required might be harder with videos.
davidbarker · 2 years ago
I'm almost speechless. I've been keeping an eye on the text-to-video models, and if these example videos are truly indicative of the model, this is an order of magnitude better than anything currently available.

In particular, looking at the video titled "Borneo wildlife on the Kinabatangan River" (number 7 in the third group), the accurate parallax of the tree stood out to me. I'm so curious to learn how this is working.

[Direct link to the video: https://player.vimeo.com/video/913130937?h=469b1c8a45]

calgoo · 2 years ago
The video of the gold rush town just makes me think of what games like Red Dead and GTA could look like.
QuadmasterXLII · 2 years ago
The diffusion is almost certainly taking place over some sort of compressed latent, from the visual quirks of the output I suspect that the process of turning that latent into images goes latent -> nerf / splat -> image, not latent -> convolutional decoder -> image

Zelphyr · 2 years ago
Agreed. It's amazing how much of a head start OpenAI appears to have over everyone else. Even Microsoft who has access to everything OpenAI is doing. Only Microsoft could be given the keys to the kingdom and still not figure out how to open any doors with them.
sailingparrot · 2 years ago
Microsoft doesn’t have access to OpenAI’s research; this was part of the deal. They only have access to the weights and inference code of production models, and even then, who has access to that inside MS is extremely gated: only a few employees have access, based on an absolute need to actually run the service.

AI researchers at MSFT barely have more insight into OpenAI than you do reading HN.

pcbro141 · 2 years ago
Many people say the same about Google/DeepMind.

SeanAnderson · 2 years ago
Eh. MSFT owns 49% of OpenAI. Doesn't really seem like they need to do much except support them.
sschueller · 2 years ago
Yes, but I am stuck in their (American) view of what is considered appropriate. Not what is legal, but what they determine to be OK to produce.

Good luck generating anything similar to an 80s action movie. The violence and light nudity will prevent you from generating anything.

Xirgil · 2 years ago
I suspect it's less about being puritanical about violence and nudity in and of themselves, and more a blanket ban to make up for the inability to prevent the generation of actually controversial material (nude images of pop stars, violence against politicians, hate speech).
throwitaway222 · 2 years ago
I am guessing a movie studio will get different access with controls dropped. Of course, that does mean they need to be VERY careful when editing, and making sure not to release a vagina that appears for 1 or 2 frames when a woman is picking up a cat in some random scene.
zamadatix · 2 years ago
I wonder how much of it is really "concern for the children" type stuff vs. not wanting to deal with fights about what should be allowed, how, and to whom right now. When film was new, towns and states started to set up censorship review boards. When mature content became viewable on the web, battles (still ongoing) came up about how much you need to do to prevent minors from accessing it. Now useful AI-generated content is the new thing, and you can avoid this kind of distraction by going this route instead.

I'm not supporting it in any way, I think you should be able to generate and distribute any legal content with the tools, but just giving a possible motive for OpenAI being so conservative whenever it comes to ethics and what they are making.

golergka · 2 years ago
I've been watching 80s movies recently, and the amount of nudity and sex scenes often feels unnecessary. I'm definitely not a prude. I watch porn, I talk about sex with friends, I go to kinky parties sometimes. But it really feels that a lot of movies sacrificed stories to increase sex appeal — and now that people have free and unlimited access to porn, movies can finally be movies.
TulliusCicero · 2 years ago
It's not a particularly American attitude to be opposed to violence in media though, American media has plenty of violence.

They're trying to be all-around uncontroversial.

jsheard · 2 years ago
Where is the training material for this coming from? The only resource I can think of that's broad enough for a general purpose video model is YouTube, but I can't imagine Google would allow a third party to scrape all of YT without putting up a fight.
Zetobal · 2 years ago
It's movies; the shots are way too deliberate to have random YouTube crap in the dataset.
xnx · 2 years ago
It's very good. Unclear how far ahead of Lumiere it is (https://lumiere-video.github.io/) or if it's more of a difference in prompting/settings.
vunderba · 2 years ago
The big standout to me, beyond almost any other text-to-video solution, is that the video duration is tremendously longer (a minute+). Everything else that I've seen can't get beyond 15 to 20 seconds at the absolute maximum.
ehsankia · 2 years ago
In terms of following the prompt and generating visually interesting results, I think they're comparable. But the resolution for Sora seems so far ahead.

Worth noting that Google also has Phenaki [0] and VideoPoet [1] and Imagen Video [2]

[0] https://sites.research.google/phenaki/

[1] https://sites.research.google/videopoet/

[2] https://imagen.research.google/video/

mizzao · 2 years ago
Must be intimidating to be on the Pika team at the moment...
alokjnv10 · 2 years ago
you nailed it
rvz · 2 years ago
All those startups have been squeezed in the middle. Pika, Runway, etc might as well open source their models.

Or Meta will do it for them.

iLoveOncall · 2 years ago
It is incredible indeed, but I remember there was a humongous gap between the demoed pictures for DALL-E and what most prompts would generate.

Don't get overly excited until you can actually use the technology.

JKCalhoun · 2 years ago
I know it's Runway (and has all manner of those dream-like AI artifacts) but I like what this person is doing with just a bunch of 4-second clips and an awesome soundtrack:

https://youtu.be/JClloSKh_dk

https://youtu.be/upCyXbTWKvQ

jasonjmcghee · 2 years ago
I agree in terms of raw generation, but runway especially is creating fantastic tooling too.
jug · 2 years ago
Yup, it's been even several months! ;) But now we finally have another quantum leap in AI.
Animats · 2 years ago
The Hollywood Reporter says many in the industry are very scared.[1]

“I’ve heard a lot of people say they’re leaving film,” he says. “I’ve been thinking of where I can pivot to if I can’t make a living out of this anymore.” - a concept artist responsible for the look of the Hunger Games and some other films.

"A study surveying 300 leaders across Hollywood, issued in January, reported that three-fourths of respondents indicated that AI tools supported the elimination, reduction or consolidation of jobs at their companies. Over the next three years, it estimates that nearly 204,000 positions will be adversely affected."

"Commercial production may be among the main casualties of AI video tools as quality is considered less important than in film and TV production."

[1] https://www.hollywoodreporter.com/business/business-news/ope...

snewman · 2 years ago
Honest question: of what possible use could Sora be for Hollywood?

The results are amazing, but if the current crop of text-to-image tools is any guide, it will be easy to create things that look cool but essentially impossible to create something that meets detailed specific criteria. If you want your actor to look and behave consistently across multiple episodes of a series, if you want it to precisely follow a detailed script, if you want continuity, if you want characters and objects to exhibit consistent behavior over the long term – I don't see how Sora can do anything for you, and I wouldn't expect that to change for at least a few years.

(I am entirely open to the idea that other generative AI tools could have an impact on Hollywood. The linked Hollywood Reporter article states that "Visual effects and other postproduction work stands particularly vulnerable". I don't know much about that, I can easily believe it would be true, but I don't think they're talking about text-to-video tools like Sora.)

Animats · 2 years ago
I suspect that one of the first applications will be pre-viz. Before a big-budget movie is made, a cheap version is often made first. This is called "pre-visualization". These text to video applications will be ideal for that. Someone will take each scene in the script, write a big prompt describing the scene, and follow it with the dialog, maybe with some commands for camerawork and cuts. Instant movie. Not a very good one, but something you can show to the people who green-light things.

There are lots of pre-viz reels on line. The ones for sequels are often quite good, because the CGI character models from the previous movies are available for re-use. Unreal Engine is often used.

QuadmasterXLII · 2 years ago
People are extrapolating out ten years. They will still have to eat and pay rent in ten years.
Karuma · 2 years ago
It wouldn't be too hard to do any of the things you mention. See ControlNet for Stable Diffusion, and vid2vid (if this model does txt2vid, it can also do vid2vid very easily).

So you can just record some guiding stuff, similar to motion capture but with just any regular phone camera, and morph it into anything you want. You don't even need the camera, of course, a simple 3D animation without textures or lighting would suffice.

Also, a consistent look was solved very early on, once we had free models like Stable Diffusion.
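
For reference, a minimal sketch of the ControlNet-style guidance described above, using the open-source diffusers library with Stable Diffusion (the checkpoint IDs and the guide-image path are just illustrative; this is not Sora's pipeline):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # Any pose- or edge-conditioned ControlNet checkpoint works here.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # e.g. a pose skeleton extracted from one frame of a phone video
    guide = load_image("frame_0001_pose.png")

    out = pipe(
        prompt="an armored knight walking through a ruined city, cinematic lighting",
        image=guide,
        num_inference_steps=30,
    ).images[0]
    out.save("styled_frame_0001.png")

Run per frame (or with a video-aware variant), that's the "record some guiding stuff, then morph it" workflow.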

quickthrower2 · 2 years ago
Right now you’d need an artistic/ML mixed team. You wouldn't use an off-the-shelf tool. There was a video of some guys doing this (sorry, can't find it) to make an anime-type animation. With consistent characters. They used videos of themselves run through their own models to make the characters. So I reckon that while prompt -> blockbuster is not here yet, a movie made using mostly AI is possible, but it will cost a lot now; that cost will go down. While this is sad, it is also exciting. And scary. Black Mirror-like, we will start creating AIs we have relationships with and bring people back to life (!) from history, and maybe grieving people will do this. Not sure if that is healthy, but people will do it once it is a click-of-a-button thing.
Qwero · 2 years ago
It shows that good progress is still being made.

Just this week, an SD audio model was shown making good audio effects like doors, etc.

If this continues (and it seems it will) it will change the industry tremendously.

kranke155 · 2 years ago
It won’t be Hollywood at first. It will be small social ads for TikTok, IG and social media. The brands likely won't even care if they don't get copyright at the end, since they have copyright of their product.

Source: I work in this.

MauranKilom · 2 years ago
The OpenAI announcement mentions being able to provide an image to start the video generation process from. That sounds to me like it will actually be incredibly easy to anchor the video generation to some consistent visual - unlike all the text-based stable diffusion so far. (Yes, there is img2img, but that is not crossing the boundary into a different medium like Sora).
theptip · 2 years ago
Probably a bad time to be an actor.

Amazing time to be a wannabe director or producer or similar creative visionary.

Bad time to be high up in a hierarchical/gatekeeping/capital-constrained biz like Hollywood.

Amazing time to be an aspirant that would otherwise not have access to resources, capital, tools in order to bring their ideas to fruition.

On balance I think the ‘20s are going to be a great decade for creativity and the arts.

gwd · 2 years ago
> Probably a bad time to be an actor.

I don't see why -- the distance between "here's something that looks almost like a photo, moving only a little bit like a mannequin" and "here's something that has the subtle facial expressions and voice to convey complex emotions" is pretty freaking huge; to the point where the vast majority of actual humans fail to be that good at it. At any rate, the number of BNNs (biological neural networks) competing with actors has only been growing, with 8 billion and counting.

> Amazing time to be a wannabe director or producer or similar creative visionary. Amazing time to be an aspirant that would otherwise not have access to resources, capital, tools in order to bring their ideas to fruition.

Perhaps if you mainly want to do things for your own edification. If you want to be able to make a living off it, you're suddenly going to be in a very, very flooded market.

hackermatic · 2 years ago
It's always a bad time to be an actor, between long hours, low pay, and a culture of abuse, but this will definitely make it worse. My writer and artist friends are already despondent from genAI -- it was rare to be able to make art full-time, and even the full-timers were barely making enough money to live. Even people writing and drawing for marketing were not exactly getting rich.

I think this will lead to a further hollowing-out of who can afford to be an actor or artist, and we will miss their creativity and perspective in ways we won't even realize. Similarly, so much art benefits from being a group endeavor instead of someone's solo project -- imagine if George Lucas had created Star Wars entirely on his own.

Even the newly empowered creators will have to fight to be noticed amid a deluge of carelessly generated spam and sludge. It will be like those weird YouTube Kids videos, but everywhere (or at least like indie and mobile games are now). I think the effect will be that many people turn to big brands known for quality, many people don't care that much, and there will be a massive doughnut hole in between.

sva_ · 2 years ago
> Probably a bad time to be an actor.

I'm thinking people will probably still want to see their favorite actors, so established actors may sell the rights to their image. They're sitting on a lot of capital. Bad time to be becoming an actor though.

kranke155 · 2 years ago
You can’t replace actors with this for a long time. Actors are “rendering” faster than any AI. Animation is where the real issues will show up first, particularly in Advertising.
dingclancy · 2 years ago
The idea that this destroys the industry is overblown, because the film industry has already been dying since the 2000s.

Hollywood is already destroyed. It is not the powerful entity it once was.

In terms of attention and time of entertainment, Youtube has already surpassed them.

This will create a multitude more YouTube creators that do not care about getting this right or making a living out of it. It will just take our attention all the same, away from the traditional Hollywood.

Yes, there will still be great films and franchises, but the industry is shrinking.

This is similar to journalism saying that AI will destroy it. Well, there was nothing to destroy, because a bunch of traditional newspapers had already closed shop even before AI came along.

hcarvalhoalves · 2 years ago
They shouldn’t be worried so soon. This will be used to pump out shitty hero movies more quickly, but there will always be demand for a masterpiece after the hype cools down.

This is like a chef worrying about going out of business because of fast food.

FrozenSynapse · 2 years ago
Yeah, but how many will work on that singular masterpiece? The rest will be reduced and won’t have a job to put food on the table

LegitShady · 2 years ago
Without a change in copyright law, I doubt it. The current policy of the USCO is that the products of AI based on prompts like this are not human-authored and can't be copyrighted. No one is going to release AI-created stuff that someone else can reproduce because it's public domain.
neutralx · 2 years ago
Has anyone else noticed the leg swap in the Tokyo video at 0:14? I guess we are past uncanny, but I do wonder if these small artifacts will always be present in generated content.

This also begs the question: if more and more children are introduced to media from a young age and are fed more and more generated content, will they be able to feel "uncanniness", or will they become completely numb to it?

There's definitely an interesting period ahead of us; not yet sure how to feel about it...

snewman · 2 years ago
There are definitely artifacts. Go to the 9th video in the first batch, the one of the guy sitting on a cloud reading a book. Watch the book; the pages are flapping in the wind in an extremely strange way.
daxfohl · 2 years ago
The third batch, the one with the cat, the guy in bed has body parts all over, his face deforms, and the blanket is partially alive.
Crespyl · 2 years ago
In the one with the cat waking up its owner, the owner's shoulder turns into a blanket corner when she rolls over.
Kydlaw · 2 years ago
Yep, I noticed it immediately too. Yet it is subtle, in reality. I'm not that good at spotting imperfections in a picture, but in the video I immediately felt something was not quite right.
elicksaur · 2 years ago
Tangent to feeling numb to it - will it hinder children developing the understanding of physics, object permanence, etc. that our brains have?
lukan · 2 years ago
There have been children that reacted irritated when they could not swipe away real-life objects. The idea is to give kids enough real-world experiences so this does not happen.
sunnybeetroot · 2 years ago
Kids have been exposed to decades of 2D and 3D animations that do not contain realistic physics etc; I’m assuming they developed fine?
dymk · 2 years ago
Kids aren't supposed to have screen time until they're at least a few years old anyways
jrockway · 2 years ago
I noticed at the beginning that cars are driving on the right side of the road, but in Japan they drive on the left. The AI misses little details like that.

(I'm also not sure they've ever had a couple inches of snow on the ground while the cherry blossoms are in bloom in Tokyo, but I guess it's possible.)

throw310822 · 2 years ago
The cat in the "cat wakes up its owner" video has two left front legs, apparently. There is nothing that is true in these videos. They can and do deviate from reality at any place and time and at any level of detail.
hackerlight · 2 years ago
These artefacts go down with more compute. In four years when they attack it again with 100x compute and better algorithms I think it'll be virtually flawless.
lostemptations5 · 2 years ago
I had to go back to 0:14 several times to see if it was really unusual. I get it, of course, but I could probably have watched it 20 times and never noticed.
hank808 · 2 years ago
Yep! Glad I wasn't the only one that saw that. I have a feeling THEY didn't see it or they wouldn't have showcased it.
ryanisnan · 2 years ago
I don't think that's the case. I think they're aware of the limitations and problems. Several of the videos have obvious problems, if you're looking - e.g. people vanishing entirely, objects looking malformed in many frames, objects changing in size incongruent with perspective, etc.

I think they just accept it as a limitation, because it's still very technically impressive. And they hope they can smooth out those limitations.

SirMaster · 2 years ago
They swap multiple times lol. Not to mention it almost always looks like the feet are slightly sliding on the ground with every step.

I mean, there are some impressive things there, but it looks like there's a long way to go yet.

They shouldn't have played it into the close up of the face. The face is so dead and static looking.

micromacrofoot · 2 years ago
certainly not perfect... but "some impressive things" is an understatement, think of how long it took to get halfway decent CGI... this AI thing is already better than clips I've seen people spend days building by hand
xkgt · 2 years ago
This is pretty impressive; it seems that OpenAI consistently delivers exceptional work, even when venturing into new domains. But looking into their technical paper, it is evident that they are benefiting from their own past body of work and from the enormous resources available to them.

For instance, the generational leap in video generation capability of Sora may be possible because:

1. Instead of resizing, cropping, or trimming videos to a standard size, Sora trains on data at its native size. This preserves the original aspect ratios and improves composition and framing in the generated videos. This requires massive infrastructure. It is eerily similar to how GPT-3 benefited from a blunt approach of throwing massive resources at a problem rather than extensively optimizing the architecture, dataset, or pre-training steps.

2. Sora leverages the re-captioning technique from DALL-E 3 by leveraging GPT to turn short user prompts into longer detailed captions that are sent to the video model. Although it remains unclear whether they employ GPT-4 or another internal model, it stands to reason that they have access to a superior captioning model compared to others.

This is not to say that inertia and resources are the only factors differentiating OpenAI; they may have access to a much better talent pool, but that is hard to gauge from the outside.
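
To make point 2 concrete, here is a rough sketch of the re-captioning pattern (whether Sora uses GPT-4 or an internal captioner isn't public; the model name and prompt below are assumptions for illustration only):

    from openai import OpenAI

    client = OpenAI()

    def expand_prompt(user_prompt: str) -> str:
        # Turn a terse user prompt into a long, detailed caption before it is
        # handed to the video model, as described for DALL-E 3 re-captioning.
        response = client.chat.completions.create(
            model="gpt-4",  # assumption: any capable chat model illustrates the idea
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's idea as a long, highly detailed video "
                            "caption: describe subjects, setting, lighting, camera "
                            "movement and mood."},
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content

    detailed_caption = expand_prompt("woolly mammoths walking through snow")
    # detailed_caption is what the video model would then actually condition on.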

Imnimo · 2 years ago
https://openai.com/sora?video=big-sur

In this video, there's extremely consistent geometry as the camera moves, but the texture of the trees/shrubs on the top of the cliff on the left seems to remain very flat, reminiscent of low-poly geometry in games.

I wonder if this is an artifact of the way videos are generated. Is the model separating scene geometry from camera? Maybe some sort of video-NeRF or Gaussian Splatting under the hood?

ethbr1 · 2 years ago
Curious about what current SotA is on physics-infusing generation. Anyone have paper links?

OpenAI has a few details:

>> The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

>> Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.

>> We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.

>> Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

The implied facts that it understands physics of simple scenes and any instances of cause and effect are impressive!

Although I assume that's been SotA-possible for awhile, and I just hadn't heard?
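
For intuition, here's my own rough sketch (plain PyTorch, not OpenAI's code) of what "patches akin to tokens" could look like: a video tensor chopped into small spacetime blocks, each flattened into a token-like vector.

    import torch

    def patchify(video: torch.Tensor, pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
        """video: (T, C, H, W) -> (num_patches, pt*ph*pw*C), assuming T, H, W divide evenly."""
        T, C, H, W = video.shape
        patches = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
        patches = patches.permute(0, 3, 5, 1, 4, 6, 2)  # group the patch-grid dims first
        return patches.reshape(-1, pt * ph * pw * C)    # one row per spacetime patch

    tokens = patchify(torch.randn(16, 3, 256, 256))  # 8 * 16 * 16 = 2048 patch tokens
    print(tokens.shape)                              # torch.Size([2048, 1536])

A diffusion transformer would then denoise the whole set of patch tokens jointly, rather than predicting frames one after another.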

msoad · 2 years ago
On the announcement page, it specifically says Sora does not understand physics
nuz · 2 years ago
I saw similar artifacts in DALL-E 1 a lot (as if the image was pasted onto geometry). Definitely wouldn't surprise me if they used synthetic rasterized data in the training, which could totally create artifacts like this.
thomastjeffery · 2 years ago
The model is essentially doing nothing but dreaming.

I suspect that anything that looks like familiar 3D-rendering limitations is probably a result of the training dataset simply containing a lot of actual 3D-rendered content.

We can't tell a model to dream everything except extra fingers, false perspective, and 3D-rendering compromises.

makin · 2 years ago
Technically we can, that's what negative prompting[1] is about. For whatever reason, OpenAI has never exposed this capability in its image models, so it remains an open source exclusive.

[1] https://stable-diffusion-art.com/how-to-use-negative-prompts...
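
For example, a minimal sketch of negative prompting with the open-source diffusers library (Stable Diffusion); the checkpoint and prompts are just illustrative:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="portrait photo of a hand holding an apple, studio lighting",
        # concepts the sampler is steered away from at each denoising step
        negative_prompt="extra fingers, deformed hands, blurry, 3d render",
        num_inference_steps=30,
    ).images[0]
    image.save("hand_with_apple.png")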

spyder · 2 years ago
It's possible it was pre-trained on 3D renderings first, because it's easy to get almost infinite synthetic data that way, and after that they continued the training on real videos.
iandanforth · 2 years ago
In the car driving on the mountain road video you can see level-of-detail popping artifacts being reproduced, so I think that's a fair guess.
burkaman · 2 years ago
Maybe it was trained on a bunch of 3d Google Earth videos.
downWidOutaFite · 2 years ago
Doesn't look flat to me.

Edit: Here[0] I highlighted a groove in the bushes moving with perfect perspective

[0] https://ibb.co/Y7WFW39

internetter · 2 years ago
Look in the top left corner, on the plane
montag · 2 years ago
My vote is yes - some sort of intermediate representation is involved. It just seems unbelievable that it's end-to-end with 2D frames...
cush · 2 years ago
The water is on par with Avatar. Looks perfect
cush · 2 years ago
Wow, yeah I didn't notice it at first, but looking at the rocks in the background is actually nauseating
jquery · 2 years ago
It looks perfect to me. That's exactly how the area looks in person.
