I got access to the preview; here's what it gave me for "A pelican riding a bicycle along a coastal path overlooking a harbor" - this video has all four versions shown: https://static.simonwillison.net/static/2024/pelicans-on-bic...
Of the four, two were a pelican riding a bicycle. One was a pelican just running along the road, one was a pelican perched on a stationary bicycle, and one had the pelican wearing a weird sort of pelican bicycle helmet.
All four were better than what I got from Sora: https://simonwillison.net/2024/Dec/9/sora/
There's another important contender in the space: the Hunyuan model from Tencent.
My company (Nim) is hosting the Hunyuan model, so here's a quick test (first attempt) at "pelican riding a bycicle" via Hunyuan on Nim:
https://nim.video/explore/OGs4EM3MIpW8
I think it's as good as, if not better than, Sora / Veo.
> A whimsical pelican, adorned in oversized sunglasses and a vibrant, patterned scarf, gracefully balances on a vintage bicycle, its sleek feathers glistening in the sunlight. As it pedals joyfully down a scenic coastal path, colorful wildflowers sway gently in the breeze, and azure waves crash rhythmically against the shore. The pelican occasionally flaps its wings, adding a playful touch to its enchanting ride. In the distance, a serene sunset bathes the landscape in warm hues, while seagulls glide gracefully overhead, celebrating this delightful and lighthearted adventure of a pelican enjoying a carefree day on two wheels.
What does it produce for “A pelican riding a bicycle along a coastal path overlooking a harbor”?
Or, what do Sora and Veo produce for your verbose prompt?
Hard to say about Sora, but the video you shared is most definitely worse than Veo's.
The pelican is doing some weird flying motion, motion blur is hiding a lack of detail, the bicycle is moving fast so the background is blurred, etc. I would even say Sora is better, because I like the slow motion and detail, but it did do something very non-physical.
Veo is clearly the best in this example. It has high detail but also feels the most physically grounded among the examples.
If you'd like to replicate: the sign-up process was very easy, and I was able to run a single generation attempt without trouble. Maybe later, when I want to generate video, I'll use prompt enhancement; without it, the video appears to lose any notion of direction. Most image-generation models I'm aware of do prompt enhancement. I've seen it on Grok+Flow/Aurora and ChatGPT+DALL-E.
Prompt: A pelican riding a bicycle along a coastal path overlooking a harbor
Seed: 15185546
Resolution: 720×480
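For what it's worth, the prompt enhancement mentioned above is typically just an LLM pass that rewrites a terse prompt into something like the verbose one quoted earlier. Here's a minimal sketch, assuming an OpenAI-compatible chat API; the model name and instructions are placeholders, not what Nim, Grok, or ChatGPT actually do:

    # Minimal sketch of a prompt-enhancement pass before video generation.
    # Assumes an OpenAI-compatible chat endpoint; the model name and system
    # instructions are placeholders, not any service's actual pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def enhance_prompt(short_prompt: str) -> str:
        """Expand a terse prompt into one vivid, detailed paragraph."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's video prompt as a single vivid "
                            "paragraph covering subject, action, setting, "
                            "lighting and camera movement. Add no new subjects."},
                {"role": "user", "content": short_prompt},
            ],
        )
        return response.choices[0].message.content

    print(enhance_prompt(
        "A pelican riding a bicycle along a coastal path overlooking a harbor"))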
As long as at least one option is exactly what you asked for, throwing variations at you that don't conform to 100% of your prompt seems like it could be useful, if it gives the model leeway to improve the output in other respects.
I am surprised that the top-right one still shows a cut and switch to a different scene. I would assume that's something that could be trivially filtered out of the training data, as those discontinuities don't seem to be useful either for these short 6-second video segments or for getting an understanding of the real world.
Well yeah, if you look closely at the example videos on the site, one of them is not quite right either:
> Prompt: The sun rises slowly behind a perfectly plated breakfast scene. Thick, golden maple syrup pours in slow motion over a stack of fluffy pancakes, each one releasing a soft, warm steam cloud. A close-up of crispy bacon sizzles, sending tiny embers of golden grease into the air. [...]
In the video, the bacon is unceremoniously slapped onto the pancakes, while the prompt sounds like it was intended to be a separate shot, with the bacon still in the pan? Or, alternatively, everything described in the prompt should have been on the table at the same time?
So, yet again: AI produces impressive results, but it rarely does exactly what you wanted it to do...
Technically speaking, I'd say your expectation is definitely not laid out in the prompt, so anything goes. Believe me, I've had such requirements from users, and I, as a mere human programmer, am never quite sure what they actually want. So I take guesses just like the AI (because simply asking doesn't get you very far; you always have to show something) and take it from there. In other words, if AI works like me, I can pack my stuff already.
This tech is cute, but the only viable outcomes are going to be porn and mass-produced slop that'll be uninteresting before it's even created. Why even bother?
But I'm also seeing some genuinely creative uses of generative video - stuff I could argue has got some genuine creative validity. I am loathe to dismiss an entire technique because it is mostly used to create garbage.
We'll have to figure out how to solve the slop problem - it was already an issue before AI, so maybe this is just hastening the inevitable.
Comments like this one are so predictable and incredulous. As if the current state of the art is the final form of this technology. This is just getting started. Big facepalm.
Winning 2:1 in user preference versus Sora Turbo is impressive. It seems to have very similar limitations to Sora; for example, the leg swapping in the ice-skating video, and the way the beekeeper picking up the jar is at a very unnatural acceleration (like it pops up). Though to my eye it's maybe slightly better at emulating natural movement and physics than Sora. The blog post has slightly more info:
> at resolutions up to 4K, and extended to minutes in length.
https://blog.google/technology/google-labs/video-image-gener...
It looks like Sora is actually the worst performer in the benchmarks, with Kling being the best and the others not far behind.
Anyway, I strongly suspect that the funny meme content that seems to be the practical use case of these video generators won't be possible on either Veo or Sora, because of copyright, political correctness, famous people appearing, or other 'safety'-related reasons.
I’ve been using Kling a lot recently and been really impressed, especially by 1.5.
I was so excited to see Sora come out - only to find it has most of the same problems. And Kling seems to do better in a lot of benchmarks.
I can't quite make sense of it - what OpenAI were showing when they first launched Sora was so amazing. Was it cherry-picked? Or was it using loads more compute than what they've released?
> the jar is at a very unnatural acceleration (like it pops up).
It does pop up. Look at where his hand is relative to the jar when he grabs it versus when he stops lifting it. The hand and the jar are both moving, but the jar isn't physically attached to the grab.
What officials actually say doesn't make a difference anymore. People don't get bamboozled because of a lack of facts; people who get bamboozled are past facts.
Cracks in the system are often places where artists find the new and interesting. The leg swapping of the ice skater is mesmerizing in its own way. It would be useful to be able to direct the models in those directions.
It is great to see a limitations section. What would be even more honest is a very large list of videos generated without any cherry-picking, to judge the expected quality for the average user. Anyway, the lack of more videos suggests that there might be something wrong somewhere.
Just pretend it's a movie about a shape-shifting alien that's just trying its best at ice skating; art is subjective like that, isn't it? I bet Salvador Dalí would have found those morphing body parts highly amusing.
Imho it's stunning, yet what is happening here is super dangerous.
These videos will be, and may already be, too realistic.
Our society is not prepared for this kind of reality "bending" media. These hyperrealistic videos will be the reason for hate and murder. Evil actors will use them to influence elections on a global scale, create cults around virtual characters, and deny the rules of physics and human reason. And yet, there is no way for a person to instantly detect that they are watching a generated video. Maybe there is now, but in a year it will be indistinguishable from a real recorded video.
Are Apple and other phone/camera makers working on ways to "sign" a video to say it's an unedited video from a camera? Does this exist now? Is it possible?
I'm thinking of simple cryptographic signing of a file, rather than embedding watermarks into the content, but that's another option.
I don't think it will solve the fake video onslaught, but it could help.
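For the simple file-signing idea, here's a minimal sketch with an Ed25519 key pair via the Python cryptography library; the file name is hypothetical, and a real provenance scheme (C2PA-style) signs structured metadata with a manufacturer-issued certificate chain rather than a bare detached signature like this:

    # Sketch of detached signing/verification of a video file with Ed25519,
    # using the `cryptography` library. A real camera would keep the private
    # key in a secure element and ship a manufacturer-certified public key.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    with open("clip.mp4", "rb") as f:  # hypothetical file name
        data = f.read()

    signature = private_key.sign(data)  # produced at capture time
    public_key.verify(signature, data)  # raises InvalidSignature if any bit changed
    print("signature verifies: file is bit-identical to what was signed")

The hard part isn't the math; it's proving the public key really belongs to a camera sensor and surfacing that to users, which is exactly what the comment below gets at.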
Cute hack showing that it's kinda useless unless the user-facing UX does a better job of actually knowing whether the certificate represents the manufacturer of the sensor (the dude just uses a self-signed cert with "Leica Camera AG" as the name). Clearly cryptography literacy is lagging behind...
https://hackaday.com/2023/11/30/falsified-photos-fooling-ado...
Nikon has had digital signature ability in some of their flagship cameras since at least 2007, and maybe before then. The feature is used by law enforcement when documenting evidence. I assume other brands also have this available for the same reasons.
Which take millions of dollars and huge teams to make. These take one bored person, a sentence, and a few minutes to go from idea to posting on social media. That difference is the entire concern.
We already have hate and murder, evil actors influencing elections on a global scale, denial of physics and reason, and cults of personality. We also already have the ability to create realistic videos - not that it matters because for many people the bar of credulity isn't realism but simply confirming their priors. We already live in a world where TikTok memes and Facebook are the primary sources around which the masses base their reality, and that shit doesn't even take effort.
The only thing this changes is not needing to pay human beings for work.
Instead of calling for regulations, the big tech companies should run big campaigns educating the public, especially boomers, that they can no longer trust images, videos, and audio on the Internet. Put paid articles and ads about this in local newspapers around the world so even the least-online people get educated about this.
Do we really want a world where we can't trust anything we see, hear, or read? Where people need to be educated not to trust their senses, the things we use to interpret reality and the world around us?
I feel this kind of hypervigilance will be mentally exhausting, and not being able to trust your primary senses will have untold psychological effects
And no big tech company would run the ads you're suggesting, because they only make money when people use the systems that deliver the untrustworthy content.
Isn't the whole point of OP that we're currently watching the barrier to generating realistic assets go from "spend months grinding Photoshop tutorials" to "type what you want into this box and wait a few minutes"?
I still don't really know why we're doing this. What is the upside? Democratising Hollywood? At the expense of... enormous catastrophic disinformation and media manipulation.
Society voted with its money. Google refrained from launching its early chatbots and image-generation tools due to perceived risks of unsafe and misleading content being generated, and got beaten to the punch in the market. Of course now they'll launch early and often; the market has spoken.
They also had a good chunk of the web's text indexed, millions of people's email sent every day, Google Scholar papers, and the massive Google Books project that digitized most books ever published, and they even discovered transformers.
Superficially impressive, but what is the actual use case of the present state of the art? It makes 10-second demos, fine. But can a producer get a second shot of the same scene and the same characters, with visual continuity? Or a third, etc.? In other words, can it be used to create a coherent movie, even a 60-second commercial, with multiple shots having continuity of faces, backgrounds, and lighting?
This quote suggests not: "maintaining complete consistency throughout complex scenes or those with complex motion, remains a challenge."
That's what I think the tech at this stage cannot do. Say you make two clips from the same prompt with a minor change, e.g.:
> a thief threatens a man with a gun, demanding his money, then fires the gun (etc add details)
> the thief runs away, while his victim slowly collapses on the sidewalk (etc same details)
Would you get the same characters, wearing identical clothing, with the same lighting and identical background details? You need all of these elements to be the same; that's what filmmakers call "continuity". I doubt that Veo or any of the generators would actually produce continuity.
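For what it's worth, the usual workaround people try is to fix the seed and condition the second clip on the last frame of the first via image-to-video. This is a rough sketch of the idea only; generate_clip is a hypothetical stand-in for whatever API you're actually calling, and even with this trick identity drift is common:

    # Hypothetical sketch: chain two clips for continuity by reusing the seed
    # and conditioning clip 2 on the last frame of clip 1. `generate_clip` is
    # a stand-in; no service discussed here is known to expose this interface.
    SEED = 15185546
    DETAILS = ("rainy city sidewalk at night, sodium streetlights, "
               "thief in a grey hoodie, victim in a brown overcoat")

    def generate_clip(prompt: str, seed: int, init_image=None):
        """Placeholder for a text/image-to-video call; returns a list of frames."""
        raise NotImplementedError("wire this up to the video model of your choice")

    clip1 = generate_clip(
        f"A thief threatens a man with a gun, demanding his money, then fires. {DETAILS}",
        seed=SEED,
    )
    clip2 = generate_clip(
        f"The thief runs away while his victim slowly collapses on the sidewalk. {DETAILS}",
        seed=SEED,
        init_image=clip1[-1],  # last frame anchors faces, clothing and lighting
    )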
Not much. Low-quality, over-saturated advertising? Short films made by untalented, lazy filmmakers?
When text prompts are the only source, creativity is absent. No craft, no art. Audiences won't gravitate towards fake crap that oozes out of AI vending machines, unrefined, artistically uncontrolled.
Imagine visiting a restaurant because you heard the chef is good. You enjoy your meal but later discover the chef has a "food generator" where he prompts the food into existence. Would you go back to that restaurant?
There's one exception. Video-to-video and image-to-video, where your own original artwork, photos, drawings and videos are the source of the generated output. Even then, it's like outsourcing production to an unpredictable third party. Good luck getting lighting and details exactly right.
I see the role of this AI gen stuff as background filler, such as populating set details or distant environments via green screen.
> Imagine visiting a restaurant because you heard the chef is good. You enjoy your meal but later discover the chef has a "food generator" where he prompts the food into existence. Would you go back to that restaurant?
That's an obvious yes from me. I liked it, and not only that, but I can reasonably assume it will be consistently good in the future, something lots of places can't do.
> Veo sample duration is 8s, VideoGen’s sample duration is 10s, and other models' durations are 5s. We show the full video duration to raters.
Could the positive result for Veo 2 mean the raters like longer videos? Why not trim Veo 2's output to 5s for a better controlled test?
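Trimming everything to a common five-second window would be easy if the raw clips were available; here's a sketch assuming ffmpeg is on PATH and the clips sit in a local directory (the paths are hypothetical):

    # Trim every clip to the same 5-second window so raters compare equal
    # durations. Assumes ffmpeg is on PATH and the clips are local mp4 files.
    import subprocess
    from pathlib import Path

    for clip in Path("clips").glob("*.mp4"):
        out = clip.with_name(clip.stem + "_5s.mp4")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(clip), "-t", "5",
             "-c:v", "libx264", "-an", str(out)],  # re-encode so the cut lands exactly at 5s
            check=True,
        )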
I'm not surprised this isn't open to the public by Google yet; there's still a huge amount of volunteer red-teaming for the public to do on other services like hailuoai.video.
P.S. The skate tricks in the final video are delightfully insane.
> I'm not surprised this isn't open to the public by Google yet,
Closed models aren't going to matter in the long run. Hunyuan and LTX both run on consumer hardware and produce videos similar in quality to Sora Turbo, yet you can train them and prompt them on anything. They fit into the open-source ecosystem, which makes building plugins and controls super easy.
Video is going to play out in a way that resembles images: Stable Diffusion- and Flux-like players will win. There might be room for one or two Midjourney-type players, but by and large the most activity happens in the open ecosystem.
Turning content blockers off does not make a difference.
oh, shit!
https://hailuoai.video/share/N9dlRd1L1o0p
This feels like a bit of a comeback as Veo 2 (subjectively) appears to be a step up from what Sora is currently able to achieve.
Some of the videos look incredibly believable though.
I think we will need the same healthy media diet.
In theory that should matter to something like Open(Closed)Ai. But who knows.
Are there other versions than the official?
> An NVIDIA GPU with CUDA support is required.
> Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
https://github.com/Tencent/HunyuanVideo
> I am getting CUDA out of memory on an Nvidia L4 with 24 GB of VRAM, even after using the bfloat16 optimization.
https://github.com/Lightricks/LTX-Video/issues/64
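For a quick sanity check before downloading weights, plain PyTorch will tell you whether a card is even in the right ballpark, and the bfloat16 cast mentioned in that issue looks like the last line below. Nothing here is specific to HunyuanVideo or LTX-Video, and the 40 GB threshold is just a rough guess:

    # Check available VRAM and illustrate the bfloat16 cast the issue refers to.
    # Plain PyTorch; nothing model-specific.
    import torch

    if not torch.cuda.is_available():
        raise SystemExit("No CUDA device found")

    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB VRAM")

    if total_gb < 40:  # rough guess, not an official requirement
        print("Expect OOM without aggressive offloading or quantization.")

    # bfloat16 roughly halves weight memory versus float32, but activations
    # during sampling can still exceed 24 GB on these video models.
    layer = torch.nn.Linear(4096, 4096)  # stand-in for the real network
    layer = layer.to(device="cuda", dtype=torch.bfloat16)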
With the YouTube corpus at their disposal, I don't see how anyone can beat Google for AI video generation.