I got access to the preview; here's what it gave me for "A pelican riding a bicycle along a coastal path overlooking a harbor" - this video has all four versions shown: https://static.simonwillison.net/static/2024/pelicans-on-bic...
Of the four, two were a pelican riding a bicycle. One was a pelican just running along the road, one was a pelican perched on a stationary bicycle, and one had the pelican wearing a weird sort of pelican bicycle helmet.
All four were better than what I got from Sora: https://simonwillison.net/2024/Dec/9/sora/
There's another important contender in the space: the Hunyuan model from Tencent.
My company (Nim) is hosting the Hunyuan model, so here's a quick test (first attempt) at "pelican riding a bycicle" via Hunyuan on Nim:
https://nim.video/explore/OGs4EM3MIpW8
I think it's as good as, if not better than, Sora / Veo.
> A whimsical pelican, adorned in oversized sunglasses and a vibrant, patterned scarf, gracefully balances on a vintage bicycle, its sleek feathers glistening in the sunlight. As it pedals joyfully down a scenic coastal path, colorful wildflowers sway gently in the breeze, and azure waves crash rhythmically against the shore. The pelican occasionally flaps its wings, adding a playful touch to its enchanting ride. In the distance, a serene sunset bathes the landscape in warm hues, while seagulls glide gracefully overhead, celebrating this delightful and lighthearted adventure of a pelican enjoying a carefree day on two wheels.
What does it produce for “A pelican riding a bicycle along a coastal path overlooking a harbor”?
Or, what do Sora and Veo produce for your verbose prompt?
Hard to say about Sora, but the video you shared is most definitely worse than Veo's.
The pelican is doing some weird flying motion, motion blur is hiding a lack of detail, the bicycle is moving fast so the background is blurred, etc. I would even say Sora is better, because I like the slow motion and detail, but it did do something very non-physical.
Veo is clearly the best in this example. It has high detail but also feels the most physically grounded among the examples.
If you'd like to replicate: the sign-up process was very easy, and I was able to run a single generation attempt without trouble. Maybe later, when I want to generate video, I'll use prompt enhancement; without it, the video appears to lose any notion of direction. Most image-generation models I'm aware of do prompt enhancement. I've seen it on Grok+Flow/Aurora and ChatGPT+DALL-E.
Prompt: A pelican riding a bicycle along a coastal path overlooking a harbor
Seed: 15185546
Resolution: 720×480
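For what it's worth, the prompt enhancement mentioned above is typically just an LLM pass that rewrites a terse prompt into something like the verbose one quoted earlier. Here's a minimal sketch, assuming an OpenAI-compatible chat API; the model name and instructions are placeholders, not what Nim, Grok, or ChatGPT actually do:

    # Minimal sketch of a prompt-enhancement pass before video generation.
    # Assumes an OpenAI-compatible chat endpoint; the model name and system
    # instructions are placeholders, not any service's actual pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def enhance_prompt(short_prompt: str) -> str:
        """Expand a terse prompt into one vivid, detailed paragraph."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's video prompt as a single vivid "
                            "paragraph covering subject, action, setting, "
                            "lighting and camera movement. Add no new subjects."},
                {"role": "user", "content": short_prompt},
            ],
        )
        return response.choices[0].message.content

    print(enhance_prompt(
        "A pelican riding a bicycle along a coastal path overlooking a harbor"))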
As long as at least one option is exactly what you asked for, throwing variations at you that don't conform to 100% of your prompt seems like it could be useful, if it gives the model leeway to improve the output in other respects.
I am surprised that the top-right one still shows a cut and switch to a different scene. I would assume that's something that could be trivially filtered out of the training data, as those discontinuities don't seem to be useful either for these short 6-second video segments or for getting an understanding of the real world.
Well yeah, if you look closely at the example videos on the site, one of them is not quite right either:
> Prompt: The sun rises slowly behind a perfectly plated breakfast scene. Thick, golden maple syrup pours in slow motion over a stack of fluffy pancakes, each one releasing a soft, warm steam cloud. A close-up of crispy bacon sizzles, sending tiny embers of golden grease into the air. [...]
In the video, the bacon is unceremoniously slapped onto the pancakes, while the prompt sounds like it was intended to be a separate shot, with the bacon still in the pan? Or, alternatively, everything described in the prompt should have been on the table at the same time?
So, yet again: AI produces impressive results, but it rarely does exactly what you wanted it to do...
Technically speaking, I'd say your expectation is definitely not laid out in the prompt, so anything goes. Believe me, I've had such requirements from users, and I, as a mere human programmer, am never quite sure what they actually want. So I take guesses just like the AI (because simply asking doesn't get you very far; you always have to show something) and take it from there. In other words, if AI works like me, I can pack my stuff already.
This tech is cute, but the only viable outcomes are going to be porn and mass-produced slop that'll be uninteresting before it's even created. Why even bother?
But I'm also seeing some genuinely creative uses of generative video - stuff I could argue has got some genuine creative validity. I am loathe to dismiss an entire technique because it is mostly used to create garbage.
We'll have to figure out how to solve the slop problem - it was already an issue before AI, so maybe this is just hastening the inevitable.
Comments like this one are so predictable and incredulous. As if the current state of the art is the final form of this technology. This is just getting started. Big facepalm.
Winning 2:1 in user preference versus Sora Turbo is impressive. It seems to have very similar limitations to Sora; for example, the leg swapping in the ice-skating video, and the way the beekeeper picking up the jar is at a very unnatural acceleration (like it pops up). Though to my eye it's maybe slightly better at emulating natural movement and physics than Sora. The blog post has slightly more info:
> at resolutions up to 4K, and extended to minutes in length.
https://blog.google/technology/google-labs/video-image-gener...
It looks like Sora is actually the worst performer in the benchmarks, with Kling being the best and the others not far behind.
Anyway, I strongly suspect that the funny meme content that seems to be the practical use case of these video generators won't be possible on either Veo or Sora, because of copyright, political correctness, famous people appearing, or other 'safety'-related reasons.
I’ve been using Kling a lot recently and been really impressed, especially by 1.5.
I was so excited to see Sora come out - only to find it has most of the same problems. And Kling seems to do better in a lot of benchmarks.
I can't quite make sense of it - what OpenAI were showing when they first launched Sora was so amazing. Was it cherry-picked? Or was it using loads more compute than what they've released?
> the jar is at a very unnatural acceleration (like it pops up).
It does pop up. Look at where his hand is relative to the jar when he grabs it versus when he stops lifting it. The hand and the jar are both moving, but the jar isn't physically attached to the grab.
What officials actually say doesn't make a difference anymore. People don't get bamboozled because of a lack of facts; people who get bamboozled are past facts.
Cracks in the system are often places where artists find the new and interesting. The leg swapping of the ice skater is mesmerizing in its own way. It would be useful to be able to direct the models in those directions.
It is great to see a limitations section. What would be even more honest is a very large list of videos generated without any cherry-picking, to judge the expected quality for the average user. Anyway, the lack of more videos suggests that there might be something wrong somewhere.
Just pretend it's a movie about a shape-shifting alien that's just trying its best at ice skating; art is subjective like that, isn't it? I bet Salvador Dalí would have found those morphing body parts highly amusing.
Imho it's stunning, yet what is happening here is super dangerous.
These videos will be, and may already be, too realistic.
Our society is not prepared for this kind of reality "bending" media. These hyperrealistic videos will be the reason for hate and murder. Evil actors will use them to influence elections on a global scale, create cults around virtual characters, and deny the rules of physics and human reason. And yet, there is no way for a person to instantly detect that they are watching a generated video. Maybe there is now, but in a year it will be indistinguishable from a real recorded video.
Are Apple and other phone/camera makers working on ways to "sign" a video to say it's an unedited video from a camera? Does this exist now? Is it possible?
I'm thinking of simple cryptographic signing of a file, rather than embedding watermarks into the content, but that's another option.
I don't think it will solve the fake video onslaught, but it could help.
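For the simple file-signing idea, here's a minimal sketch with an Ed25519 key pair via the Python cryptography library; the file name is hypothetical, and a real provenance scheme (C2PA-style) signs structured metadata with a manufacturer-issued certificate chain rather than a bare detached signature like this:

    # Sketch of detached signing/verification of a video file with Ed25519,
    # using the `cryptography` library. A real camera would keep the private
    # key in a secure element and ship a manufacturer-certified public key.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    with open("clip.mp4", "rb") as f:  # hypothetical file name
        data = f.read()

    signature = private_key.sign(data)  # produced at capture time
    public_key.verify(signature, data)  # raises InvalidSignature if any bit changed
    print("signature verifies: file is bit-identical to what was signed")

The hard part isn't the math; it's proving the public key really belongs to a camera sensor and surfacing that to users, which is exactly what the comment below gets at.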
Cute hack showing that it's kinda useless unless the user-facing UX does a better job of actually knowing whether the certificate represents the manufacturer of the sensor (the dude just uses a self-signed cert with "Leica Camera AG" as the name). Clearly cryptography literacy is lagging behind...
https://hackaday.com/2023/11/30/falsified-photos-fooling-ado...
Nikon has had digital signature ability in some of their flagship cameras since at least 2007, and maybe before then. The feature is used by law enforcement when documenting evidence. I assume other brands also have this available for the same reasons.
Which take millions of dollars and huge teams to make. These take one bored person, a sentence, and a few minutes to go from idea to posting on social media. That difference is the entire concern.
We already have hate and murder, evil actors influencing elections on a global scale, denial of physics and reason, and cults of personality. We also already have the ability to create realistic videos - not that it matters because for many people the bar of credulity isn't realism but simply confirming their priors. We already live in a world where TikTok memes and Facebook are the primary sources around which the masses base their reality, and that shit doesn't even take effort.
The only thing this changes is not needing to pay human beings for work.
Instead of calling for regulations, the big tech companies should run big campaigns educating the public, especially boomers, that they can no longer trust images, videos, and audio on the Internet. Put paid articles and ads about this in local newspapers around the world so even the least-online people get educated about this.
Do we really want a world where we can't trust anything we see, hear, or read? Where people need to be educated not to trust their senses, the things we use to interpret reality and the world around us?
I feel this kind of hypervigilance will be mentally exhausting, and not being able to trust your primary senses will have untold psychological effects
And no big tech company would run the ads you're suggesting, because they only make money when people use the systems that deliver the untrustworthy content.
Isn't the whole point of OP that we're currently watching the barrier to generating realistic assets go from "spend months grinding Photoshop tutorials" to "type what you want into this box and wait a few minutes"?
I still don't really know why we're doing this. What is the upside? Democratising Hollywood? At the expense of... enormous catastrophic disinformation and media manipulation.
Society voted with its money. Google refrained from launching its early chatbots and image-generation tools due to perceived risks of unsafe and misleading content being generated, and got beaten to the punch in the market. Of course now they'll launch early and often; the market has spoken.
They also had a good chunk of the web's text indexed, millions of people's email sent every day, Google Scholar papers, and the massive Google Books project that digitized most books ever published, and they even discovered transformers.
Superficially impressive, but what is the actual use case of the present state of the art? It makes 10-second demos, fine. But can a producer get a second shot of the same scene and the same characters, with visual continuity? Or a third, etc.? In other words, can it be used to create a coherent movie, even a 60-second commercial, with multiple shots having continuity of faces, backgrounds, and lighting?
This quote suggests not: "maintaining complete consistency throughout complex scenes or those with complex motion, remains a challenge."
That's what I think the tech at this stage cannot do. Say you make two clips from the same prompt with a minor change, e.g.:
> a thief threatens a man with a gun, demanding his money, then fires the gun (etc add details)
> the thief runs away, while his victim slowly collapses on the sidewalk (etc same details)
Would you get the same characters, wearing identical clothing, with the same lighting and identical background details? You need all of these elements to be the same; that's what filmmakers call "continuity". I doubt that Veo or any of the generators would actually produce continuity.
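For what it's worth, the usual workaround people try is to fix the seed and condition the second clip on the last frame of the first via image-to-video. This is a rough sketch of the idea only; generate_clip is a hypothetical stand-in for whatever API you're actually calling, and even with this trick identity drift is common:

    # Hypothetical sketch: chain two clips for continuity by reusing the seed
    # and conditioning clip 2 on the last frame of clip 1. `generate_clip` is
    # a stand-in; no service discussed here is known to expose this interface.
    SEED = 15185546
    DETAILS = ("rainy city sidewalk at night, sodium streetlights, "
               "thief in a grey hoodie, victim in a brown overcoat")

    def generate_clip(prompt: str, seed: int, init_image=None):
        """Placeholder for a text/image-to-video call; returns a list of frames."""
        raise NotImplementedError("wire this up to the video model of your choice")

    clip1 = generate_clip(
        f"A thief threatens a man with a gun, demanding his money, then fires. {DETAILS}",
        seed=SEED,
    )
    clip2 = generate_clip(
        f"The thief runs away while his victim slowly collapses on the sidewalk. {DETAILS}",
        seed=SEED,
        init_image=clip1[-1],  # last frame anchors faces, clothing and lighting
    )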
Not much. Low-quality, over-saturated advertising? Short films made by untalented, lazy filmmakers?
When text prompts are the only source, creativity is absent. No craft, no art. Audiences won't gravitate towards fake crap that oozes out of AI vending machines, unrefined, artistically uncontrolled.
Imagine visiting a restaurant because you heard the chef is good. You enjoy your meal but later discover the chef has a "food generator" where he prompts the food into existence. Would you go back to that restaurant?
There's one exception. Video-to-video and image-to-video, where your own original artwork, photos, drawings and videos are the source of the generated output. Even then, it's like outsourcing production to an unpredictable third party. Good luck getting lighting and details exactly right.
I see the role of this AI gen stuff as background filler, such as populating set details or distant environments via green screen.
> Imagine visiting a restaurant because you heard the chef is good. You enjoy your meal but later discover the chef has a "food generator" where he prompts the food into existence. Would you go back to that restaurant?
That's an obvious yes from me. I liked it, and not only that, but I can reasonably assume it will be consistently good in the future, something lots of places can't do.
> Veo sample duration is 8s, VideoGen’s sample duration is 10s, and other models' durations are 5s. We show the full video duration to raters.
Could the positive result for Veo 2 mean the raters like longer videos? Why not trim Veo 2's output to 5s for a better controlled test?
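Trimming everything to a common five-second window would be easy if the raw clips were available; here's a sketch assuming ffmpeg is on PATH and the clips sit in a local directory (the paths are hypothetical):

    # Trim every clip to the same 5-second window so raters compare equal
    # durations. Assumes ffmpeg is on PATH and the clips are local mp4 files.
    import subprocess
    from pathlib import Path

    for clip in Path("clips").glob("*.mp4"):
        out = clip.with_name(clip.stem + "_5s.mp4")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(clip), "-t", "5",
             "-c:v", "libx264", "-an", str(out)],  # re-encode so the cut lands exactly at 5s
            check=True,
        )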
I'm not surprised this isn't open to the public by Google yet; there's still a huge amount of volunteer red-teaming for the public to do on other services like hailuoai.video.
P.S. The skate tricks in the final video are delightfully insane.
> I'm not surprised this isn't open to the public by Google yet,
Closed models aren't going to matter in the long run. Hunyuan and LTX both run on consumer hardware and produce videos similar in quality to Sora Turbo, yet you can train them and prompt them on anything. They fit into the open-source ecosystem, which makes building plugins and controls super easy.
Video is going to play out in a way that resembles images: Stable Diffusion- and Flux-like players will win. There might be room for one or two Midjourney-type players, but by and large the most activity happens in the open ecosystem.
Turning content blockers off does not make a difference.
oh, shit!
https://hailuoai.video/share/N9dlRd1L1o0p
This feels like a bit of a comeback as Veo 2 (subjectively) appears to be a step up from what Sora is currently able to achieve.
Some of the videos look incredibly believable though.
I think we will need the same healthy media diet.
In theory that should matter to something like Open(Closed)Ai. But who knows.
Are there other versions than the official?
> An NVIDIA GPU with CUDA support is required.
> Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
https://github.com/Tencent/HunyuanVideo
> I am getting CUDA out of memory on an Nvidia L4 with 24 GB of VRAM, even after using the bfloat16 optimization.
https://github.com/Lightricks/LTX-Video/issues/64
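For a quick sanity check before downloading weights, plain PyTorch will tell you whether a card is even in the right ballpark, and the bfloat16 cast mentioned in that issue looks like the last line below. Nothing here is specific to HunyuanVideo or LTX-Video, and the 40 GB threshold is just a rough guess:

    # Check available VRAM and illustrate the bfloat16 cast the issue refers to.
    # Plain PyTorch; nothing model-specific.
    import torch

    if not torch.cuda.is_available():
        raise SystemExit("No CUDA device found")

    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB VRAM")

    if total_gb < 40:  # rough guess, not an official requirement
        print("Expect OOM without aggressive offloading or quantization.")

    # bfloat16 roughly halves weight memory versus float32, but activations
    # during sampling can still exceed 24 GB on these video models.
    layer = torch.nn.Linear(4096, 4096)  # stand-in for the real network
    layer = layer.to(device="cuda", dtype=torch.bfloat16)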
With the YouTube corpus at their disposal, I don't see how anyone can beat Google for AI video generation.