The first thing I will do when I get access to this is ask it to generate a realistic chess board. I have never gotten a decent-looking chessboard out of any image generator: no deformed pieces, the correct number of squares, squares properly in a checkerboard pattern, pieces placed in the correct positions, the board oriented properly (white on the right!), and not an otherwise illegal position. It seems to be an "AI complete" problem.
Similarly the Veo example of the northern lights is a really interesting one. That's not what the northern lights look like to the naked eye - they're actually pretty grey. The really bright greens and even the reds really only come out when you take a photo of them with a camera. Of course the model couldn't know that because, well, it only gets trained on photos. Gets really existential - simulacra energy - maybe another good AI Turing test, for now.
Human eyes are basically black and white in low light since rod cells can't detect color. But when the northern lights are bright enough you can definitely see the colors.
The fact that some things are too dark to be seen by humans but can be captured accurately with cameras doesn't mean that the camera, or the AI, is "making things up" or whatever.
Finally, nobody wants to see a video or a photo of a dark, gray, and barely visible aurora.
I can see what you mean, and the video is somewhat unlike the real thing. I have lived in northern Norway most of my life and watched auroras a lot. They certainly look green and pink most of the time. Fainter ones would perhaps appear gray, I guess? Red, when viewed from a more southern viewpoint.
I work at Andøya Space, where perhaps most of the space research on aurora has been done by sending scientific rockets into space for the last 60 years.
That's not true. They look grey when they aren't bright enough, but they can look green or red to the naked eye if they are bright. I have seen it myself, and yes, I was disappointed to see only grey ones last week.
To be fair, the prompt isn’t asking for a realistic interpretation; it’s asking for a timelapse. What it’s generated is absolutely what most timelapses look like.
> Prompt: Timelapse of the northern lights dancing across the Arctic sky, stars twinkling, snow-covered landscape
That doesn't seem in any way useful, though... To use a very blunt analogy, are color blind people intelligent/sentient/whatever? Obviously, yes: differences in perceptual apparatus aren't useful indicators of intelligence.
For decades, game engines have been working on realistic rendering, bumping up quality here and there.
The gold standard for rendering has always been cameras. It’s always photo-realistic rendering. Maybe this won’t be true for VR, but so far most effort is to be as good as video, not as good as the human eye.
Any sort of video generation AI is likely to have the same goal. Be as good as top notch cameras, not as eyes.
What struck me about the northern lights video was that it showed the Milky Way crossing the sky behind the northern lights. That bright part of the Milky Way is visible in the southern sky, but the aurora hugging the horizon like that indicates the viewer is looking north. (Swap directions for the southern hemisphere and the aurora australis.)
That's a bad example, since the only images of aurora borealis are brightly colored ones. What I expect of an image generator is to output what is expected of it.
Ha, wow, I’d never seen this one before. The failures are pretty great. Even repeatedly trying to correct ChatGPT/Dall-e with the proper number of squares and pieces, it somehow makes it worse.
This is what dall-e came up with after trying to correct many previous iterations: https://imgur.com/Ss4TwNC
As someone who criticizes AI a lot: this actually looks pretty cool! AI is not better at surrealism than a good artist, but at least its work is enjoyable as surreal art. It justifies the name Dall-e pretty well, too.
This strikes me as equally "AI complete" as drawing hands, which is now essentially a solved problem... No one test is sufficient, because you can add enough training data to address it.
Tiring, but so is the relentless over-marketing. Each new demo implies new use cases and flexible performance. But the reality is they're very brittle and blunder most seemingly simple tasks. I would personally love an ongoing breakdown of the key weaknesses. I often wonder "can it X?" The answer is almost always "almost, but not a useful almost".
Most generative AI will struggle when given a task that requires exactness rather than something more or less right. They're probably pretty good at making something "chessish".
Conventionally this term means the opposite -- problems that AI unlocks that conventional computing could not do. Conventional computing can render a very wide range of different stylized chess boards, but when an ML technique like diffusion is applied to this mundane problem, it falls apart.
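To make the contrast concrete, here's a minimal sketch of the conventional approach, using the python-chess library (assuming it's available, e.g. pip install chess) to render a board that is guaranteed to have 64 squares in a proper checkerboard, correct orientation, and a legal position:

    # Render a legal chess position deterministically -- no diffusion involved.
    # Requires: pip install chess
    import chess
    import chess.svg

    board = chess.Board()      # the standard starting position, always legal
    board.push_san("e4")       # python-chess rejects illegal moves outright
    with open("board.svg", "w") as f:
        f.write(chess.svg.board(board))   # a correct 8x8 board, every time

Diffusion models have no such hard constraints; every square count and piece shape is merely statistically likely, which is exactly why they fall apart on this mundane problem.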
Mine is generating any actual IBM PC/XT computer. Either the training sets didn't include actual IBM PCs, or they labeled all PC compatibles "IBM PC". Whatever the reason, no generative AI today, whether commercial or open-source, can generate a picture of an IBM PC 5150. Once that situation improves, I'll start taking notice.
I would like a bit more convincing that the text watermark will not be noticeable. AI text already has issues with using certain words too frequently. Messing with the weights seems like it might make that issue worse.
Not to mention, when does it get applied? If I'm asking an LLM to transform some data from one format to another, I don't expect any changes other than the format.
It seems really clever, especially the encoding of a signature into LLM token-probability selections. I wonder if SynthID will trigger some standardization in the industry. I don't think there's much incentive to, though; open-source gen AI will still exist. What does Google expect to happen? I guess they're just trying to present themselves as 'ethically pursuing AI'.
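For the curious, a generic sketch of the "green list" style of token watermark from the published research (this is NOT Google's actual SynthID scheme, whose details I'm not claiming to reproduce) boils down to something like this:

    # Toy statistical text watermark: bias token sampling with a keyed hash,
    # then detect the bias later without visibly changing the text.
    import hashlib

    SECRET_KEY = b"watermark-demo"

    def is_green(prev_token: int, token: int) -> bool:
        # The key plus the previous token pseudo-randomly splits the
        # vocabulary into two halves.
        h = hashlib.sha256(SECRET_KEY + prev_token.to_bytes(4, "big")
                           + token.to_bytes(4, "big")).digest()
        return h[0] % 2 == 0

    def bias_logits(prev_token: int, logits: list[float], delta: float = 2.0) -> list[float]:
        # At generation time: nudge "green" tokens up by delta before sampling.
        return [l + delta if is_green(prev_token, t) else l
                for t, l in enumerate(logits)]

    def green_fraction(tokens: list[int]) -> float:
        # At detection time: unwatermarked text scores ~0.5; watermarked, higher.
        hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
        return hits / max(1, len(tokens) - 1)

Because the nudge only reshuffles probability among already-plausible tokens, the text stays fluent, but a detector holding the key can measure the bias over a few hundred tokens.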
From a filmmaking standpoint I still don't think this is impactful.
For that it needs a "director" to say: "turn the horse's head 90˚ the other way, trot 20 feet, and dismount the rider" and "give me additional camera angles" of the same scene.
Otherwise this is mostly b-roll content.
That sounds actively harmful. Often we want storyboards to be less specific, so as not to have some non-artist decision maker ask why the result doesn't look like the storyboard.
And when we want it to match exactly in an animatic or whatever, it needs to be far more precise than this, matching real locations etc.
Perhaps the only industries that immediately benefit from this are short ads and perhaps TikTok. But even then it is very dubious, as people seem to actually enjoy being the directors of their own thing themselves, not handing it off to somebody else.
Maybe this works for ads for a döner place or a shisha bar in some developing country.
I’ve seen generated images used for menus in such places.
But I doubt serious filmmaking can be done this way. And if it can, it'd again be thanks to some smart concept on the part of humans.
Stock videos are indeed crucial, especially now that we can easily search for precisely what we need. Take, for instance, the scene at the end of 'Don't Look Up' featuring a native dance in Peru. The dancer's movements were captured from a stock video, and the falling comet was seamlessly edited in.
Now imagine having near-infinite stock videos tailored to the situation.
Stock photographers are already having issues with piracy due to very powerful AI watermark-removal tools. And I suspect the companies are using these people's content to train the models, too.
I don't think "turn the horse's head 90˚" is the right path forward. What I think is more likely and more useful is: here is a start keyframe and here is a stop keyframe (generated by text-to-image, using other things like ControlNet to control positioning etc.), and then having the AI generate the frames in between. Don't like the way it generated the in-between? Choose a keyframe, adjust it, and rerun with the segment before and the segment after.
This appeals to me because it feels auditable and controllable... But at the pace these things have been progressing the last 3 years, I could imagine the tech leapfrogs all conventional understanding real soon, likely outputting Gaussian-splat-style scenes where the scene is separate from the camera and all pieces can be independently tweaked from a VR director's chair.
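To illustrate the loop I mean, here's a toy sketch where plain pixel-space interpolation stands in for whatever learned in-betweening model would really fill the gap (the function names are hypothetical):

    # Naive "tweening" between two keyframes by linear interpolation in pixel
    # space. A real system would interpolate in a learned latent space, but
    # the editing loop (spot a bad in-between, fix it, re-run) looks the same.
    import numpy as np

    def tween(start: np.ndarray, end: np.ndarray, n_frames: int) -> list[np.ndarray]:
        ts = np.linspace(0.0, 1.0, n_frames)
        return [(1.0 - t) * start + t * end for t in ts]

    # If frame k looks wrong, promote a corrected version of it to a keyframe
    # and regenerate only the two segments on either side of it.
    def refine(frames: list[np.ndarray], k: int, fixed: np.ndarray) -> list[np.ndarray]:
        left = tween(frames[0], fixed, k + 1)
        right = tween(fixed, frames[-1], len(frames) - k)
        return left + right[1:]

The design point is that each regeneration is local: you only re-run the segments touching the keyframe you adjusted, which is what makes the process auditable.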
They claim it can accept an "input video and editing command" to produce a new video output. Also, "In addition, it supports masked editing, enabling changes to specific areas of the video when you add a mask area to your video and text prompt." Not sure if that specific example would work or not.
For most things I view on the internet B-roll is great content, so I'm sure this will enable a new kind of storytelling via YouTube Shorts / Instagram, etc at minimum.
I wouldn't be so sure it's coming. NNs currently don't have the structures for long-term memory and development. These are almost certainly necessary for creating longer works with real purpose and meaning. It's possible we're on the cusp with some of the work to tame RNNs, but it's taken us years to really harness the power of transformers.
There's also the whole "oh you have no actual model/rigging/lighting/set to manipulate" for detail work issue.
That said, I personally think the solution will not be coming that soon, but at the same time, we'll be seeing a LOT more content made with current tools, even if that means a (severe) dip in quality due to the cost it might save.
This leads me to the question of why there hasn't been an effort to do this with 3D content (that I know of).
Because camera angles/lighting/collision detection/etc. at that point would be almost trivial.
I guess with the "2D only" approach that is based on actual, acquired video you get way more impressive shots.
But the obvious application is for games. Content generation in the form of modeling and animation is actually one of the biggest cost centers for most studios these days.
I think with AI content, we'd need to not treat it like expecting fine grained control. E.g. instead like "dramatic scene of rider coming down path, and dismounting horse, then looking into distance", etc. (Or even less detail eventually once a cohesive story can be generated.)
HN has always been notoriously negative, and wrong a lot of the time. One of my personal favorites is Brian Armstrong's post about an exciting new company he was starting around cryptocurrency and needing a co-founder... Always a good one to go back and read when I've been staying up late working on side projects and need a mental boost.
Yeah, I've made a lot of images, and it sure is amazing if all you're interested in is, like, "Any basically good image," but if you start needing something very particular, rather than "anything that is on a general topic and is aesthetically pleasing," it gets a lot harder.
And there are a lot more degrees of freedom to get something wrong in film than in a single still image.
I can't wait to see what the big video camera makers are going to do with tech similar to this. Since Google clearly has zero idea what to do with it, and they lack the creativity, it's up to ARRI, Canon, Panasonic, etc. to create their own solutions for this tech. I can't wait to see what Canon has up its sleeve with their new offerings coming in a few months.
The videos in this demo are pretty neat. If this had been announced just four months ago we'd all be very impressed by the capabilities.
The problem is that these video clips are very unimpressive compared to the Sora demonstration which came out three months ago. If this demo was announced by some scrappy startup it would be worth taking note. Coming from Google, the inventor of the Transformer and owner of the largest collection of videos in the world, these sample videos are underwhelming.
Having said that, Sora isn't publicly available yet, and maybe Veo will have more to offer than what we see in those short clips when it gets a full release.
They didn't really do a very good job of selecting marketing examples. The only good one, that shows off creative possibilities, is the knit elephant. Everything else looks like the results of a (granted fairly advanced) search through a catalog of stock footage.
Even search, in and of itself, is incredibly amazing but fairly commoditized at this point. They should've highlighted more unique footage.
The faster the tech cycle, the faster we become accustomed to it. Look at your phone, an absolute, wondrous marvel of technology that would have been utterly and totally sci-fi just 25 years ago. Yet we take it for granted, as we do with all technology eventually. The time frames just compress, is all, for better or for worse.
On some level, it's healthy to retain a sense of humility at the technological marvels around us. Everything about our daily lives is impressive.
Just a few years ago, I would have been absolutely blown away by these demo videos. Six months ago, I would have been very impressed. Today, Google is rolling out a product that seems second best. They're playing catch-up in a game where they should be leading.
I will still be very impressed to see videos of that quality generated on consumer grade hardware. I'll also be extremely impressed if Google manages to roll out public access to this capability without major gaffes or embarrassments.
This is very cool tech, and the developers and engineers that produced it should be proud of what they've achieved. But Google's management needs to be asking itself how they've allowed themselves to be surpassed.
Honestly, if Veo becomes public faster than Sora, they could win the video AI race. But what am I wishfully thinking - it's Google we're talking about!
> But what am I wishfully thinking - it's Google we're talking about!
Google the company known to launch way too many products? What other big company launches more stuff early than them? What people complain about Google is that they launch too much and then shut them down, not that they don't launch things.
Same impression here. The scene changes very abruptly from a sky view to following the car. The cars meld with the ground frequently, and I think I saw one car drive through another at one point.
So… much… bloom. I like it, but still holy shit. I hate that I like it because I don’t want this art form to be reduced by overuse. Sadly, it’s too late.
It's also probably because it's easier to spot fake humans than fake cats or camels. We are more attuned to the faces of our own species.
That is, AI humans can look "creepy" whereas AI animals may not. The cowboy looks pretty good precisely because it's all shadow.
CGI animators can probably explain this better than I can ... they have to spend way more time on certain areas and certain motions, and all the other times it makes sense to "cheat" ...
It explains why CGI characters look a certain way too -- they have to be economical to animate
Actually, there is one in the last demo. It's not an individual clip, but one shot where a team uses this model to create a scene with a human in it: they created an image of a black woman, but only from her head up.
I would generally agree, though: it's not normal that they didn't show more humans.
Gemini still won't generate images of humans or even other hominids. They're missing here probably for the same reason. Namely that they're trying to figure out how to balance diverse representation with all the various other factors.
Not nearly as impressive as Sora. Sora was impressive because the clips were long and had lots of rapid movement since video models tend to fall apart when the movement isn't easy to predict.
By comparison, the shots here are only a few seconds long and almost all look like slow motion or slow panning shots cherrypicked because they don't have that much movement. Compare that to Sora's videos of people walking in real speed.
The only shot they had that can compare was the cyberpunk video they linked to, and it looks crazy inconsistent. Real shame.
Interesting to see that OpenAI was successful in creating their own reality distortion spells, just like Apple's reality distortion field which has fooled many of these commenters here.
It's quite early to race to the conclusion that one is better than the other when not only they are both unreleased, but especially when the demos can be edited, faked or altered to look great for optics and distortion.
EDIT: It appears there is at least one commenter who replied below that is upset with this fact above.
It is OK to cope, but the truth really doesn't care especially when the competition (Google) came out much stronger than expected with their announcements.
I believe it was clear that Air Head was an edited video.
The intention wasn't to show "This is what Sora can generate from start to end" but rather "This is what a video production team can do with Sora instead of shooting their own raw footage."
Maybe not so obvious to others, but for me it was clear from how the other demo videos looked.
> Sora was impressive because the clips were long and had lots of rapid movement
Sora videos ran at 1 beat per second, so everything in the image moved at the same beat and often too slow or too fast to keep the pace.
It is very obvious when you inspect the images and notice that there are keyframes at every whole second mark and everything on the screen suddenly goes in their next animation step.
That really limits the kind of videos you can generate.
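That claim is checkable, by the way. Here's a rough sketch with OpenCV (assuming opencv-python is installed) that measures inter-frame change and reports where the biggest jumps land; if they cluster near whole seconds, the one-beat-per-second cadence is real:

    # Rough check for a fixed animation cadence: compute mean absolute
    # inter-frame difference and report where the largest jumps occur.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture("clip.mp4")
    fps = cap.get(cv2.CAP_PROP_FPS)
    diffs, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diffs.append(np.abs(gray - prev).mean())
        prev = gray
    cap.release()

    # Timestamps (in seconds) of the ten largest jumps. If they sit near
    # whole numbers, the video is stepping on a one-second beat.
    jumps = np.argsort(diffs)[-10:]
    print(sorted((j + 1) / fps for j in jumps))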
Comparing two children is a good one. My girlfriend has taken to pointing out when I’m engaging in “punditry”. They're an engineer like I am and we talk about tech all the time, but sometimes I talk about which company is beating which company like it’s a football game, and they call me out for it.
Video models are interesting, and to some extent trying to imagine which company is gonna eat the other’s lunch is kind of interesting, but sometimes that’s all people are interested in and I can see my girlfriend's reasoning for being disinterested in such discussion.
I’m fairly certain Google just has a big stack of these in storage but never released, or the moment someone pulls ahead it’s all hands on deck to make the same thing.
Sora is also movement limited to a certain range if you look at the clips closely. Probably something like filtering by some function of optical flow in both cases.
> The shots here [..] almost all look like slow motion or slow panning shots.
I think this is arguably better than the alternative. With slow-mo generated videos, you can always speed them up in editing. It's much harder to take a fast-paced video and slow it down without terrible loss in quality.
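As a concrete example (assuming ffmpeg is installed), a 2x speed-up is a one-liner with the setpts filter, wrapped here in Python:

    # Double the playback speed of a generated slow-mo clip with ffmpeg.
    # setpts=0.5*PTS halves each frame's presentation timestamp (2x speed).
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "slow_input.mp4",
        "-filter:v", "setpts=0.5*PTS",
        "-an",                      # drop audio; handle it separately if needed
        "fast_output.mp4",
    ], check=True)

Going the other way (slowing footage down) requires inventing frames that don't exist, which is exactly why slow-mo-by-default is the safer failure mode.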
A commercially available tool that can turn still images into depth-conscious panning shots is still tremendously impactful across all sorts of industries, especially tourism and hospitality. I’m really excited to see what this can do.
Not just that, but anything with a subject in it felt uncanny-valley-ish... like that cowboy clip: the gait of the horse stood out as odd, and then I gave it some attention. It seems like a camel's gait. And the whole thing seems to be hovering, gliding rather than walking. Sora indeed seems to have an advantage.
I thought a camel's gait was more like both legs on the same side moving almost at the same time. Granted, I don't see camels often. Out of curiosity, can you explain that more?
Could also be Google's doing. If Veo screws up, the weight falls on Alphabet stock, while OpenAI is not public and doesn't have to worry about anything. Even if OpenAI faked some of their AI videos (not saying they did), it wouldn't affect them the way it would affect Veo -> Google -> Alphabet.
From a 2014 Wired article [0]:
"The average shot length of English language films has declined from about 12 seconds in 1930 to about 2.5 seconds today"
I can see more real-world impact from this (and/or Sora) than most other AI tools
This is very noticeable. Watching movies from the 1970s is positively serene for me, whereas the shot time in modern films often leaves me wondering, "wait, what just happened there?"
And I'm someone who is fine playing fast action video games. Can't imagine what it's like if you're older or have sensory processing issues.
The first time I watched The Rise of Skywalker, it was just too much being thrown at my brain. The second and third watches were much easier to process, of course. I'm a big fan of older movies and have noticed the shot-length difference anecdotally; Lawrence of Arabia and Ben-Hur are two of my favorites. So it all makes sense to me now that there's actually a measured comparison.
Shot length, yes - but the scene stays the same. Getting continuity with just prompts seems not yet figured out.
Maybe it's easy, and you feed continuity stills into the prompt. Maybe it's not, and this will always remain just a more advanced storyboarding technique.
But then again, storyboards are always less about details and more about mood, dialog, and framing.
How many of those 2.5 second "shots" are back-and-forths between two perspectives (ex. of two characters talking to one another) where each perspective is consistent with itself? This would be extremely relevant for how many seconds of consistent footage are actually needed for an AI-generated "shot" at film-level quality.
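Measuring this on actual footage is pretty approachable; here's a sketch using the PySceneDetect library (assuming pip install scenedetect[opencv]) that lists detected shot lengths; grouping visually similar shots to count the back-and-forths would be the next step:

    # List detected shot lengths for a video using PySceneDetect.
    # Requires: pip install scenedetect[opencv]
    from scenedetect import detect, ContentDetector

    scenes = detect("film_clip.mp4", ContentDetector())
    lengths = [end.get_seconds() - start.get_seconds() for start, end in scenes]
    print(f"{len(lengths)} shots, mean length "
          f"{sum(lengths) / len(lengths):.2f}s")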
On what causes the different aurora colours, see: https://theconversation.com/what-causes-the-different-colour...
I echo what some other posters here have said: they're certainly not gray.
> This is what dall-e came up with after trying to correct many previous iterations: https://imgur.com/Ss4TwNC
https://www.reddit.com/r/dalle2/comments/1afhemf/is_it_possi...
https://www.reddit.com/r/dalle2/comments/1cdks71/a_hand_with...
It seems that SynthID is not only for AI-generated video, but also for images, text, and audio.
> For that it needs a "director" to say: "turn the horse's head 90˚ the other way, trot 20 feet, and dismount the rider" [..]
I'm sure this is coming.
And let the robot tween?
Vs an imperative for "tween this by turning the horse's head left"
The Brian Armstrong post mentioned above: https://news.ycombinator.com/item?id=3754664
> Coming from Google, the inventor of the Transformer and owner of the largest collection of videos in the world, these sample videos are underwhelming.
Wow, the speed at which we become blasé is terrifying. Six months ago this was not possible, and it felt years away!
They're not underwhelming to me; they're beyond anything I thought would ever be possible.
Are you genuinely unimpressed? Or maybe just trying to play it cool?
I’ve switched to Opus from GPT-4 for coding and it was non-trivially easy
I’ll just go back to living under a rock.
The most impressive Sora demo was heavily edited.
https://www.fxguide.com/fxfeatured/actually-using-sora/
https://www.youtube.com/watch?v=KFzXwBZgB88 (posted the day after the short debuted)
https://openai.com/index/sora-first-impressions (no mention of editing, nor do they link to the above making-of video)
I think comparing them now is probably not that useful outside of this AI hype train. Like comparing two children. A lot can happen.
The bigger message I'm getting from this is that it's clear OpenAI won't have a super-AI monopoly.
It's impressive as hell though. Even if it would only be used to extrapolate existing video.
Being cautious often puts a dent in innovation.
https://www.bbc.com/news/technology-67650807
[0] https://www.wired.com/2014/09/cinema-is-evolving/
I can tell what's going on, but I always end up feeling agitated.
Just worth keeping that in mind. You could not just switch between multiple shots like you can today.