I've been testing it for several weeks. It can produce results that are truly epic, but it's still a case of rerolling the prompt a dozen times to get an image you can use. It's not God. It's definitely an enormous step though, and totally SOTA.
Is it because the model is not good enough at following the prompt, or because the prompt is unclear?
Something similar has been the case with text models. People write vague instructions and are dissatisfied when the model does not correctly guess their intentions. With image models it's even harder for the model to guess right without enough detail.
Before AI, people complained that Google was taking world class engineering talent and using it for little more than selling people ads.
But look at that example. With this new frontier of AI, that world class engineering talent can finally be put to use…for product placement. We’ve come so far.
Did you think that Google would just casually allow their business to be disrupted without using the technology to improve the business and also protecting their revenue?
Both Meta and Google have indicated that they see Generative AI as a way to vertically integrate within the ad space, disrupting marketing teams, copywriters, and other roles that monitor or improve ad performance.
Also FWIW, I would suspect that the majority of Google engineers don't work on an ad system, and probably don't even work on a profitable product line.
Another nitpick - the pink puffer jacket that got edited into the picture is not the same as the one in the reference image. It's very similar, but if I were using this model for product placement, or cared about this sort of detail, I'd definitely have issues with it.
Even in the just-photoshop-not-ai days product photos had become pretty unreliable as a means of understanding what you're buying. Of course it's much worse now.
It seems like every combination of "nano banana" is registered as a domain with their own unique UI for image generation... are these all middle actors playing credit arbitrage using a popular model name?
I'd assume they are just fake, take your money and use a different model under the hood. Because they already existed before the public release. I doubt that their backend rolled the dice on LMArena until nano-banana popped up. And that was the only way to use it until today.
Completely agree - I make logos for my github projects for fun, and the last time I tried SOTA image generation for logos, it was consistently ignoring instructions and not doing anything close to what I was asking for. Google's new release today did it near flawlessly, exactly how I wanted it, in a single prompt. A couple more prompts for tweaking (centering it, rotating it slightly) got it perfect. This is awesome.
Regardless, it seems Google is on the frontier of every type of model, plus robotics and self-driving cars. It’s nutty how we forget what an intellectual juggernaut they are.
I wonder what the creative workflow will look like when these kinds of models are natively integrated into digital image tools. Imagine fine-grained controls on each layer and its composition, with semantic understanding of the full picture.
Before a model is announced, they use codenames on the arenas. If you look online, you can see people posting about new secret models and people trying to guess whose model it is.
“Nano banana” is probably good, given its score on the leaderboard, but the examples you show don't seem particularly impressive, it looks like what Flux Kontext or Qwen Image do well already.
I'm confused as well, I thought gpt-image could already do most of these things, but I guess the key difference is that gpt-image is not good for single point edits. In terms of "wow" factor it doesn't feel as big as gpt 3->4 though, since it sure _felt_ like models could already do this.
I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana).
This model gets 8 of the 12 prompts correct and easily comes within striking distance of the best-in-class models Imagen and gpt-image-1 and is a significant upgrade over the old Gemini Flash 2.0 model. The reigning champ, gpt-image-1, only manages to edge out Flash 2.5 on the maze and 9-pointed star.
What's honestly most astonishing to me is how long gpt-image-1 has remained at the top of the class - closing in on half a year which is basically a lifetime in this field. Though fair warning, gpt-image-1 is borderline useless as an "editor" since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.
Why do Hunyuan, OpenAI 4o and Qwen get a pass for the octopus test? They don't cover "each tentacle", just some. And Midjourney covers 9 of 8 arms with sock puppets.
Good point. I probably need to adjust the success pass ratios to be a bit stricter, especially as the models get better.
> midjourney covers 9 of 8 arms with sock puppets.
Midjourney is shown as a fail so I'm not sure what your point is. And those don't even look remotely close to sock puppets, they resemble stockings at best.
What's interesting is that Imagen 4 and Gemini 2.5 Flash Image look suspiciously similar in several of these test cases. Maybe Gemini 2.5 Flash first calls Imagen in the background to get a detailed baseline image (diffusion models are good at this) and then Gemini edits the resulting image for better prompt adherence.
Yes, I saw on Reddit that an employee confirmed this is the case (at least in the Gemini app): a request for an image from scratch is routed to Imagen, and the follow-up edits are done using Gemini.
This is incredibly useful! I was manually generating my own model comparisons last night, so great to see this :)
I will note that, personally, while adherence is a useful measure, it does miss some of the qualitative differences between models. For your "spheron" test for example, you note that "4o absolutely dominated this test," but the image exhibits all the hallmarks of a ChatGPT-generated image that I personally dislike (yellow, with veiny, almost impasto brush strokes). I have stopped using ChatGPT for image generation altogether because I find the style so awful. I wonder what objective measures one could track for "style"?
It reminds me a bit of ChatGPT vs Claude for software development... Regardless of how each scores on benchmarks, Claude has been a clear winner in terms of actual results.
Yeah - unfortunately the ubiquitous "piss filter" strikes again. You pretty much have to pass GPT-image-1 through a tone map, LUT, etc. in something like Krita or Photoshop to try to mitigate this. I'm honestly a bit surprised that they haven't built this in already given how obvious the color shift is.
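For what it's worth, even a quick gray-world correction in Python takes the edge off the yellow cast before you bother with a proper LUT. A rough sketch with numpy/Pillow; the filenames are placeholders:

    import numpy as np
    from PIL import Image

    # Load the gpt-image-1 output (placeholder filename)
    img = np.asarray(Image.open("gpt_image_output.png").convert("RGB"), dtype=np.float64)

    # Gray-world assumption: scale each channel so its mean matches the global mean,
    # which pulls the yellow cast back toward neutral
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / channel_means
    balanced = np.clip(img * gains, 0, 255).astype(np.uint8)

    Image.fromarray(balanced).save("balanced.png")

It's crude compared to a hand-tuned tone map, but it's scriptable over a whole batch of outputs.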
> Though fair warning, gpt-image-1 is borderline useless as an "editor" since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.
Came into this thread looking for this post. It's a great way to compare prompt adherence across models. Have you considered adding editing capabilities in a similar way given the recent trend of inpainting-style prompting?
Adding a separate section for image editing capabilities is a great idea.
I've done some experimentation with Qwen and Kontext and been pretty impressed, but it would be nice to see some side by sides now that we have essentially three models that are capable of highly localized in-painting without affecting the rest of the image.
Unfortunately, it suffers from the same safetyism as many other releases. Half of the prompts get rejected. How can you have character consistency if the model is forbidden from editing any human? Most of my photo editing involves humans, so for me this is basically a useless product. I get that Google doesn't want to be responsible for deepfake advances, but that seems inevitable, so this just slightly delays progress. Eventually we will have to face it and allow society to adapt.
This trend of tools that point a finger at you and set guardrails is quite frustrating. We might need a new OSS movement to regain our freedom.
I have an old photo of my girlfriend with her cousin when they were young, wearing Christmas dresses in front of the tree, not long before they were separated to opposite sides of the world for what is now decades. The photo is low quality to begin with, on top of being physically beat up.
There are reddit communities (I admittedly don't remember which, but could probably be found from a simple search) where people will offer their photo editing skills to touch up the photo, often for free. Could be worth trying a real human if the robots are going full HAL 9000 and telling you they can't do it.
If you are not personally offended by looking at CRAZY pornography, you could start digging into the comfyui ecosystem. It's not all porn, there are lots of pro photo-manipulators doing sfw stuff, but the community overlap with NSFW is basically borderless, so you'll probably bump into it.
However, the results the comfyui people get are lightyears ahead of any oneshot-prompt model. Either you can find someone to do cleanup for you (should be trivial, I wouldn't pay more than $10-15) or if you have good specs for inference you could learn to do it yourself.
Open source models like Flux Kontext or Qwen image edit wouldn't refuse, but you need to either have a sufficiently strong GPU or get one in the cloud (not difficult nor expensive with services like runpod), then set up your own processing pipeline (again, not too difficult if you use ComfyUI). Results won't be SOTA, but they shouldn't be too far off.
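If you'd rather skip ComfyUI entirely, a minimal Flux Kontext edit via diffusers looks roughly like this. This is just a sketch: I'm assuming the FluxKontextPipeline interface and the FLUX.1-Kontext-dev checkpoint (gated on Hugging Face), and you'll want a beefy GPU or quantization:

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image

    # Assumes a recent diffusers with Kontext support
    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    source = load_image("old_photo.jpg")  # placeholder input path
    result = pipe(
        image=source,
        prompt="Restore this damaged photo: remove scratches and dust, fix the faded colors, keep the faces unchanged",
        guidance_scale=2.5,
    ).images[0]
    result.save("restored.png")

ComfyUI still wins once you want masks, multiple passes, or upscaling stages chained together, but for one-off edits a script like this is enough.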
I've done ~20 prompts and not had one rejected so far. What sort of things are you asking it to do? I've tried things like changing clothing and accessories on people.
Basic things like: "{uploaded image of a man} can you remove the glasses?" or "make everyone in the picture smile" or "open the eyes of everyone in the photo". Nothing that a human would consider "unsafe". I am based in EU and using Google AI Studio with all safety toggles set to "Off".
I was using Veo two days ago when video generations were free. I removed all words that sounded even remotely bad, but it still refused. I eventually gave up, but now I'm thinking it's because I tried to generate a video of myself.
There is one thing Gemini 2.5 Flash Image can do that no other edit model can do: incorporate multiple images simultaneously without shenanigans, thanks to its multimodality. E.g. for Flux Kontext, if you want to "put the person in the first image into the second image", you have to concatenate them pre-VAE, which can be unwieldy, but this model doesn't have that issue. You can even incorporate more than two images, but that may cause too much chaos.
In quick testing, prompt adherence does appear to be much better for massive prompts, and the syntactic sugar does appear to be more effective. There are also other tricks not covered here which I suspect may allow more control, but I'm still testing.
Given that generations are at the same price as its competitors, this model will shake things up.
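For anyone who wants to try the multi-image trick through the API, the call is roughly this with the google-genai Python SDK. Just a sketch: the preview model name is what it was called at time of writing, and the file paths are placeholders:

    from google import genai
    from PIL import Image

    client = genai.Client()  # expects GEMINI_API_KEY in the environment

    person = Image.open("person.jpg")  # placeholder inputs
    scene = Image.open("scene.jpg")

    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=["Put the person from the first image into the second image.", person, scene],
    )

    # The response interleaves text and image parts; save any returned image bytes
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("composite.png", "wb") as f:
                f.write(part.inline_data.data)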
I very much enjoy this feature. My next door neighbor is on vacation, and I'm feeding his fish for him. I took a picture of the fish tank and asked Gemini to put the fish tank at various local tourist attractions in my city, as if we're going on day trips.
I send him one photo a day and he's been loving it. Just a fun little thing to put a smile on his face (and mine).
Fun fact - I trained a LoRA on SDXL of our then-almost-toddler and generated images of her doing dangerous things to send to my wife on the first day she had a trip away from us.
It was all fun and games until the little shit crawled out of our doggy door for the first and only time when I was going to the bathroom. As I was looking for her I got a notification we were in a tornado warning.
Luckily the dog knew where she had gone and led me to her; she had crawled down our (3-step) deck and across our yard, and was standing there looking up at the angry clouds.
It can't put two images of people together into one photo; this model still has that issue. Also, I have seen cases where Flux Kontext works better at things like removing objects.
I digitised our family photos, but a lot of them are damaged (shifted colours, spills, fingerprints on film, spots) in ways that are difficult to correct across so many images. I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing details, especially faces. This looks very good at restoring images without altering details or adding them where they are missing, so it might finally be time.
All of the defects you have listed can be automatically fixed by using a film scanner with ICE and software that automatically performs the scan and the restoration, like Vuescan. Feeding hundreds (thousands?) of photos to an experimental proprietary cloud AI that will give you back subpar compressed pictures with who knows how many strange artifacts seems unnecessary.
I scanned everything into 48-bit RAW and treat those as the originals, including the IR scan for ICE and a lower quality scan of the metadata. The problem is sharing them - important images I manually repair and export as JPEG, which is time consuming (15-30 minutes per image, and there are about 14000 total), so if it's "generic family gathering picture #8228" I would rather let AI repair it, assuming it doesn't butcher faces and other important details. Until then I made a script that exports the raws with basic cropping and colour correction, but it can't fix the colours, which is the biggest issue.
Vuescan is terrible. SilverFast has better defaults. But nothing beats the original Nikon Scan software when using ICE: it does a great job of removing dust, fingerprints, etc., even when you zoom in. Compare that with what iSRD does in SilverFast: if you zoom in on the two, iSRD kind of smooshes/blurs the infrared defects, whereas Nikon Scan clones from the surrounding area, which usually looks very good up close.
Both Silverfast and Nikon Scan methods look great when zoomed out.
I never tried Vuescan's infrared option. I just felt the positive colors it produced looks wrong/"dead".
> I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing details, especially faces.
I've been waiting for that, too. But I'm also not interested in feeding my entire extended family's visual history into Google for it to monetize. It's wrong for me to violate their privacy that way, and it's also creepy to me.
Am I correct to worry that any pictures I send into this system will be used for "training?" Is my concern overblown, or should I keep waiting for AI on local hardware to get better?
You're looking for Flux Kontext, a model you can run yourself offline on a high end consumer GPU. Performance and accuracy are okay, not groundbreaking, but probably enough for many needs.
I don't really understand the point of this usecase. Like, can't you also imagine what the photos might look like without the damage? Same with AI upscaling in phone cameras... if I want a hypothetical idea of what something in the distance might look like, I can just... imagine it?
I think we will eventually have AI based tools that are just doing what a skilled human user would do in Photoshop, via tool-use. This would make sense to me. But just having AI generate a new image with imagined details just seems like waste of time.
Do you happen to know of some software to repair/improve video files? I'm in the process of digitizing a couple of Video 2000 and VHS cassettes of childhood memories for my mom, who is starting to suffer from dementia. I have a pretty streamlined setup for digitizing the videos, but I'd like to improve the quality a bit.
I tried a dozen or so images. For some it definitely failed (altering details, leaving damage behind, needing a second attempt to get a better result) but on others it did great. With a human in the loop approving the AI version or marking it for manual correction I think it would save a lot of time.
Sure, I could manually correct that quite easily and would do a better job, but that image is not important to us, it would just be nicer to have it than not.
I'll probably wait for the next version of this model before committing to doing it, but it's exciting that we're almost there.
Another question/concern for me: if I restore an old picture of my Gramma, will my Gramma (or a Gramma that looks strikingly similar) ever pop up on other people's "give me a random Gramma" prompts?
I can imagine an automated blackmail bot that scrapes image, video, and voice samples from anyone with even the most meagre online presence, creates high-resolution videos of that person doing the most horrid acts, then threatens to share those videos with that person's family, friends and business contacts unless they are paid $5000 in cryptocurrency to an anonymous address.
And further, I can imagine some person who actually has such footage of themselves being threatened with release, then using the former narrative as a cover story if it were released. Is there anything preventing AI-generated images, video, etc. from always being detectable by software that can intuit whether something is AI? What if random noise is added - would the "is AI" signal persist just as much as the indication to a human that the footage seems real?
I’m more bullish on cryptographic receipts than on AI detectors. Capture signing (C2PA) plus an identity bind could give verifiable origin. The hard parts, in my view, are adoption and platform plumbing.
If we had a trustworthy way to verify proof-of-human-made content, then anything missing those credentials would be a red flag.
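As a toy illustration of the "receipt" idea (not C2PA itself, just the signing primitive it builds on): the capture device signs a digest of the image bytes, and anyone can later check that signature against the maker's public key. Names and files here are hypothetical:

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # Camera-side: sign a digest of the captured image
    # (in practice the key lives in secure hardware and is certified by the maker)
    device_key = Ed25519PrivateKey.generate()
    image_bytes = open("capture.jpg", "rb").read()  # placeholder file
    digest = hashlib.sha256(image_bytes).digest()
    receipt = device_key.sign(digest)

    # Verifier-side: recompute the digest and check the receipt against the device's public key
    try:
        device_key.public_key().verify(receipt, digest)
        print("provenance checks out")
    except InvalidSignature:
        print("image was altered or the receipt is bogus")

The crypto is the easy part; as said above, adoption and platform plumbing are the hard parts.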
This seems absolutely silly - it's not hard to take a photo of a photo, and there are both analog (building a lightbox) and digital (modifying the sensor input) means that would make this entirely trivial to spoof.
SynthID claims to be designed to persist through several methods of modification. I suspect such attacks you mention will happen, but by those with deep pockets. Like a nation-state actor with access to models that don't produce watermarks.
But these new amazing AI image generators let you just say "It wasn't me, it's an AI fake". Long term, they will seriously devalue blackmail material.
I read a scifi novel where they invented a wormhole that only light could pass through but it could be used as a camera that could go anywhere and eventually anytime and there was absolutely no way to block it. So some people adapted to this fact by not wearing clothes anymore.
Don't know why you're being downvoted. That is the logical conclusion.
Although, there's also a chance that those "blackmail gangs" never materialize. After all, you could already ten years ago pay cheap labor to create reasonably good fake images using Photoshop.
I tried to reproduce the fork/spaghetti example and the fashion bubble example, and neither looks anything like what they present. The outputs are very consistent, too. I am copying/pasting the images out of the advertisement page so they may be lower resolution than the original inputs, but otherwise I'm using the same prompts and getting a wildly different result.
It does look like I'm using the new model, though. I'm getting image editing results that are well beyond what the old stuff was capable of.
The output consistency is interesting. I just went through half a dozen generations of my standard image model challenge (to date I have yet to see a model that can render piano keyboard octaves correctly, and Gemini 2.5 Flash Image is no different in that regard), and as best I can tell, there are no changes at all between successive attempts: https://g.co/gemini/share/a0e1e264b5e9
This is in stark contrast to ChatGPT, where an edit prompt typically yields both requested and unrequested changes to the image; here it seems to be neither.
Flash 2.0 Image had the same issue: it does better than gpt-image for maintaining consistency in edits, but that also introduces a gap where sometimes it gets "locked in" on a particular reference image and will struggle to make changes to it.
In some cases you'll pass in multiple images + a prompt and get back something that's almost visually indistinguishable from just one of the images and nothing from the prompt.
Wildly different and subjectively less "presentable", to be clear. The fashion bubble just generates a vague bubble shape with the subject inside it, instead of the "subject flying through the sky inside a bubble" presented on the site. The other case just adds the fork to the bowl of spaghetti. Both are reproducible.
Arguably they follow the prompt better than what Google is showing off, but at the same time look less impressive.
Are there models whose vector space includes ideas - not just words/media, but not entirely corporeal aspects?
So when generating a video of someone playing a keyboard the model would incorporate the idea of repeating groups of 8 tones, which is a fixed ideational aspect which might not be strongly represented in words adjacent to "piano".
It seems like models need help with knowing what should be static, or homomorphic, across or within images associated with the same word vectors and that words alone don't provide a strong enough basis [*1] for this.
*1 - it's so hard to find non-conflicting words, obviously I don't mean basis as in basis vectors, though there is some weak analogy.
Interesting! I feel like that's maybe similar to the business of being able to correctly generate images of text— it looks like the idea of a keyboard to a non-musician, but is immediately wrong to someone who is actually familiar with it at all.
I wonder if the bot is forced to generate something new— certainly for a prompt like that it would be acceptable to just pick the first result off a google image search and be like "there, there's your picture of a piano keyboard".
Anything that is heavily periodic can definitely trip up image gen - that said, I just used Flux Kontext T2I and got pretty close (disregard the hammers though, since that's a right mess). Only towards the upper register did it start to make mistakes.
I guess the vast majority of images have the palms the other way, that this biases the output. It's like how we misinterpret images to generate optical illusions, because we're expecting valid 3D structures (Escher's staircases, say).
Just search nano banana on Twitter to see the crazy results. An example. https://x.com/D_studioproject/status/1958019251178267111
Still needs more RLHF tuning I guess? As the previous version was even worse.
No it's not.
We've had rich editing capabilities since gpt-image-1; this is just faster and looks better than its (endearingly?) named "piss filter".
Flux Kontext, SeedEdit, and Qwen Edit are all also image editing models that are robustly capable. Qwen Edit especially.
Flux Kontext and Qwen are also possible to fine tune and run locally.
Qwen (and its video gen sister Wan) are also Apache licensed. It's hard not to cheer Alibaba on given how open they are compared to their competitors.
We've left the days of Dall-E, Stable Diffusion, and Midjourney of "prompt-only" text to image generation.
It's also looking like tools like ComfyUI are less and less necessary as those capabilities are moving into the model layer itself.
Gpt4 isn't "fundamentally different" from gpt3.5. It's just better. That's the exact point the parent commenter was trying to make.
It's not even close. https://twitter.com/fareszr/status/1960436757822103721
https://genai-showdown.specr.net
Comparison of gpt-image-1, flash, and imagen.
https://genai-showdown.specr.net?models=OPENAI_4O%2CIMAGEN_4...
https://mordenstar.com/blog/edits-with-kontext
Do you know of any similar sites that compare how well the various models can adhere to a style guide? Perhaps you could add this?
I.e. provide the model with a collection of drawings in a single style, then have it follow prompts and generate images in the same style?
For example if you wanted to illustrate a book, and have all the illustrations look like they were from the same artists.
It's basically a necessity if you're working on something like a game or comic where you need consistency around characters, sprites, etc.
So far no model is willing to clean it up :/
Keep in mind no editing model is magic and if the pixels just aren’t there for their faces it’s essentially going to be making stuff up.
If you leave it to the imagination, it's likely everyone imagines something different.
In my eyes, one specific example they show (“Prompt: Restore photo”) deeply AI-ifies the woman’s face. Sure it’ll improve over time of course.
This is the first image I tried:
https://i.imgur.com/MXgthty.jpeg (before)
https://i.imgur.com/Y5lGcnx.png (after)
https://iptc.org/news/googles-pixel-10-phone-supports-c2pa-u...
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
for instance:
https://aistudio.google.com/app/prompts/1gTG-D92MyzSKaKUeBu2...
https://imgur.com/a/fyX42my
https://imgur.com/a/H9gH3Zy