Readit News
x187463 · 5 months ago
This is a before/after moment for image generation. A simple example is the background images on a ton of (mediocre) music youtube channels. They almost all use AI generated images that are full of nonsense the closer you look. Jazz channels will feature coffee shops with garbled text on the menu and furniture blending together. I bet all of that disappears over the next few months.

On another note, and perhaps others are feeling similarly, but I am finding myself surprised at how little use I have for this stuff, LLMs included. If, ten years ago, you told me I would have access to tools like this, I'm sure I would have responded with a never ending stream of ideas and excitement. But now that they're here, I just sort of poke at it for a minute and carry on with my day.

Maybe it's the unreliability on all fronts, I don't know. I ask a lot of programming questions and appreciate some of the autocomplete in vscode, but I know I'm not anywhere close to taking full advantage of what these systems can do.

Gasp0de · 5 months ago
I love using LLMs to generate pictures. I'd call myself rather creative, but absolutely useless in any artistic craft. Now, I can just describe any image I can imagine and get 90% accurate results, which is good enough for the presentations I hold, online pet projects (created a squirrel-themed online math-learning game for which I previously would have needed a designer to create squirrel highschool themed imagery) and memes. For many, many websites this is going to be good enough.
nitwit005 · 5 months ago
> For many, many websites this is going to be good enough.

It was largely a solved problem though. Companies did not seem to have an issue with using stock photos. My current company's website is full of them.

For business use cases, those galleries were already so extensive before AI image generation, that what you wanted was almost always there. They seemingly looked at people's search queries, and added images to match previously failed queries. Even things you wouldn't think would have a photo like "man in business suit jump kicking a guy while screaming", have plenty of results.

candiddevmike · 5 months ago
My problem with finding enjoyment in this is the same problem I have when using cheat codes in games: the doing part is the fun part, getting to the end or just permutations of the end gets really boring.
munksbeer · 5 months ago
>I love using LLMs to generate pictures. I'd call myself rather creative, but absolutely useless in any artistic craft. Now, I can just describe any image I can imagine and get 90% accurate results

May I ask what you use? I'm not yet a paid subscriber to any of the models, because my company offers a corporate internal subscription chatbot and code integration that works well enough for what I've been doing so far, but has no image generation.

I have tried image generation on the free tier but run out of free use before I get anything pleasing.

What do you pay for?

__loam · 5 months ago
If you use this technology, you're actively harming creative labor.
Retr0id · 5 months ago
I've never used a stock photo site before, so I suppose it's no surprise I have no real use for "generate any image on demand".
esperent · 5 months ago
I've used stock photo sites occasionally, but I use vector art and icon sites multiple times a week. Even today, I used a few different sites while designing some stuff on Canva.

The reason I don't use AI is that it gives me far less reliable, impossible-to-specify results compared with just searching through the limited lists of human-made art.

Today, for undisclosed reasons, I needed vector art of peanuts. I found imperfect but usable human made art within seconds from a search engine. I then spent around 15 - 25 minutes trying to get something closer to my vision using ChatGPT, and using the imperfect art I'd found as a style guide. I got lots of "huh that's cool what AI can do" but nothing useful. Nothing closer to my vision than what I started with.

By coincidence it's the first time I've tried making art with AI in about a year, but back then I bought a Midjourney account and spent a month making loads of art, then installed SD on my laptop and spent another couple of weeks playing around with that. So it's not like I'm lacking experience. What I've found so far is that AI art generators are great for illustrating articles like this one. And they do make some genuinely cool pictures; it blows my mind that computers can do this now.

It's just when I sit down with a real world task that has specific, concrete requirements... I find them useless.

YurgenJurgensen · 5 months ago
Their main application appears to be taking blog posts and internal memos and making them three times longer, using ten times the bandwidth to convey no more information. So exactly the application AI is ‘good’ at.
Suppafly · 5 months ago
>so I suppose it's no surprise I have no real use for "generate any image on demand".

Other than stock photos, porn is the killer app for that, but most of the AI companies don't want to allow that.

avereveard · 5 months ago
How about removing blur from your photos, removing blocking items, denoising darks, and fixing whiteouts? Granted, it's not quite there yet for everything, but it's pretty close.
genewitch · 5 months ago
I have the Gemini app on my phone and you can interact with it with voice only and I was like oh this is really cool I can use it while I'm driving instead of listening to music.

I can never think of anything to talk to an AI about. I run LMs locally, as well.

JFingleton · 5 months ago
Have it interview you (as in a job interview) on your specialisation. It works your interview skills.

Ask it to teach you a language.

DnD works really well (the LLM being the game-master).

loudmax · 5 months ago
That is a very interesting point, about how little use most of us are making of AI day to day despite the potential utility that seems to be lurking. I think it just takes time for people and economies to adapt to new technology.

Even if technological progress on AI were to stop today, and the best models that exist in 2030 are the same models we have now, there would still be years of social and economic change as people and companies figure out how to make use of novel technology.

milanove · 5 months ago
Unless I'm doing something simple like writing out some basic shell script or python program, it's often easier to just do something myself than take the time to explain what I want to an LLM. There's something to be said about taking the time to formulate your plan in clear steps ahead of time, but for many problems it just doesn't feel like it's worth the time to write it all out.
skybrian · 5 months ago
Image generation is still very slow. If it generated many images instantly like Google’s image search, it would be a lot more fun to use, and we would learn to use it more effectively with practice.
neuroelectron · 5 months ago
Some of the image generation systems are very fast.
Suppafly · 5 months ago
>Image generation is still very slow.

Only because the free ones slow things down.

nyarlathotep_ · 5 months ago
> They almost all use AI generated images that are full of nonsense the closer you look. Jazz channels will feature coffee shops with garbled text on the menu and furniture blending together.

Noticed that.

Maybe it's my algorithm but YouTube is seemingly filled with these videos now.

UncleEntity · 5 months ago
They insist on feeding me AI generated videos about "HOA Karens" for some odd reason.

True, I do enjoy watching the LawTubers and sometimes they talk about HOAs but that is a far stretch from someone taking a reddit post and laundering it through the robots.

rasz · 5 months ago
YouTube Studio has built-in AI thumbnail functionality. Google actively encourages the use of AI for clickbait and for generating automatic AI replies to comments, à la OnlyFans, giving your viewers that feeling of interaction without you actually reading their comments.
_DeadFred_ · 5 months ago
All my music cover images are AI generated. At the same time I refuse to listen to AI music. We're all going to sink alone on this one.

What's frustrating me is that if I tell the YouTube algo 'don't recommend' on AI music video channels, it stops giving me any music video channels. That's not what I want; I just don't want the AI. They need to separate the two. But of course they need to not do that with AI cover images, because otherwise it would harm me. :)

satvikpendem · 5 months ago
It's probably your algorithm, as mine is pretty good at not showing me those low-effort channels. Check out extensions like PocketTube, SponsorBlock, and DeArrow to manage your YouTube feeds better.
card_zero · 5 months ago
I was wondering yesterday how AI is coming along for tweening animation frames. I just did a quick search and apparently last year the state of the art was garbage:

https://yosefk.com/blog/the-state-of-ai-for-hand-drawn-anima...

Maybe this multimodal thing can fix that?

GaggiX · 5 months ago
That blog post is a year old.

There has been a lot of progress since then: https://doubiiu.github.io/projects/ToonCrafter/

shostack · 5 months ago
A restaurant near me has a framed monitor that displays some animated art with a scene of a cafe on a street corner. I looked closely and realized it was AI. Chairs were melted together, text was gibberish, trees were not branching properly etc.

If a local restaurant is using this stuff we're near an inflection point of adoption.

avereveard · 5 months ago
It's hard to let them run wild because of the unreliability.

I've built a simracing tool that's about 50% AI code now; the AI mostly handles the boilerplate, accelerates prototyping, and does most of the data structure packing/unpacking needed.

It never managed to do a pit stop window prediction on its own, but it could create a reasonable class to handle tire overheating messages.

All in all, what I can say from this experiment is that it enabled me to get started, as I'm unfamiliar with pygame, and the UX is entirely maintained by AI.

Working on classes together sucks, as the AI puts in too many null checks and try/catches, making the code unreadable by humans. I much prefer making sure data is correctly initialized and updated to the huge nest of conditions LLMs produce, so I ended up with clearly defined AI and human components.

It's not perfect yet, but I can focus on more valuable things. And it's a good step up from last year, when I just used it to second-check and enrich my technical writing and to convert notes into emails.

With vision and image generation, I think we're closer to creating a feedback loop where the AI can rapidly self-correct its productions, but it remains to be seen how high the ceiling is and how far this will go.

thordenmark · 5 months ago
I find these image generators, and LLMs in general, fairly toy-like. Not useful for serious work, but you can use them for mood boards and idea generation. Kind of like random word generators when you've got writer's block. As soon as you analyze any output from these you can see they are producing nonsense. And as we now know from the recently released Claude paper, these things are far from reasoning.
dontlaugh · 5 months ago
The unreliability and inability to debug are why I think these tools are actually a liability for any serious work.
mycall · 5 months ago
> I am finding myself surprised at how little use I have for this stuff

I think this will change as more practical use cases begin to emerge as this is all brand new. For example, the photos you take with your smartphone can tell a story or be annotated so you can see things in the photos you didn't think about but your profile thinks you might. Things will get more sophisticated soon.

justlikereddit · 5 months ago
I have absolutely no use for my photos being annotated by an AI.

I have had use for LLMs and previous era image gens. I haven't got around to trying the last iterations that this article is about yet.

That use I have had of it is very esoteric, an art mostly forgotten in digital modernity: it's called "HAVING FUN", by myself, for curiosities, for sharing with friends.

That is by far the greatest usage area and severely underrated. AI for having fun, enjoyment that feels meaningful.

If you're a spam producer or scam artist, or an industrial digi-slop manufacturer or merchant of hype, or some other flavor of paid liar (journalist, influencer, spokesperson, diplomat or politician), then sure, AI will also earn you money. And the facade of these money-making enterprises will look shinier with every year that passes, but it will be all rotting organics and slop behind that veneer, as it has been since many years before my birth.

I'm in the game for the fun part, and that part is noticeably improving.

Der_Einzige · 5 months ago
That feeling of not knowing what to do with it is an example of humans being stupid. We are all victims of being "Johnny" in this paper:

https://dl.acm.org/doi/full/10.1145/3544548.3581388

otabdeveloper4 · 5 months ago
> But now that they're here, I just sort of poke at it for a minute and carry on with my day.

Well, that's because they suck, despite all the hype.

They have a use in a professional context, i.e., as replacement for older models and algorithms like BERT or TF/IDF.

But as assistants they're only good as a novelty gag.

InDubioProRubio · 5 months ago
It's because all those projects and ideas you had - where AI could do the fun work and you would be the middle manager - just became a job.

card_zero · 5 months ago
Looking at the example where the coffee table is swapped, I notice that every time the image is reprocessed it mutates, based on the previous iteration, and objects become more bizarre each time, like Chinese whispers.

* The weird-ass basket decoration on the table originally has some big chain links (maybe anchor chain, to keep the theme with the beach painting). By the third version, they're leathery and are merging with the basket.

* The candelabra light on the wall, with branch decorations, turns into a sort of skinny minimalist gold stag head, and then just a branch.

* The small table in the background gradually loses one of its three legs, and ends up defying gravity.

* The freaky green lamps in the window become at first more regular, then turn into topiary.

* Making the carpet less faded turns up the saturation on everything else, too, including the wood the table is made from.

og_kalu · 5 months ago
It's kind of clear that for every request, it generates an entirely new image. Some people are speculating a diffusion decoder, but I think it's more likely an implementation of VAR - https://arxiv.org/abs/2404.02905.

So rather than predicting each patch at the target resolution right away, it starts with the image (as patches) at a very small resolution and increasingly scales up. I guess that could make it hard for the model to learn to just copy and paste image tokens for editing like it might for text.
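For intuition, that coarse-to-fine scheme could be caricatured as a loop over resolutions. This is only a toy sketch of the idea in the VAR paper, not OpenAI's actual pipeline: the scale schedule is made up, and a random residual stands in for the learned autoregressive model.

```python
import numpy as np

def next_scale_generation(scales=(1, 2, 4, 8), patch_dim=3, seed=0):
    """Toy sketch of VAR-style next-scale prediction: instead of emitting
    patches left-to-right at full resolution, a whole token map is emitted
    at each resolution, conditioned on the upsampled coarser maps."""
    rng = np.random.default_rng(seed)
    history = []  # token maps produced so far, coarsest first
    for s in scales:
        if history:
            # Condition on the previous scale via nearest-neighbour upsampling.
            prev = history[-1]
            factor = s // prev.shape[0]
            context = np.repeat(np.repeat(prev, factor, axis=0), factor, axis=1)
        else:
            context = np.zeros((s, s, patch_dim))
        # Stand-in for the model: predict a residual refinement of the
        # upsampled context (a real model would sample token ids here).
        refinement = rng.normal(scale=1.0 / s, size=(s, s, patch_dim))
        history.append(context + refinement)
    return history

maps = next_scale_generation()
```

Because each scale is regenerated on top of an upsampled version of the last, there is no step where existing full-resolution tokens are simply copied through, which would fit the observation that edits subtly change everything.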

flkiwi · 5 months ago
BUT it's doing a stunningly better job replicating previous scenes than it did before. I asked it just now for a selfie of two biker buddies on a Nevada highway, but one is a quokka and one is a hyrax. It did it. Then I asked for the same photo with late afternoon lighting, and it did a pretty amazing job of preserving the context where just a few months ago it would have had no idea what it had done before.

Also, sweet jesus, after more than a year of hilarious frustration, it now knows that a flying squirrel is a real animal and not just a tree squirrel with butterfly wings.

M4v3R · 5 months ago
Yeah, this is in my opinion the biggest limitation of the current gen GPT 4o image generation: it is incapable of editing only parts of an image. I assume what it does every time is tokenizing the source image, then transforming it according to the prompt and then giving you the final result. For some use cases that’s fine but if you really just want a small edit while keeping the rest of the image intact you’re out of luck.
atommclain · 5 months ago
I thought the selection tool allows you to limit the area of the image that a revision will make changes to, but I tested it and I still see changes outside of the selected area which is good to know.

As an example the tape spindles, among other changes, are different: https://chatgpt.com/share/67f53965-9480-800a-a166-a6c1faa87c...

https://help.openai.com/en/articles/9055440-editing-your-ima...

danielbln · 5 months ago
It just means that you comp it together manually. That's still much better than having to set up some inpainting pipeline or whatever.
iandanforth · 5 months ago
Fwiw pixlr is a good pairing with GPT 4o for just this. Generate with 4o then use pixlr AI tools to edit bits. Especially for removals pixlr (and I'm sure others) are much much faster and quite reliable.
bla3 · 5 months ago
The pictures on the wall change too.
rob74 · 5 months ago
Actually, almost everything changes slightly - the number, shape and pattern of the chairs, the number and pattern of the pillows, the pattern of the curtains, the scene outside the window, the wooden part of the table, the pattern of the carpet... The blue couch stays largely the same, it just loses some detail...
card_zero · 5 months ago
Yes, first a still life and something impressionist, then a blob and a blob, then a smear and a smear. And what about the reflections and transparency of the glass table top? It gets very indistinct. Keep working at the same image and it looks like you'll end up with some Deep Dream weirdness.

I think the fireplace might be turning into some tiny stairs leading down. :)

empath75 · 5 months ago
The vast majority of people wouldn't notice any of that in most contexts in which such an image would be used.
nowittyusername · 5 months ago
There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot, but is a workflow run by an agentic system. Meaning this: the user inputs the prompt "create an image with no elephants in the room" > the prompt goes to an LLM which preprocesses the human prompt > it outputs a prompt that it knows works well within this image generator > and that LLM-processed prompt is sent to the image generator. The same happens with edits, only more complicated: function-calling tools are involved, with many layers of edits being done behind the scenes. Try it yourself: take an image, send it in, and have 4o edit it for you in some way, then ask it to edit again, and again, and so on. You will notice a sepia filter being applied on every edit, and the image ends up more and more sepia-toned with each edit. This is because one of the steps in the workflow is naively applied without consideration for multi-edit use. If this were a one-shot solution, where editing is done within the 4o image model by itself, the sepia problem wouldn't be there.
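Purely as an illustration of this hypothesis (every function and name below is invented; this is a caricature of the speculated workflow, not OpenAI's actual system), the compounding-filter effect can be modeled as a pipeline that re-applies a post-processing step on every pass:

```python
from dataclasses import dataclass

@dataclass
class Image:
    description: str
    sepia_passes: int = 0  # how many times the post-processing filter has run

def rewrite_prompt(user_prompt: str) -> str:
    # Stand-in for the LLM preprocessing step.
    return f"photorealistic, detailed: {user_prompt}"

def generate(prompt: str) -> Image:
    # Stand-in for the image generator.
    return Image(description=prompt)

def apply_house_filter(img: Image) -> Image:
    # The hypothesized naive post-processing step (e.g. a sepia grade)
    # that runs on every pass through the workflow.
    return Image(img.description, img.sepia_passes + 1)

def edit(img: Image, instruction: str) -> Image:
    # Each edit re-enters the whole workflow, so the filter compounds.
    new = generate(rewrite_prompt(f"{img.description}; {instruction}"))
    new.sepia_passes = img.sepia_passes
    return apply_house_filter(new)

img = apply_house_filter(generate(rewrite_prompt("a room with no elephants")))
for i in range(3):
    img = edit(img, f"edit {i}")
print(img.sepia_passes)  # 4: once at creation, plus once per edit
```

A one-shot model with no fixed post-processing stage would have no equivalent of `apply_house_filter`, which is the crux of the commenter's argument.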
vunderba · 5 months ago
As somebody who actually tried to build a multimodal stable diffusion chat agent about a year back using YOLO to build partial masks for adjustments via inpainting, dynamic controlnets, and a whole host of other things, I highly doubt that it's as simple as an agentic process.

Using the prompt to detect and choose the most appropriate model checkpoint and LoRa(s) along with rewriting a prompt to most appropriately suit the chosen model has been pretty bog standard for a long time now.

echelon · 5 months ago
> Using the prompt to detect and choose the most appropriate model checkpoint and LoRa(s) along with rewriting a prompt to most appropriately suit the chosen model has been pretty bog standard for a long time now.

Which players are doing this? I haven't heard of this approach at all.

Most artistic interfaces want you to visually select a style (LoRA, Midjourney sref, etc.) and will load these under the hood. But it's explicit behavior controlled by the user.

nialv7 · 5 months ago
None of your observations say anything about how these images are generated one way or another.

The only thing we currently have to go off of is OpenAI's own words, which claims the images are generated by a single multimodal model autoregressively, and I don't think they are lying.

pclmulqdq · 5 months ago
Generated autoregressively and generated in one shot are not the same. There is a possibility that there is a feedback loop here. Personally, I wouldn't be surprised if there was a small one, but not nearly the complex agentic workflow that OP may be thinking of.
Suppafly · 5 months ago
>This is because in the workflow that is one of the steps that is naively applied without consideration of multi edit possibility. If this was a one shot solution where editing is done within 4o image model by itself, the sepia problem wouldn't be there.

I don't really see that with ChatGPT. What I do see is that it's presumably re-running the same basic query with whatever you said changed each time, instead of modifying the existing image. Like if you say "generate a photo of a woman", get a pic, and then say "make her hair blonde", the new image is likely to also have different facial features.

renewiltord · 5 months ago
The prompt enrichment thing is pretty standard. Everyone does that bit, though some make it user-visible. On Grok it used to populate to the frontend via the download name on the image. The image editing is interesting.
genewitch · 5 months ago
All the stable diffusion software I've used names the files after some form of the prompt, probably because SD weights the first tokens higher than the last, likely as a side effect of the way CLIP/BLIP works.

I doubt any of these companies have rolled their own interface to stable diffusion / transformers. It's copy and paste from huggingface all the way down.

I'm still waiting for a confirmed Diffusion Language Model to be released as gguf that works with llama.cpp

diggan · 5 months ago
> There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot

I thought this was obvious? At least from the first time (and only time) I used it, you can clearly see that it's not just creating one image based on the prompt, but instead it first creates a canvas for everything to fit into, then it generates piece by piece, with some coordinator deciding the workflow.

Don't think we need evidence either way when it's so obvious from using it and what you can see while it generates the "collage" of images.

andy12_ · 5 months ago
I mean, it could very well be that it generates image patches autoregressively, but in a pyramidal way (first a very low resolution version, the "canvas", and then each individual patch). This is very similar to VAR [1]

We can't really be sure until OpenAI tells us.

[1] https://arxiv.org/pdf/2404.02905

Voloskaya · 5 months ago
> This is because in the workflow that is one of the steps that is naively applied without consideration of multi edit possibility.

Unconvinced by that, tbh. This could simply be a bias in the encoder/decoder or the model itself; many image generation models have shown behaviour like this. It's also unclear why a sepia filter would always be applied if it were a workflow - what would be the point?

Personally, I don't believe this is just an agentic workflow. Agentic workflows can't really do anything a human couldn't do manually; they just make the process much faster. I spent 2 years working with image models, specifically around controllability of the output, and there is just no way of getting this kind of edit with a regular diffusion model just through smarter prompting or other tricks. So I don't see how an agentic workflow would help.

I think you can only get there via a true multimodal model.

lawlessone · 5 months ago
Huh, I was thinking myself, based on how it looked, that it was doing layers too. The blurred backgrounds with sharp cartoon characters in front are what made me think this is how they do it.
probably_wrong · 5 months ago
> Is it okay to reproduce the hard-won style of other artists using AI? Who owns the resulting art? Who profits from it? Which artists are in the training data for AI, and what is the legal and ethical status of using copyrighted work for training? These were important questions before multimodal AI, but now developing answers to them is increasingly urgent.

I have to disagree with the conclusion. This was an important discussion to have two to three years ago, then we had it online, and then we more or less agreed that it's unfair for artists to have their works sucked up with no recourse.

What the post should say is "we know that this is unfair to artists, but the tech companies are making too much money from them and we have no way to force them to change".

Taek · 5 months ago
I don't think there's consensus around that idea. Lots of people (myself included) feel that copyright is already vastly overreaching, and that AI represents forward progress for the proliferation of art in society (it's crap today, but digital cameras were crap in 2007, and look where they are now).

It's also not clear, for example, that Studio Ghibli lost by having their art style plastered all over the internet. I went home and watched a Ghibli film that week, as I'm sure many others did as well. Their revenue is probably up quite a bit right now?

"How can we monetize art" remains an open question for society, but I certainly don't think that AI without restrictions is going to lead to fewer people with art jobs.

kelseyfrog · 5 months ago
I'd take it farther to say that copyright and intellectual property is a legal fiction that ultimately benefits the wealthy[those who can pay to legally enforce it] over small artists.

Small artists get paid to create the art; corporations benefit from exclusivity.

thwarted · 5 months ago
> It's also not clear, for example, that Studio Ghibli lost by having their art style plastered all over the internet. I went home and watched a Ghibli film that week, as I'm sure many others did as well. Their revenue is probably up quite a bit right now?

This sounds like a rewording of "You won't get paid, but this is a great opportunity for you because you'll get exposure".

DeathArrow · 5 months ago
>It's also not clear, for example, that Studio Ghibli lost by having their art style plastered all over the internet.

Maybe Studio Ghibli is much more than merely a style. Maybe people aren't looking at their production just for the style.

Most people dislike wearing fake clothes, and they dislike wearing fake watches or fake jewelry. Because it isn't just about the style.

__loam · 5 months ago
Nearly every artist I've spoken to or have seen talk about this technology says it's evil, so at least among the victims of this corporate abuse of the creative community, there's wide consensus that it's bad.

> but I certainly don't think that AI without restrictions is going to lead to fewer people with art jobs.

It's great that you think that but in reality a lot of artists are saying they're getting less work these days. Maybe that's the result of a shitty economy but I find it very difficult to believe this technology isn't actively stealing work from people.

adamredwoods · 5 months ago
The Ghibli style took humans decades to refine and create. All that respect and adoration for the craft and artists and the time it took, is now gone in an instant, making it a shallow trivial thing. Worse is to have another company exploit it with no regard for the ones who helped make it a reality.

The threat of AI produced art will forever trivialise human artistic capabilities. The reality is: why bother when it can be done faster and cheaper? The next generation will leverage it, and those skills will be very rare. It is the nature of technology to do this.

mrdependable · 5 months ago
Studio Ghibli might not have been affected yet, but only because the technology is not there yet. What's going to happen when someone can make a competing movie in their style with just a prompt? Should we all just be okay with it because it's been decided that Studio Ghibli has made enough money?

If the effort required to create that can just be ingested by a machine and replicated without consequence, how would it be viable for someone to justify that kind of investment? Where would the next evolution of the art form come from? Even if some company put in the time to create something amazing using AI that does require an investment, the precedent is that it can just be ingested and copied without consequence.

I think aside from what is legal, we need to think about what kind of world we want to live in. We can already plainly see what social media has done to the world. What do you honestly think the world will look like once this plays out?

wavemode · 5 months ago
Companies like Studio Ghibli are not being harmed by AI, small freelance artists are.
mycall · 5 months ago
> "How can we monetize art" remains an open question for society

Yet much of the best art, imho, is out in the wild, exposed to the elements while being at home in some random place. Or perhaps forgotten and displaced in someone's collection. Art's worth will always be an open question.

hnbad · 5 months ago
Copyright is a logical consequence of property rights. I'd agree that property rights hold back industry and trade but if you want to abolish property rights, you first have to decommodify the essentials like food, housing, public infrastructure and healthcare, because unleashing the market when it has control over all of these is going to have some very undesirable consequences.
eadmund · 5 months ago
> it's unfair for artists to have their works sucked up

I never thought it was unfair to artists for others to look at their work and imitate it. That seems to me to be what artists have been doing since the second caveman looked at a hand painting on a cave wall and thought, ‘huh, that’s pretty neat! I’d like to try my hand at that!’

SirMaster · 5 months ago
You don't see the sheer number of images the AI can look at, and the speed at which it can imitate them, as a fundamental difference between an AI and a human copying works or styles?

For a human it took a lot of practice and a lot of time and effort. But now it takes practically no time or effort at all.

BriggyDwiggs42 · 5 months ago
Right the difference is that it’s a large company looking at it then copying it and reselling it without credit, which basically everyone would understand as bad without the indirection of a model.

Edit: the key words here are “company” and “reselling”

ZoomZoomZoom · 5 months ago
False: copyism as a career has always been looked down on in the arts community. Learning and reinterpreting is a qualitatively different process.
shkkmo · 5 months ago
> This was an important discussion to have two to three years ago, then we had it online, and then we more or less agreed that it's unfair for artists to have their works sucked up with no recourse.

Speak for yourself, there was no consensus online. There are plenty of us that think that dramatically expanding the power of copyright would be a huge mistake that would primarily benefit larger companies and do little to protect or fund small artists.

OtherShrezzing · 5 months ago
>There are plenty of us that think that dramatically expanding the power of copyright would be a huge mistake that would primarily benefit larger companies and do little to protect or fund small artists.

The status quo also primarily benefits larger companies, and does little (exactly nothing, if we're being earnest) to protect or fund small artists.

It's reasonable to hold both opinions that: 1) artists aren't being compensated, even though their work is being used by these tools, and 2) massive expansion of copyright isn't the appropriate response to 1).

Suppafly · 5 months ago
> and then we more or less agreed that it's unfair for artists to have their works sucked up with no recourse.

No we didn't agree with that.

wat10000 · 5 months ago
“Fair” doesn’t matter. The only consensus that matters is what is legal and profitable. The former seems to be pretty much decided in favor of AI, with some open question about whether large media companies enjoy protections that smaller artists don’t. (The legal battle when some AI company finally decides to let their model imitate Disney stuff is going to be epic.) Profitable remains to be seen, but doesn’t matter much while investors’ money is so plentiful.
__loam · 5 months ago
> The former seems to be pretty much decided in favor of AI

None of the cases against AI companies have been decided afaik. There's a ton of ongoing litigation.

> but doesn’t matter much while investors’ money is so plentiful.

More and more people are realizing how wasteful this stuff is every day.

hnbad · 5 months ago
> What the post should say is "we know that this is unfair to artists, but the tech companies are making too much money from them and we have no way to force them to change".

It seemed a fact of life that companies will just abuse your personal data to their liking and can do what they want with information they collect about you because "if it's free, you're the product" (and even if you paid for it, "you should know better" etc). Then GDPR and its international derivatives came along and changed that.

It seemed a fact of life that companies that technically don't have an actual market monopoly can do whatever they want within their vertically integrated walled gardens because competitors can just create their own vertically integrated walled gardens to compete with them and the rules for markets don't apply to walled gardens. Then the DSA and DMA came along and changed that.

I don't see why legislation can't change this, too. Of course, just as with the GDPR, DSA and DMA, we'll hear from libertarians, megacorps and astroturf movements how unfair it all is to mom & pop startups and how it's going to ruin the economy, but given the angle grinder the US is currently taking to its own economy (and by extension the global economy, because we're all connected), I think that's no longer a valid argument in politics.

DeathArrow · 5 months ago
>> it's unfair for artists to have their works sucked up

What framework can we use to decide if something is fair or not?

Style is not something that should be copyrighted. I can paint in the style of X painter, I can write in the style of Y writer, I can compose music in the style of Z composer.

Everything has a style. Dressing yourself has a style. Speaking has a style. Even writing mathematical proofs can have a style.

Copying another person's style might reflect poor judgement, bad taste and lack of originality but it shouldn't be illegal.

And anyone in the business of art should have much more than a style. He should have original ideas, a vision, a way to tell stories, a way to make people ask themselves questions.

A style is merely a tool. If all someone has is a style, then good luck!

yencabulator · 5 months ago
It's already gone quite a bit further than "style". https://www.404media.co/listen-to-the-ai-generated-ripoff-so...
gosub100 · 5 months ago
In music, someone can sing the same style as another, but if they imitate it to the point that there is brand confusion, where the consumer believes the product came from X when it actually came from Y, that's clearly crossing the line.
shubhamjain · 5 months ago
The Ghibli trend completely missed the real breakthrough — and it’s this. The ability to closely follow text, understand the input image, and maintain context of what’s already there is a massive leap in image generation. While Midjourney delivered visually stunning results, I constantly struggled to get anything specific out of it, making it pretty much useless for actual workflows.

4o is the first image generation model that feels genuinely useful not just for pretty things. It can produce comics, app designs, UI mockups, storyboards, marketing assets, and so on. I saw someone make a multi-panel comic with it with consistent characters. Obviously, it's not perfect. But just getting there 90% is a game changer.

empath75 · 5 months ago
I had ChatGPT generate a flow chart with Mermaid.js for something at work and then write a Scott McCloud-style comic book explaining it in detail, and it looked so convincing, even though it got some of the details a bit wrong. It's _so close_ to making completely usable graphics out of the box.
gcanyon · 5 months ago
It's interesting to hear people side with the artists when in previous discussions on this forum I've gotten significant approval/agreement arguing that copyright is far too long.

As I've argued in the past, I think copyright should last maybe five years: in this modern era, monetizing your work doesn't (usually) have to take more than a short time. I'd happily concede to some sort of renewal process to extend that period, especially if some monetization method is in process. Or some sort of mechanical rights process to replace the "public domain" phase early on. Or something -- I haven't thought about it that deeply.

So thinking about that in this process: everyone is "ghiblifying" things. Studio Ghibli has been around for very nearly 40 years, and their "style" was well established over 35 years ago. To me, that (should) make(s) it fair game.

The underlying assumption, I think, is that all the "starving" artists are being ripped off, but are they? Let's consider the numbers -- there are a handful of large-scale artists whose work is obviously replicable: Ghibli, the Simpsons, Pixar, etc. None of them is going hungry because a machine model can render a prom pic in their style. Then you get the other 99.999% of artists, all of whose work went into the model. They will be hurt, but not specifically because their style has been ingested and people want to replicate their style.

Rather, they will be hurt because no one knows their style, nor cares about it; people just want to be able to say e.g. "Make a charcoal illustration of me in this photo, but make me sitting on a horse in the mountains."

It's very much like the arguments about piracy in the past: 99.99% of people were never going to pay an artist to create that charcoal sketch. The 0.01% who might are arguably causing harm to the artist(s) by not using them to create that thing, but the rest were never going to pay for it in the first place.

All to say it's complicated, and obviously things are changing dramatically, but it's difficult to make the argument that "artists need to be compensated for their work being used to train the model" without both a reasonable plan for how that might be done, and a better-supported argument for why.

ben_w · 5 months ago
Mm.

The arguments about wanting copyright to be life+70 have always felt entitled, to me. Making claims about things for their kids to inherit, when the median person doesn't have the option to build up much of an inheritance anyway, and 70 years isn't just the next generation but the next 2.5 generations.

I don't know the exact duration of copyright that makes sense, the world changes too much and different media behave differently. Feels like nobody should have the right to block remakes of C64 games on copyright grounds, but I wouldn't necessarily say that about books published in the same year.

From what I've seen about the distribution of specifically book sales, where even the top-100 best sellers often don't make enough to justify the time involved, I think that one of the biggest problems with the economics of the arts is a mixture of (1) the low cost of reproduction, and (2) all the other artists.

For the former: There were political campaigns a century ago, warning about the loss of culture when cinemas shifted from live bands to recorded music[0]; Today, if I were so inclined, I can for a pittance listen to any of (I'm told) 100 million musical performances, watch any of 1.24 million movies or TV shows. Even before GenAI, there was a seemingly endless quantity of graphical art.

For the latter: For every new book by a current living author such as Charlie Stross (who is on here sometimes), my limited time is also spread between that and the huge back-catalogue of old classics like the complete works of Conan Doyle, Larry Niven, or Terry Pratchett.

[0] https://www.smithsonianmag.com/history/musicians-wage-war-ag...

cannonpr · 5 months ago
Being someone who has paid a lot of attention to Ghibli, I wouldn’t say their style was well established 35-40 years ago… There is considerable evolution and refinement to their style from Nausicaä to later works, both in the artistic style and the philosophical content it presents.

I think allowing it to be fair game would have destroyed something quite beautiful that I’ve watched evolve across 40 years and which I was hoping to see the natural conclusion of without him being bothered by the AI-fication of his work.

gcanyon · 5 months ago
Yeah, of course their style isn't static, but I was taking Kiki's Delivery Service (1989) as a point where much of their visual style was pretty well-established.
another-dave · 5 months ago
I'd agree with limiting copyrights but would do it based on money earned rather than time, so something like when you make $X million, the work becomes public domain.

As a specific example — _A Game of Thrones_ was released in 1996. It picked up awards early on but only became a NYT best seller in 2011, just before the TV show aired.

It would feel harsh for an author to lose all their copyright because their work was a "slow burn" and 5 years have elapsed but they've made little to no money on it.

pixl97 · 5 months ago
> so something like when you make $X million, the work becomes public domain.

https://en.wikipedia.org/wiki/Hollywood_accounting

No, no metrics that can be gamed.

gcanyon · 5 months ago
It’s a super-interesting idea, but GoT seems highly cherry picked: the vast majority of all works would never leave copyright if the requirement was that they clear even $1000.
Avshalom · 5 months ago
>It's interesting to hear people side with the artists when in previous discussions on this forum I've gotten significant approval arguing that copyright is far too long.

Well broadly that's because most arguments about copyright(length/scope) are made against corporations attacking individual artists and arguments about copyright(AI/scope) are made against corporations attacking individual artists.

Taek · 5 months ago
I find it unlikely that someone who was willing to pay an artist for a charcoal sketch would be satisfied with an AI alternative.

You don't just buy art for the aesthetic, you buy it for a lot of reasons, and AI doesn't give any of the same satisfaction.

zwnow · 5 months ago
I'm all for paying artists for their work. Unfortunately, same as tattoo artists, some just heavily overcharge for mediocre results (been tattooing myself AND I know a few things about art). Like, sorry, but if you want to earn money doing art, please be good at it...
AlecSchueler · 5 months ago
It's one thing to argue that copyright terms should be shortened, and another to accept that a handful of corporations should be able to forcefully shorten it for certain actors entirely on their own terms.
amazingamazing · 5 months ago
> I think copyright should last maybe five years: in this modern era, monetizing your work doesn't (usually) have to take more than a short time. I

funny how people who say this kind of stuff are never content creators (in the monetization sense).

bko · 5 months ago
There are a lot of programmers on this platform (myself included), and I love that my work has an impact on others.

I have a number of public repos and I have benefitted greatly from other public repos. I hope LLMs made some use of my code.

I wrote blogs for years without any monetization. I hope my ideas influenced someone and would be happy if they made some impact on the reasoning of LLMs.

I'm aware of patent trolls and know people with personal experience with them.

So I generate a lot more content that the typical person and I am still in favor of much looser IP rights as I think they have gone overboard and the net benefit for me, a content creator, is much greater having access to the work of others and being able to use tools like LLMs trained on their work.

rikroots · 5 months ago
My personal preference is for (say) 15-20 years.

And, as a content creator, I practice what I preach - at least when it comes to my poetry: https://rikverse2020.rikweb.org.uk/blog/copyrights

gcanyon · 5 months ago
Not that it should impact the validity of my argument, but I have sold commercial software in the past, and it is absurd that that software will be copyrighted through most of the 21st-century.
6510 · 5 months ago
If you make a blog with nice original long form articles it may take much longer to gain traction. Reproducing the content in "your own" wording quickly gets fuzzy.

I like the practical angle. Any formula that requires monitoring what everyone is doing is unworthy of consideration. Appeal to tradition should not apply.

bongodongobob · 5 months ago
The entitlement of the modern artist/musician is unprecedented. Never have I seen so many people expect to be handed a living because they've posted some "content". Musicians and artists now have global distribution with a plethora of platforms. You have to harness that and then work and grind it out. You have to travel and play gigs and set up a booth at art shows.

There's this new expectation that you should just be able to post some music on Spotify or set up an Etsy shop and get significant passive income. It has never ever worked that way and I feel this new expectation comes from the hustle/influencer types selling it.

Most art is crap and most music isn't worth listening to. In the modern age, it's easy for anyone to be a band or artist and the ability to do this has led to a ton of choice, the market is absolutely flooded. If anyone can do a thing (for very loose values of "do") it's inherently worth less. Only the very best make it and it will always be that way.

Source: made a living as a musician for 20 years. The ones who make it are relentlessly marketing themselves in person. You have to leave the house, be a part of a scene, and be constantly looking for opportunities. No one comes knocking on your door, you must drive your product and make yourself stand out in some way. You make money on merch and gigs, and it's always been that way.

This is all to say that copyright law only affects the top 0.1%. The avg struggling artist will never have to worry about any of this. It's like Bob the mechanic worrying about inheritance taxes. Pipe dream at best.

pixl97 · 5 months ago
I mean, this is about as useful as saying anti-slavery people should become slave owners so they understand the hardships of making money.

My example is extreme to the absurd, so how about we go with

>It's difficult to get a man to understand something when his salary depends on not understanding it.

Workaccount2 · 5 months ago
As you grow older and run through more cycles of general opinions, you realize that pretty much everyone is in it for themselves, what serves them best, and support what narrative aligns with that.

2007: Copyright is garbage and must be abolished (so I can get music/movies free)

2025: Copyright needs to be strengthened (so my artistic abilities retain value)

Der_Einzige · 5 months ago
Correct. This is why Stirner is the best Philosopher. https://en.wikipedia.org/wiki/The_Ego_and_Its_Own

There is nothing other than Egoism.

otabdeveloper4 · 5 months ago
Either abolish it, or make it stronger so it applies to everyone. (OpenAI included.)

"Copyright for thee but not for me" is the worst of all worlds.

UncleEntity · 5 months ago
You forgot:

2024: What do you mean I can't copyright AI generated artwork?

ZoomZoomZoom · 5 months ago
> I think copyright should last maybe five years

If we imagine for a moment that "copyright" is something that works in the interests of a creator, then 5 years is nothing.

A painting can sit fifteen years before it gets to an exhibition with sufficient turnout and media coverage to draw attention to it.

A music album can be released with a shitty label and no support, years later be taken by a more competent one and start selling.

We're living in a world where worthy art is constantly flying under the radar, so limiting its potential even more isn't helpful.

mrdependable · 5 months ago
This sounds more like a problem with you specifically not seeing value in art. Why would you want the incentives not to work in their favor?
jeffreygoesto · 5 months ago
I thought the US wants to re-industrialize now? Then 5 years is laughably short to protect your investment.
sfn42 · 5 months ago
Finally a reasonable take.
xrd · 5 months ago
The Ghibli fight is the same fight that is being fought in the NASDAQ. That is to say, there was an established set of rules that everyone thought were fixed and now they are being radically disrupted. Both the creative industry and the general business industry are trying to figure out what life is going to be like with a totally different and fluid set of regulations, whether it be copyright law or tariffs.

No wonder sama and Trump are so cozy. They both see the same legacy.

haswell · 5 months ago
> The question isn't whether these tools will change visual media, but whether we'll be thoughtful enough to shape that change intentionally.

Unfortunately I think the answer to this question is a resounding “no”.

The time for thoughtful shaping was a few years ago. It feels like we’re hurtling toward a future where instead we’ll be left picking up the pieces and assessing the damage.

These tools are impressive and will undoubtedly unlock new possibilities for existing artists and for people who are otherwise unable to create art.

But I think it’s going to be a rough ride, and whatever new equilibrium we reach will be the result of much turmoil.

Employment for artists won’t disappear, but certain segments of the market will just use AI because it’s faster, cheaper, and doesn’t require time consuming iterations and communication of vision. The results will be “good enough” for many.

I say this as someone who has found these tools incredibly helpful for thinking. I have aphantasia, and my ability to visualize via AI is pretty remarkable. But I can’t bring myself to actually publish these visualizations. A growing number of blogs and YouTube channels don’t share these qualms and every time I encounter them in the wild I feel an “ick”. It’ll be interesting to see if more people develop this feeling.

pixl97 · 5 months ago
>But I think it’s going to be a rough ride, and whatever new equilibrium we reach will be the result of much turmoil.

Honestly visual media just seems to be the start. In the past two years we've seen about as much robotics progress as the last 20. If this momentum keeps up then we're not just talking about artists that are going to have issues.

TheGrognardling · 5 months ago
Honestly, I'm pretty encouraged by all of the projects and efforts within legislation and organizations regarding clear lines being drawn - i.e., through watermarking to clearly label whether something is AI-generated or not - as well as the efforts by industries for livelihoods to be protected, specifically in the creative space, where human intentionality and feeling are still of the utmost essentiality. We've seen, are seeing, and will see cultural and societal acceptance of and backlash against one thing or another, but I'm confident that we will adapt. Pushback, thanks to the Web itself, is already pretty monumental among artists and even other AI researchers; regulation of earlier technologies, which lacked the Web to organize around, was far slower to materialize. I remain optimistic that we will find the niches where AI is needed, where it isn't, and where it is detrimental.
justinator · 5 months ago
But the annotations are still wrong,

https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_...

(nice URL btw)

The room, the door, the ceiling are all of a scale to fit many sizes of elephants.

sfn42 · 5 months ago
The lines don't really make sense either, like the one above the sofa should probably go along the corner between the floor and wall.
justinator · 5 months ago
Just like text results from AI, as images like this get better, the increasingly subtle yet absolute wrongness is going to be a nightmare to deal with.

Imagine I ask AI to show me a sewer cap that's less than a foot wide (or whatever, I dunno, watching TMNT right now). And it does, just by showing a photorealistic sewer cap next to a ruler whose markings only go up to 8 inches. That doesn't mean sewer caps come in that size; it just means you can produce a rendered image to fit what you asked for.