To give a brief glimpse into just how powerful multimodal LLMs / GenAI can be compared with more traditional image generation systems like Stable Diffusion, here's a transcript of a conversation with 4o involving reasoning plus image generation around the famous French painter Claude Monet: https://specularrealms.com/ai-transcripts/monets-rainbow
That is neat, but ChatGPT, Gemini, etc. are not even remotely close to being in the same league as models dedicated to image (and video) generation like Stable Diffusion or Black Forest Labs' FLUX. I don't think they ever will be, either.
GPT-4o is the most impressive image model in the world, and it represents a full step-function change in our capabilities. Images are over now.
Flux, fine-tuning, inpainting/outpainting, and ComfyUI are effectively dead. I can show 4o a scribble and it does all the editing for me. Comfy is a total hack compared to that. It's not necessary in the new world.
Every image model and image product that has raised capital is effectively worthless / back at ground zero. They're all inferior to this.
Civitai, Leonardo, OpenArt, Invoke - this is an extinction event for them. They're all worthless products now.
I expect the same to happen to video soon.
If an open weights version of 4o comes out, then I really don't think there will be a product moat for anyone in media.
I've been experimenting with GenAI for images since basically SD 1.5, so I do speak with at least some level of experience.
I host and run the full FLUX.1-dev model (along with various checkpoint merges such as STOIQO/Chroma/Pixelwave) daily, but 4o's multimodal image generation is LEAGUES ahead of Flux in terms of prompt adherence.
I put together a website that compares genuinely complex prompts across the SOTA models (Imagen 3, Flux, MJ7, and 4o): https://genai-showdown.specr.net
Looks like a cool paper. It's really puzzling to me why Llama turned out to be so bad even though they're releasing great research. Especially considering the amount of GPUs they have, Llama really seems inexcusable when compared to Loma from much smaller teams with far fewer resources.
Llama will advance further, just like the rest. The leaderboards for LLMs are a constantly changing thing. They will all reach a maturity point and be roughly the same; that's probably something we'll see in the next 1-3 years, tops. Then it'll just be incremental price drops in the cost to train and run, but the quality will all be comparable. Not to mention we're already running out of training data.
It was always about more than GPUs: even when the original Llama came out, the community released fine-tunes that would bench higher than the base model. And with the DeepSeek distilled models, it turned out you could fine-tune some reasoning into a base model and make it perform better.
Just read the latest MetaQueries paper. I have some thoughts and will list them here.
Building AI that gets images and creates them in one go (unified models) is the dream. But reality bites: current approaches often mean complex training, weird performance trade-offs (better generation kills understanding?), and clunky control. Just look at the hoops papers like ILLUME+, Harmon, MetaQuery, and Emu3 jump through.
So, what's next? Maybe the answer isn't one giant model trained from scratch (looking at you, Emu3/Chameleon style). The trend, hinted by stuff like GPT-4o and proven by MetaQuery, looks modular.
Prediction 1: Modularity Wins.
Forget monolithic monsters. The smart play seems to be connecting the best pre-trained parts:
Grab a top-tier MLLM (like Qwen, Llama-VL) as the "brain." It already understands vision and language incredibly well.
Plug it into a SOTA generator (diffusion like Stable Diffusion/Sana, or a killer visual tokenizer/decoder if you prefer LLM-native generation) as the "hand."
MetaQuery showed this works shockingly well even keeping the MLLM frozen. Way cheaper and faster than training from zero.
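To make the recipe concrete, here is a minimal PyTorch sketch of the "frozen brain + small trainable connector + generator hand" pattern. This is an assumed illustration, not MetaQuery's released code: the stub MLLM, module names, and sizes are placeholders, and in practice you would load a real pre-trained checkpoint and a real diffusion backbone.

```python
# Minimal sketch (assumed, not MetaQuery's actual code) of the modular recipe:
# a frozen multimodal LLM "brain", a small trainable connector, and a pre-trained
# generator "hand" that consumes the connector's output as conditioning.
import torch
import torch.nn as nn

class FrozenMLLMStub(nn.Module):
    """Stand-in for a pre-trained MLLM; weights stay frozen."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:  # (B, T, hidden)
        return self.backbone(token_embeds)

class Connector(nn.Module):
    """The only new weights: learnable query tokens plus a projection into the
    conditioning space the generator's cross-attention expects."""
    def __init__(self, hidden: int = 1024, cond_dim: int = 768, n_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, hidden) * 0.02)
        self.proj = nn.Sequential(nn.Linear(hidden, cond_dim), nn.GELU(),
                                  nn.Linear(cond_dim, cond_dim))

    def forward(self, mllm: nn.Module, prompt_embeds: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(prompt_embeds.size(0), -1, -1)
        states = mllm(torch.cat([prompt_embeds, q], dim=1))  # run the queries through the frozen brain
        return self.proj(states[:, -q.size(1):])             # (B, n_queries, cond_dim) for the generator

mllm, bridge = FrozenMLLMStub(), Connector()
cond = bridge(mllm, torch.randn(2, 32, 1024))  # feed `cond` to a diffusion model as cross-attention context
```

Only the connector (and optionally the generator) ever receives gradients, which is where the "way cheaper and faster than training from zero" advantage comes from.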
Prediction 2: Pre-trained Everything.
Why reinvent the wheel? Leverage existing SOTA MLLMs and generators. The real work shifts from building the core components to connecting them efficiently. Expect more focus on clever adapters, connectors, and interfaces (MetaQuery's core idea, ILLUME+'s adapters). This lowers the bar and speeds up innovation.
Prediction 3: Generation Heads Don't Matter (as much). Understanding Does.
LLM Head (predicting visual tokens like Emu3/ILLUME+) vs. Diffusion Head (driving diffusion like MetaQuery/ILLUME+ option)? This might become a flexible choice based on speed/quality needs, not a fundamental religious war. ILLUME+'s optional diffusion decoder hints at this.
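As a rough illustration of why this can become a configuration choice rather than an architectural commitment, both kinds of heads can sit behind the same interface. The class names below are hypothetical and the bodies are stubs standing in for a denoising loop and a token decoder, not real implementations.

```python
# Illustrative only: a diffusion head and a visual-token head behind one interface,
# so the "which head?" question is answered per deployment, not per architecture.
from abc import ABC, abstractmethod
import torch

class GenerationHead(ABC):
    @abstractmethod
    def generate(self, cond: torch.Tensor) -> torch.Tensor:
        """cond: (B, N, D) control tokens from the MLLM -> (B, 3, H, W) image tensor."""

class DiffusionHead(GenerationHead):
    def generate(self, cond: torch.Tensor) -> torch.Tensor:
        # would run an iterative denoising loop with `cond` as cross-attention context
        return torch.zeros(cond.size(0), 3, 256, 256)

class VisualTokenHead(GenerationHead):
    def generate(self, cond: torch.Tensor) -> torch.Tensor:
        # would autoregressively predict discrete visual tokens, then decode them to pixels
        return torch.zeros(cond.size(0), 3, 256, 256)
```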
The real bottleneck isn't the pixel renderer, it's the quality of the control signal. This is where the MLLM brain shines. Diffusion models are amazing renderers but dumb reasoners. A powerful MLLM can:
Understand complex, nuanced instructions.
Inject world knowledge and common sense (MetaQuery proved this: frozen MLLM guided diffusion to draw things needing reasoning).
Potentially output weighted or prioritized control signals (inspired by how fixing attention maps, like in Leffa, boosts detail control – the MLLM could provide that high-level guidance).
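A hypothetical sketch of that last point: a small head predicts a salience weight per control token, and the re-weighted tokens are what the generator's cross-attention actually sees. Nothing here comes from Leffa or MetaQuery; the shapes and names are purely illustrative.

```python
# Hypothetical "weighted control signals": the brain (or a head on top of it)
# decides how strongly each control token should steer generation.
import torch
import torch.nn as nn

class WeightedConditioning(nn.Module):
    def __init__(self, cond_dim: int = 768):
        super().__init__()
        self.salience = nn.Sequential(nn.Linear(cond_dim, 1), nn.Sigmoid())

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:  # (B, N, cond_dim)
        w = self.salience(cond_tokens)          # (B, N, 1), each weight in (0, 1)
        return cond_tokens * (0.5 + w)          # emphasize or de-emphasize each control token

cond = torch.randn(2, 64, 768)                  # e.g. the connector output from the earlier sketch
weighted = WeightedConditioning()(cond)         # pass to the generator as cross-attention context
```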
The Payoff: Understanding-Driven Control.
This modular, understanding-first approach unlocks:
Truly fine-grained editing.
Generation based on knowledge and reasoning, not just text matching.
Complex instruction following for advanced tasks (subject locking, style mixing, etc.).
Hurdles: Still need better/faster interfaces, good control-focused training data (MetaQuery's mining idea is key), better evals than FID/CLIP, and faster inference.
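On the eval point, one direction that goes beyond FID/CLIP is decomposing each complex prompt into concrete yes/no checks and letting a VQA-capable model grade the generated image against them. The sketch below assumes a hypothetical ask_vlm callable; it is not an existing library API.

```python
# Sketch of a prompt-adherence eval beyond FID/CLIP. `ask_vlm` is a placeholder
# for whatever captioning/VQA model you have available.
from typing import Callable

def adherence_score(image, checks: list[str],
                    ask_vlm: Callable[[object, str], bool]) -> float:
    """Fraction of prompt-derived checks the generated image actually satisfies."""
    passed = sum(1 for q in checks if ask_vlm(image, q))
    return passed / max(len(checks), 1)

# A "complex prompt" becomes a list of checkable assertions, e.g.:
checks = [
    "Is there a rainbow arching over the water?",
    "Is the scene rendered in an impressionist style?",
    "Are there exactly three boats visible?",
]
```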
TL;DR: Future text-to-image looks modular. Use the best pre-trained MLLM brain, connect it smartly to the best generator hand (diffusion or token-based). Let deep understanding drive precise creation. Less focus on one model to rule them all, more on intelligent integration.
> The trend, hinted by stuff like GPT-4o and proven by MetaQuery, looks modular.
Could you make this more explicit? What modularity is hinted at by 4o? The OpenAI blog post you cite (and anything else I've casually heard about it) seems to only imply the opposite.