Badly needs inline chat to be complete.
So, what's next? Maybe the answer isn't one giant model trained from scratch (looking at you, Emu3/Chameleon style). The trend, hinted at by the likes of GPT-4o and demonstrated by MetaQuery, looks modular.
Prediction 1: Modularity Wins. Forget monolithic monsters. The smart play seems to be connecting the best pre-trained parts:
Grab a top-tier MLLM (like Qwen, Llama-VL) as the "brain." It already understands vision and language incredibly well.
Plug it into a SOTA generator (diffusion like Stable Diffusion/Sana, or a killer visual tokenizer/decoder if you prefer LLM-native generation) as the "hand." MetaQuery showed this works shockingly well even while keeping the MLLM frozen, and it's way cheaper and faster than training from zero (rough sketch below).
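To make that wiring concrete, here is a minimal PyTorch-style sketch of a MetaQuery-flavored setup: frozen MLLM "brain," a small trainable bridge, frozen generator "hand." Everything here is illustrative; the class names, dimensions, the `inputs_embeds`/`last_hidden_state` interface, and the `diffusion(cond=...)` call are placeholder assumptions, not any paper's actual API.

```python
import torch
import torch.nn as nn

class ModularT2I(nn.Module):
    """Frozen MLLM "brain" + trainable bridge + frozen diffusion "hand".
    Hypothetical wiring in the spirit of MetaQuery; not its actual code."""

    def __init__(self, mllm, diffusion, num_queries=64, mllm_dim=4096, cond_dim=2048):
        super().__init__()
        self.mllm = mllm.eval()        # pre-trained MLLM, kept frozen
        self.diffusion = diffusion     # pre-trained diffusion decoder, also frozen
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        for p in self.diffusion.parameters():
            p.requires_grad_(False)
        # Learnable query tokens appended to the prompt; only these and the
        # connector get trained.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        self.connector = nn.Linear(mllm_dim, cond_dim)

    def forward(self, prompt_embeds):
        # prompt_embeds: (B, T, mllm_dim) embedded instruction tokens
        B = prompt_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompt_embeds, q], dim=1)
        with torch.no_grad():
            # Assumed HF-style interface; swap in your MLLM's real call.
            h = self.mllm(inputs_embeds=x).last_hidden_state
        # Hidden states at the query positions become the control signal.
        cond = self.connector(h[:, -q.size(1):, :])
        # Hand the conditioning to the generator (placeholder call).
        return self.diffusion(cond=cond)
```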
Prediction 2: Pre-trained Everything. Why reinvent the wheel? Leverage existing SOTA MLLMs and generators. The real work shifts from building the core components to connecting them efficiently. Expect more focus on clever adapters, connectors, and interfaces (MetaQuery's core idea, ILLUME+'s adapters). This lowers the barrier to entry and speeds up iteration.
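And the connector doesn't have to be fancy. The single linear layer in the sketch above could be swapped for a couple of transformer layers plus a projection. Again purely illustrative, with made-up dimensions and depth:

```python
import torch.nn as nn

class Connector(nn.Module):
    """Tiny trainable bridge from MLLM hidden states to the generator's
    conditioning space. Hypothetical; depth and widths are illustrative."""

    def __init__(self, mllm_dim=4096, cond_dim=2048, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=mllm_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, query_states):  # (B, num_queries, mllm_dim)
        return self.proj(self.encoder(query_states))
```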
Prediction 3: Generation Heads Don't Matter (as much). Understanding Does. LLM head (predicting visual tokens, as in Emu3/ILLUME+) vs. diffusion head (driving a diffusion decoder, as in MetaQuery or ILLUME+'s optional path)? This might become a flexible choice based on speed/quality needs, not a fundamental religious war. ILLUME+'s optional diffusion decoder already hints at this.
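One way to picture why the head is a pluggable choice: the same MLLM-side control states could feed either a token head or a diffusion-conditioning head behind a single tiny interface. Names, shapes, and the `decoder` call below are invented for the example, not taken from any of these systems.

```python
import torch.nn as nn

class VisualTokenHead(nn.Module):
    """Emu3/ILLUME+-style flavor: predict discrete visual codebook tokens."""
    def __init__(self, mllm_dim=4096, codebook_size=16384):
        super().__init__()
        self.to_logits = nn.Linear(mllm_dim, codebook_size)

    def forward(self, control_states):          # (B, N, mllm_dim)
        return self.to_logits(control_states)   # logits over visual codes

class DiffusionCondHead(nn.Module):
    """MetaQuery-style flavor: map control states to diffusion conditioning."""
    def __init__(self, mllm_dim=4096, cond_dim=2048):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, control_states):
        return self.proj(control_states)        # cross-attention conditioning

# The rest of the stack doesn't care which head is plugged in:
def render(control_states, head, decoder):
    return decoder(head(control_states))
```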
The real bottleneck isn't the pixel renderer; it's the quality of the control signal. This is where the MLLM brain shines. Diffusion models are amazing renderers but dumb reasoners. A powerful MLLM can:
Understand complex, nuanced instructions.
Inject world knowledge and common sense (MetaQuery showed this: a frozen MLLM guided the diffusion model to draw things that require reasoning).
Potentially output weighted or prioritized control signals (inspired by how constraining attention maps, as in Leffa, boosts detail control; the MLLM could provide that kind of high-level guidance. A toy sketch follows this list).
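Purely speculative, but here is a toy sketch of what "weighted control signals" could look like: the MLLM side also emits a per-token importance score that scales the conditioning before it reaches the generator's cross-attention. This is an illustration of the idea only, not Leffa's or MetaQuery's actual mechanism.

```python
import torch
import torch.nn as nn

class WeightedControl(nn.Module):
    """Toy idea: emit a scalar importance per control token and scale the
    conditioning with it, so high-priority tokens get more influence in the
    generator's cross-attention. Not taken from any paper."""

    def __init__(self, mllm_dim=4096, cond_dim=2048):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)
        self.weight = nn.Linear(mllm_dim, 1)

    def forward(self, control_states):                  # (B, N, mllm_dim)
        cond = self.proj(control_states)                 # (B, N, cond_dim)
        w = torch.sigmoid(self.weight(control_states))   # (B, N, 1) in [0, 1]
        return cond * w                                  # down-weight low-priority tokens
```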
The Payoff: Understanding-Driven Control. This modular, understanding-first approach unlocks:
Truly fine-grained editing.
Generation based on knowledge and reasoning, not just text matching.
Complex instruction following for advanced tasks (subject locking, style mixing, etc.).
Hurdles: Still need better/faster interfaces, good control-focused training data (MetaQuery's mining idea is key), better evals than FID/CLIP, and faster inference.
TL;DR: Future text-to-image looks modular. Use the best pre-trained MLLM brain, connect it smartly to the best generator hand (diffusion or token-based). Let deep understanding drive precise creation. Less focus on one model to rule them all, more on intelligent integration.
Could you make this more explicit? What modularity is hinted at by 4o? The OpenAI blog post you cite (and anything else I've casually heard about it) seems to only imply the opposite.
I have a theory that a disproportionately large obstacle to Nix understanding and adoption is its use of ; in a way that is just subtly different, right in the uncanny valley, from what ; means in any other language.
The default autogenerated configuration file everyone starts from immediately hits you with:
environment.systemPackages = with pkgs; [ foo ];
How is that supposed to read as a single expression in a pure functional language?