Qwen-Image: Crafting with native text rendering

Not sure why this isn’t a bigger deal. It seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge.

vunderba · 21 days ago

I've been playing around with it for the past hour. It's really good, but from my preliminary testing it definitely falls short of gpt-image-1 (or even Imagen 3/4) where reasonably complex strict prompt adherence is concerned. It scored around 50% where gpt-image-1 scored around 75%. It couldn't handle the maze, Schrödinger's equation, etc.

https://genai-showdown.specr.net

nilsherzig · 20 days ago

Great work, thanks.

Midjourney’s images are the only ones that don’t make me uncomfortable (most of the time); hopefully they can fix their prompt adherence.

imcritic · 20 days ago

Thanks for that comparison/overview/benchmark!

However, you have mistakenly marked some answers as correct in the octopus prompt: only one generated image has an octopus with sock puppets on all of its tentacles, and you marked that image as incorrect because the socks look more like gloves.

xarope · 20 days ago

interesting to see how many still can't handle the nine pointed star correctly

cubefox · 20 days ago

Prompt idea: "A person holding a wooden Penrose triangle." Only GPT-4o image generation is able to make Penrose triangles, as far as I can tell.

bavell · 20 days ago

Fantastic comparisons! Great to see the limitations of the latest models.

supermatt · 20 days ago

Is it fair to call the OpenAI octopus “real”?

jug · 21 days ago

I think it does way more than gpt-image-1 too?

Besides style transfer, object additions and removals, text editing, manipulation of human poses, it also supports object detection, semantic segmentation, depth/edge estimation, super-resolution and novel view synthesis (NVS) i.e. synthesizing new perspectives from a base image. It’s quite a smorgasbord!

Early results indicate to me that gpt-image-1 has a bit better sharpness and clarity but I’m honestly not sure if OpenAI doesn’t simply do some basic unsharp mask or something as a post-processing step? I’ve always felt suspicious about that, because the sharpness seems oddly uniform even in out-of-focus areas? And sometimes a bit much, even.
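For anyone unfamiliar with the unsharp mask mentioned above, here is a minimal pure-Python sketch of the idea on a grayscale pixel grid. It is only an illustration of the general technique (original plus a scaled difference from a blurred copy), not a claim about what OpenAI actually does; the 3x3 box blur and the `amount` parameter are assumptions chosen for brevity.

```python
def box_blur(img):
    """3x3 box blur on a 2D list of floats, clamping at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def unsharp_mask(img, amount=1.0):
    """sharpened = original + amount * (original - blurred), clipped to 0..255."""
    blurred = box_blur(img)
    return [
        [min(255.0, max(0.0, p + amount * (p - b))) for p, b in zip(row, brow)]
        for row, brow in zip(img, blurred)
    ]
```

Because the filter is applied identically to every pixel, it exaggerates edges everywhere, including regions that were meant to be soft, which would be consistent with the "oddly uniform" sharpness described above.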

Otherwise, yeah this one looks about as good.

Which is impressive! I thought OpenAI had a lead here from their unique image generation solution that’d last them this year at least.

Oh, and Flux Krea has lasted four days since announcement! In case this one is truly similar in quality to gpt-image-1.

jacooper · 21 days ago

Not to mention, flux models are for non-commercial use only.

hleszek · 21 days ago

It's not clear from their page but the editing model is not released yet: https://github.com/QwenLM/Qwen-Image/issues/3#issuecomment-3...

zamadatix · 21 days ago

It's only been a few hours and the demo is constantly erroring out, people need more time to actually play with it before getting excited. Some quantized GGUFs + various comfy workflows will also likely be a big factor for this one since people will want to run it locally but it's pretty large compared to other models. Funnily enough, the main comparison to draw might be between Alibaba and Alibaba. I.e. using Wan 2.2 for image generation has been an extremely popular choice, so most will want to know how big a leap Qwen-Image is from that rather than Flux.

The best time to judge how good a new image model actually is seems to be about a week from launch. That's when enough pieces have fallen into place that people have had a chance to really mess with it and come out with 3rd party pros/cons of the models. Looking hopeful for this one though!

rushingcreek · 21 days ago

I spun up an H100 on Voltage Park to give it a try in an isolated environment. It's really, really good. The only area where it seems less strong than gpt-image-1 is in generating images of UI (e.g. make me a landing page for Product Hunt in the style of Studio Ghibli), but other than that, I am impressed.

tetraodonpuffer · 21 days ago

I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

As an aside, I am not sure why for LLM models the technology to spread among multiple cards is quite mature, while for image models, despite also using GGUFs, this has not been the case. Maybe as image models become bigger there will be more of a push to implement it.

reissbaker · 21 days ago

40GB is small IMO: you can run it on a mid-tier MacBook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference; Nvidia only becomes economical at very high throughput, i.e. dedicated inference companies. Apple Silicon is much more cost-effective for a single user running small-to-medium-sized models. The M3 Ultra is roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090.

Also for a 20B model, you only really need 20GB of VRAM: FP8 is near-identical to FP16, it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end Macbook Pro would work as well. And a 5090 should be able to handle it with room to spare as well.

cma · 21 days ago

If it's 40GB, you can lightly quantize it and fit it on a 5090.

TacticalCoder · 21 days ago

> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which btw are close to SOTA: they also beat proprietary models on several benchmarks).

minimaxir · 21 days ago

With the notable exception of gpt-image-1, discussion about AI image generation has become much less popular. I suspect it's a function of a) AI discourse being dominated by AI agents/vibe coding and b) the increasing social stigma of AI image generation.

Flux Kontext was a gamechanger release for image editing and it can do some absurd things, but it's still relatively unknown. Qwen-Image, with its more permissive license, could lead to much more innovation once the editing model is released.

ants_everywhere · 21 days ago

There's no social stigma to using AI image generation.

There is what's probably better described as a bullying campaign. People tried the same thing when synthesizers and cameras were invented. But nobody takes it seriously unless you're already in the angry person fandom.

In practice AI image generation is ubiquitous at this point. AI image editing is also built into all major phones.

doctorpangloss · 21 days ago

gpt-image-1 is the League of Legends of image generation. It is a tool in front of like 30 million DAUs...

ACCount36 · 21 days ago

Social stigma? Only if you listen to mentally ill Twitter users.

It's more that the novelty just wore off. Mainstream image generation in online services is "good enough" for most casual users - and power users are few, and already knee deep in custom workflows. They aren't about to switch to the shiny new thing unless they see a lot of benefits to it.

SV_BubbleTime · 20 days ago

Considering they have not released their image editor weights, I’m not sure how you could conclude that it is better than Flux Kontext, aside from the graphs they put out.

But, obviously you wouldn’t do that. Right? Did you look at the scaling on their graphs?

rapamycin · 20 days ago

It’s either a shill or just an AI bot

rapamycin · 20 days ago

Have you tried the image editing?

blueboo · 21 days ago

Do try it. The image quality and diversity is pretty shocking and not in a good way.

dewarrn1 · 21 days ago

Slightly hyperbolic, gpt-image-1 is better on at least a couple of the text metrics.

toisanji · 21 days ago

how can it beat gpt-image-1 if there is no image editor?

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    # Assumed values; the original snippet does not define these
    model_name = "Qwen/Qwen-Image"
    device = "cuda"

    # Configure NF4 quantization
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder"],
    )

    # Load the pipeline with NF4 quantization
    pipe = DiffusionPipeline.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
        use_safetensors=True,
        low_cpu_mem_usage=True,
    ).to(device)

Does anyone know how they actually trained text rendering into these models?

To me they all seem to suffer from the same artifacts, that the text looks sort of unnatural and doesn't have the correct shadows/reflections as the rest of the image. This applies to all the models I have tried, from OpenAI to Flux. Presumably they are all using the same trick?

yorwba · 21 days ago

It's on page 14 of the technical report. They generate synthetic data by putting text on top of an image, apparently without taking the original lighting into account. So that's the look the model reproduces. Garbage in, garbage out.

Maybe in the future someone will come up with a method for putting realistic text into images so that they can generate data to train a model for putting realistic text into images.

Maken · 21 days ago

Wouldn't it make sense to use rendered images for that?

doctorpangloss · 21 days ago

i'm not sure if that's such garbage as you suggest, surely it is helpful for generalization yes? kind of the point of self-supervised models

halJordan · 21 days ago

If you think diffusing legible, precise text from pure noise is garbage, then wtf are you doing here? The arrogance of the IT crowd can be staggering at times.

Good release! I've added it to the GenAI Showdown site. Overall a pretty good model scoring around 40% - and it definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when it's quantized).

That being said, it still lags pretty far behind OpenAI's gpt-image-1 strictly in terms of prompt adherence for txt2img prompting. However as has already been mentioned elsewhere in the thread, this model can do a lot more around editing, etc.

Side remark: I don't think it's appropriate to mix Imagen 3 and 4. Those are two different models.

vunderba · 20 days ago

Even though I didn't see a significant improvement over Imagen3 in adherence, I agree. Initially the page was just getting a bit crowded but now that I've added a "Show/Hide Models" toggle I'll go ahead and make that change.

nickandbro · 21 days ago

The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This seems to apply those recognizable AI artifacts only to the elements that need to be edited.

That's why Flux Kontext was such a huge deal - it gave you the power of img2img inpainting without needing to manually mask the content.

https://mordenstar.com/blog/edits-with-kontext

diggan · 20 days ago

Seems strange not to include the prompts themselves, in case people are interested in trying to replicate it.

herval · 21 days ago

You can select the area you want edited on 4o, and it’ll keep the rest unchanged

barefootford · 21 days ago

gpt doesn't respect masks
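One way to settle this kind of disagreement empirically would be to diff the edited output against the input everywhere outside the selected region. The helper below is a hypothetical sketch, not part of any model's API: it assumes the original, the edit, and the mask have already been loaded as same-sized 2D grids of grayscale values, with 1 in the mask meaning "allowed to change".

```python
def changed_outside_mask(original, edited, mask, tolerance=0):
    """Count pixels that differ by more than `tolerance` where mask is 0.

    All three arguments are same-sized 2D lists; mask uses 1 for
    "allowed to change" and 0 for "must stay untouched".
    """
    count = 0
    for orig_row, edit_row, mask_row in zip(original, edited, mask):
        for o, e, m in zip(orig_row, edit_row, mask_row):
            if m == 0 and abs(o - e) > tolerance:
                count += 1
    return count
```

A result of 0 means the mask was respected exactly. A small nonzero `tolerance` is probably needed in practice, since re-encoding alone can shift pixel values slightly.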

rwmj · 21 days ago

This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on the GPU and on the CPU, so that's obviously not enough. But am I off by a factor of two? An order of magnitude? Do I need some crazy hardware?

icelancer · 21 days ago

> This may be obvious to people who do this regularly

This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about 10 calculators online you can use and none of them work. Quantization, KV caching, activation, layers, etc all play a role. It's annoying.

But anyway, for this model, you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified RAM on Apple Silicon, and even then, memory bandwidth is shot, so inference is much much slower than GPU/TPU.

cellis · 21 days ago

Also I think you need a 40GB "card", not just 40GB of vram. I wrote about this upthread, you're probably going to need one card, I'd be surprised if you could chain several GPUs together.

will the new AMD AI CPUs work? like an AI HX 395 or the slower 370? I'm stuck on an A2000 w/16GB of VRAM and wondering what's a worthwhile upgrade.

mortsnort · 21 days ago

I believe it's roughly the same size as the model files. If you look in the transformers folder you can see there are around nine 5GB files, so I would expect you need ~45GB of VRAM on your GPU. Usually quantized versions of models are eventually released/created that can run on much less VRAM, but with some quality loss.
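If you already have the weights downloaded, adding up the shard sizes is a quick sanity check on that estimate. This is a minimal sketch using only the standard library; the folder path and the `*.safetensors` pattern are assumptions about how the files are laid out locally.

```python
from pathlib import Path


def total_weight_gb(folder, pattern="*.safetensors"):
    """Sum the sizes of all weight shards in `folder`, in decimal GB."""
    total_bytes = sum(p.stat().st_size for p in Path(folder).glob(pattern))
    return total_bytes / 1e9
```

For example, `total_weight_gb("path/to/transformer")` on a folder holding nine shards of about 5 GB each would come out near 45. That covers the weights only; activations and any other pipeline components need memory on top of it.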

foobarqux · 21 days ago

Why doesn't huggingface list the aggregate model size?

At fp8, model size in GB is roughly the parameter count in billions. So if this was released at fp16 it's 40-ish GB, and if it's quantized to fp4 it's 10-ish.
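The arithmetic behind that rule of thumb is just parameters times bytes per parameter. Here is a rough sketch using the 20B parameter count quoted in this thread; it counts weights only and ignores activations, the text encoder, and framework overhead, so treat the numbers as a lower bound.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 20e9  # 20B parameters, the figure quoted in this thread

for name, bits in [("fp16/bf16", 16), ("fp8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS, bits):.0f} GB")
# fp16/bf16: ~40 GB
# fp8: ~20 GB
# 4-bit: ~10 GB
```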

zippothrowaway · 21 days ago

You're probably going to have to wait a couple of days for 4 bit quantized versions to pop up. It's 20B parameters.

pollinations · 21 days ago

seems to use 17gb of vram like this

update: doesn't work well. this approach seems to be recommended: https://github.com/QwenLM/Qwen-Image/pull/6/files

ethan_smith · 21 days ago

Qwen-Image requires at least 24GB VRAM for the full model, but you can run the 4-bit quantized version with ~8GB VRAM using libraries like AutoGPTQ.

liuliu · 21 days ago

16GiB RAM with 8-bit quantization.

This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).

philipkiely · 21 days ago

For prod inference, 1xH100 is working well.

cjtrowbridge · 20 days ago

two p40 cards together will run this for under $300

For PCs, I take it you'd want one that has two PCIe 4.0 x16 or more recent slots? As in: quite a few consumer motherboards. You then put in two GPUs with 24 GB of VRAM each.

A friend runs this (don't know if they've tried Qwen-Image yet): it's not an "out of this world" machine.

ticulatedspline · 20 days ago

maybe not "out of this world" but still not cheap. probably $4,000 with 3090s. pretty big chunk of change for some ai pictures.

AuryGlenz · 20 days ago

You can’t split diffusion models like that.

pradn · 20 days ago

A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations, but it will result in high fidelity for common business cases (flyers, websites, brochures, etc).
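To make the idea concrete, here is a sketch of what such a higher-level description could look like and how it might be rendered as a vector layer over the generated pixels. No model in this thread is known to emit anything like this; the `TextOverlay` schema and the `to_svg` helper are hypothetical, and the raster background is simply referenced by file name.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class TextOverlay:
    """Hypothetical structured description of one piece of text."""
    text: str
    x: int
    y: int
    font_family: str = "sans-serif"
    font_size: int = 32
    fill: str = "#000000"


def to_svg(background_href, width, height, overlays):
    """Wrap a raster background and text overlays in one SVG document."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'  <image href={quoteattr(background_href)} width="{width}" height="{height}"/>',
    ]
    for o in overlays:
        parts.append(
            f'  <text x="{o.x}" y="{o.y}" font-family={quoteattr(o.font_family)} '
            f'font-size="{o.font_size}" fill={quoteattr(o.fill)}>{escape(o.text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Usage would be something like `to_svg("flyer.png", 800, 600, [TextOverlay("Grand Opening", 40, 80)])`. The text stays crisp at any scale and is trivially editable, but as noted it would not cover text that has to follow perspective, lighting, or surface texture in the scene.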

james_a_craig · 20 days ago

In their own first example of English text rendering, it's mistakenly rendered "The silent patient" as "The silent Patient", "The night circus" as "The night Circus", and miskerned "When stars are scattered" as "When stars are sca t t e r e d".

The example further down has "down" not "dawn" in the poem.

For these to be their hero image examples, they're fairly poor; I know it's a significant improvement vs. many of the other current offerings, but it's clear the bar is still being set very low.

sixhobbits · 20 days ago

Given that it was literally a few months ago when these models could barely do text at all, it seems like the bar just gets higher with each advancement, no matter how impressive.

oceanplexian · 21 days ago

artninja1988 · 21 days ago

Insane how many good Chinese open source models they've been releasing. This really gives me hope

owebmaster · 20 days ago

I have the impression this might be a strategy to help boost the AI bubble. Big tech capex rn is too big to fail

tokioyoyo · 20 days ago

Taking a concrete lead in LLM-world would be a big national win for China.