Am I the only one who doesn't see an obvious difference in quality between the left and right photos? (Maybe the wolf one.) And these are extremely curated examples!
Objective comparison is always so tricky with Stable Diffusion. They should show off large batches, at the very least.
I think Stability is ostensibly showing that the images are closer to the prompt (and the left wolf in particular has some distortion around the eyes).
Comps here - https://imgur.com/a/FfECIMP
No, I have generated a few thousand Midjourney images and there is actually quite a difference in these images.
It is hard to describe, but there is a very unnatural "sheen" to the images on the left.
The SDXL 0.9 images look more photorealistic, but they still aren't quite at the level Midjourney can reach.
The best example is the wolf's hair between the ears in the SDXL 0.9 image. It is just a little too noisy and wavy compared to how a real wolf photo would look. Midjourney 5.1 --style raw would still handily beat this image for a photorealistic wolf.
The jacket on the alien in the SDXL 0.9 image also has too much of that AI sheen, but it kind of works in this image as an effect for the jacket material, so it's not really the best example.
The coffee cup isn't very good in either of them, IMO. The trees on the right are still not blurred quite right. They're also hiding the hand in the image on the right. You can see how bad the little and ring fingers are in the left image.
Obviously, this is all very nitpicky.
For the aliens, the right image has much more realistic gradation. The one on the left looks like the grays have been crushed out of it. There's also a funky glow coming from the right edge of the alien.
I'd say the blur effects on the left images are much cleaner as well. There are some weird artifacts at the fringes of objects in the earlier version.
The real comparison should be with SD 1.5/2.1, and there it is WAY better.
At the resolution provided they are indeed very close. In my eyes:
In the first example, the second image is more representative of Las Vegas to a foreigner like me, but neither of them meets the scratchy found-film requirement.
In the second example, both fit the prompt, but the first image looks more like it comes from a documentary than the second one.
In the third example, the hand in the second picture looks much better.
The wolf looks better, but also looks less like what you'd see in a "nature documentary" (part of the prompt).
I think the coffee cup looks better in the right photo; it seems a tad more real to me.
Like you I much prefer the alien photo on the left, but the photos are so stylistically different I'm not sure that says anything about the releases' respective capabilities.
I prefer the composition of the beta model over the release. Quality-wise I can't say one is better than the other. Maybe the hand in the coffee picture is better in the 0.9 model.
Combining the results of multiple models and then adding another layer onto the combined output tends to increase accuracy / reduce error rates. (not new to AI: it's been done for over a decade)
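For the curious, here is a toy illustration of that pattern with scikit-learn (nothing SDXL-specific; the dataset and base models are arbitrary): two models are fit, and a small logistic-regression "layer" is trained on their combined outputs.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy data and two base models whose outputs get combined.
    X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    base_models = [
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ]

    # The "another layer on top": a meta-model trained on the base outputs.
    stack = StackingClassifier(estimators=base_models,
                               final_estimator=LogisticRegression())
    stack.fit(X_tr, y_tr)
    print("stacked:", stack.score(X_te, y_te))

    # Each base model on its own, for comparison.
    for name, model in base_models:
        print(name, "alone:", model.fit(X_tr, y_tr).score(X_te, y_te))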
“The model can be accessed via ClipDrop today with API coming shortly. Research weights are now available with an open release coming mid-July as we move to 1.0.”
I read this as: commercial use through our API now, self-hosted commercial use in July.
The Stability AI API/DreamStudio API is slightly different. Yes, it's confusing.
NGL, I can't wait to get hold of this model file and run it locally; I'll be sure to do a write-up on it on my AI blog, https://soartificial.com. I just hope that my GPU can handle it. I don't think 8 GB of VRAM is going to be enough, so I might have to tinker with some settings.
I'm just looking forward to the custom LoRA files we can use with it :D
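On the 8 GB worry a couple of comments up, these are the usual diffusers memory-saving knobs; whether they're enough for SDXL 0.9 is a guess, and the repo id below is a placeholder for the gated research weights:

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9",  # placeholder, gated repo id
        torch_dtype=torch.float16,                   # FP16 weights halve memory vs FP32
    )

    # Trade speed for VRAM: compute attention in slices and decode the VAE
    # in slices instead of all at once.
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

    # Keep sub-models in system RAM and move each one to the GPU only while
    # it runs (requires the accelerate package); no pipe.to("cuda") needed.
    pipe.enable_model_cpu_offload()

    image = pipe("an alien in a denim jacket, 85mm photo").images[0]
    image.save("alien.png")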
RIP my 1080 Ti.
Does anyone know what specific feature they need which 20-series and newer cards have and older ones don't?
Edit - it's not the RAM. The 1080 Ti has 11 GB and this press release says it requires 8, so I'm going to speculate that it's because the 1080 lacks the tensor cores of the 20-series' Turing architecture.
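For anyone wondering which side of that line their card falls on, a quick (unofficial) check is the CUDA compute capability PyTorch reports: tensor cores first appeared at compute capability 7.0 (Volta), Turing 20-series cards report 7.5, and Pascal cards like the 1080 Ti report 6.1. A minimal sketch:

    import torch

    # Rough heuristic: tensor cores exist on compute capability >= 7.0
    # (Volta/Turing and newer); Pascal cards such as the GTX 1080 Ti report 6.1.
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        name = torch.cuda.get_device_name(0)
        has_tensor_cores = (major, minor) >= (7, 0)
        print(f"{name}: compute capability {major}.{minor}, "
              f"tensor cores: {'yes' if has_tensor_cores else 'no'}")
    else:
        print("No CUDA device visible to PyTorch.")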
Since it is now split into two models to do the generation, you could load one and do the first stage of a bunch of images, then load the second and complete them, with half the VRAM usage.
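A rough sketch of that two-stage batching idea using the Hugging Face diffusers interface. The repo ids are assumptions (the 0.9 research weights are gated) and the exact arguments may differ, but the point is that only one of the two models ever sits in VRAM:

    import gc
    import torch
    from diffusers import DiffusionPipeline

    prompts = ["a wolf in a nature documentary", "an astronaut sipping coffee"]

    # Stage 1: run the base model over the whole batch, keeping only latents.
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9",  # assumed, gated repo id
        torch_dtype=torch.float16,
    ).to("cuda")
    latents = [base(prompt=p, output_type="latent").images for p in prompts]

    # Free the base model before loading the refiner, so only one UNet
    # is resident on the GPU at a time.
    del base
    gc.collect()
    torch.cuda.empty_cache()

    # Stage 2: the refiner finishes each image from its latent.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9",  # assumed, gated repo id
        torch_dtype=torch.float16,
    ).to("cuda")
    images = [refiner(prompt=p, image=l).images[0] for p, l in zip(prompts, latents)]
    images[0].save("wolf.png")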
Low precision support, almost certainly. SD 1.5 needs almost twice the memory on a 10xx card as on 20xx, because you can't use FP16; a triple bummer, since that makes it even slower (memory bandwidth!) and you don't have as much to begin with.
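To put numbers on the "almost twice the memory" point, a back-of-envelope estimate for the weights alone (the parameter count below is a placeholder, not an official SDXL figure; activations, the text encoders and the VAE add more on top):

    # Only bytes-per-parameter changes between FP32 and FP16 weight storage.
    params = 2.5e9  # placeholder parameter count

    for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
        gib = params * bytes_per_param / 2**30
        print(f"{dtype}: ~{gib:.1f} GiB just for the weights")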
Any speculation why the AMD cards require twice the VRAM that Nvidia cards do? I have an RX 6700 XT and I'm disappointed that my 12 GB won't be enough.
Text will be better due to scale alone, but it will still be limited by the use of CLIP for text encoding (BPEs + contrastive training). So SDXL 0.9 may improve, but it should still be worse at rendering text than models that use T5, like https://github.com/deep-floyd/IF
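A quick way to see the BPE point the parent is making (the tokenizer checkpoint here is the one SD 1.x used and is just for illustration; SDXL's own text encoders may split things differently):

    from transformers import CLIPTokenizer

    # CLIP's BPE vocabulary chops prompts into subword pieces, and its
    # contrastive training objective summarizes the whole prompt into one
    # embedding, which is part of why exact text rendering is hard.
    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    for text in ["Las Vegas at night", "photorealistic", "SDXL 0.9"]:
        print(text, "->", tok.tokenize(text))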
Does anyone have comparisons of how the model does on specific artist styles?
Simple prompts like "By $ARTISTNAME" worked very well in SD v1.5, and less so in v2.x, depending on the artist in question.
Also, they’ve (re-)established a universal law of AI: fuck it, just ensemble it.
Not sure if true, sounds plausible tho.
My guess is AMD users will eventually get low VRAM compatibility through Vulkan ports (like SHARK/Torch MLIR or Apache TVM).
Then again, the existing Vulkan ports were kinda obscure and unused with SD 1.5/2.1.
We had this for SD 1.5, but it always stayed obscure and unpopular for some reason... I hope it's different this time around.