Am I the only one who doesn't see an obvious difference in quality between the left and right photos? (Maybe the wolf one.) And these are extremely curated examples!
Objective comparison is always so tricky with Stable Diffusion. They should show off large batches, at the very least.
I think Stability is ostensibly showing that the images are closer to the prompt (and the left wolf in particular has some distortion around the eyes).
Comps here - https://imgur.com/a/FfECIMP
No, I have generated a few thousand Midjourney images and there is actually quite a difference in these images.
It is hard to describe, but there is a very unnatural "sheen" to the images on the left.
The SDXL 0.9 images look more photorealistic, but they still aren't quite at the level Midjourney can reach.
The best example is the wolf's hair between the ears in the SDXL 0.9 image. It is just a little too noisy and wavy compared to how a real wolf photo would look. Midjourney 5.1 --style raw would still handily beat this image for a photorealistic wolf.
The jacket on the alien in the SDXL 0.9 image also has too much of that AI sheen, but it kind of works in this image as an effect for the jacket material, so it's not really the best example.
The coffee cup isn't very good in either of them, IMO. The trees on the right are still not blurred quite right. They're also hiding the hand in the image on the right. You can see how bad the little and ring fingers are in the left image.
Obviously, this is all very nitpicky.
For the aliens, the right image has much more realistic gradation. The one on the left looks like the grays have been crushed out of it. There's also a funky glow coming from the right edge of the alien.
I'd say the blur effects on the left images are much cleaner as well. There are some weird artifacts at the fringes of objects in the earlier version.
The real comparison should be with SD 1.5/2.1, and there it is WAY better.
At the resolution provided they are indeed very close. In my eyes:
In the first example, the second image is more representative of Las Vegas to a foreigner like me, but neither of them meets the scratchy found-film requirement.
In the second example, both fit the prompt, but the first image looks more like it comes from a documentary than the second one.
In the third example, the hand in the second picture looks much better.
The wolf looks better, but also looks less like what you'd see in a "nature documentary" (part of the prompt).
I think the coffee cup looks better in the right photo; it seems a tad more real to me.
Like you I much prefer the alien photo on the left, but the photos are so stylistically different I'm not sure that says anything about the releases' respective capabilities.
I prefer the composition of the beta model over the release. Quality-wise I can't say one is better than the other. Maybe the hand in the coffee picture is better in the 0.9 model.
Combining the results of multiple models and then adding another layer onto the combined output tends to increase accuracy / reduce error rates. (not new to AI: it's been done for over a decade)
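For the curious, here is a toy illustration of that pattern with scikit-learn (nothing SDXL-specific; the dataset and base models are arbitrary): two models are fit, and a small logistic-regression "layer" is trained on their combined outputs.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy data and two base models whose outputs get combined.
    X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    base_models = [
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ]

    # The "another layer on top": a meta-model trained on the base outputs.
    stack = StackingClassifier(estimators=base_models,
                               final_estimator=LogisticRegression())
    stack.fit(X_tr, y_tr)
    print("stacked:", stack.score(X_te, y_te))

    # Each base model on its own, for comparison.
    for name, model in base_models:
        print(name, "alone:", model.fit(X_tr, y_tr).score(X_te, y_te))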
“The model can be accessed via ClipDrop today with API coming shortly. Research weights are now available with an open release coming mid-July as we move to 1.0.”
I read this as: commercial use through our API now, self-hosted commercial use in July.
The Stability AI API/DreamStudio API is slightly different. Yes, it's confusing.
NGL, I can't wait to get hold of this model file and run it locally; I'll be sure to do a write-up on it on my AI blog, https://soartificial.com. I just hope that my GPU can handle it. I don't think 8 GB of VRAM is going to be enough, so I might have to tinker with some settings.
I'm just looking forward to the custom LoRA files we can use with it :D
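On the 8 GB worry a couple of comments up, these are the usual diffusers memory-saving knobs; whether they're enough for SDXL 0.9 is a guess, and the repo id below is a placeholder for the gated research weights:

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9",  # placeholder, gated repo id
        torch_dtype=torch.float16,                   # FP16 weights halve memory vs FP32
    )

    # Trade speed for VRAM: compute attention in slices and decode the VAE
    # in slices instead of all at once.
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

    # Keep sub-models in system RAM and move each one to the GPU only while
    # it runs (requires the accelerate package); no pipe.to("cuda") needed.
    pipe.enable_model_cpu_offload()

    image = pipe("an alien in a denim jacket, 85mm photo").images[0]
    image.save("alien.png")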
RIP my 1080 Ti.
Does anyone know what specific feature they need which 20-series and newer cards have and older ones don't?
Edit - it's not the RAM. The 1080 Ti has 11 GB and this press release says it requires 8, so I'm going to speculate that it's because the 1080 lacks the tensor cores of the 20-series' Turing architecture.
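For anyone wondering which side of that line their card falls on, a quick (unofficial) check is the CUDA compute capability PyTorch reports: tensor cores first appeared at compute capability 7.0 (Volta), Turing 20-series cards report 7.5, and Pascal cards like the 1080 Ti report 6.1. A minimal sketch:

    import torch

    # Rough heuristic: tensor cores exist on compute capability >= 7.0
    # (Volta/Turing and newer); Pascal cards such as the GTX 1080 Ti report 6.1.
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        name = torch.cuda.get_device_name(0)
        has_tensor_cores = (major, minor) >= (7, 0)
        print(f"{name}: compute capability {major}.{minor}, "
              f"tensor cores: {'yes' if has_tensor_cores else 'no'}")
    else:
        print("No CUDA device visible to PyTorch.")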
Since it is now split into two models to do the generation, you could load one and do the first stage of a bunch of images, then load the second and complete them, with half the VRAM usage.
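A rough sketch of that two-stage batching idea using the Hugging Face diffusers interface. The repo ids are assumptions (the 0.9 research weights are gated) and the exact arguments may differ, but the point is that only one of the two models ever sits in VRAM:

    import gc
    import torch
    from diffusers import DiffusionPipeline

    prompts = ["a wolf in a nature documentary", "an astronaut sipping coffee"]

    # Stage 1: run the base model over the whole batch, keeping only latents.
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9",  # assumed, gated repo id
        torch_dtype=torch.float16,
    ).to("cuda")
    latents = [base(prompt=p, output_type="latent").images for p in prompts]

    # Free the base model before loading the refiner, so only one UNet
    # is resident on the GPU at a time.
    del base
    gc.collect()
    torch.cuda.empty_cache()

    # Stage 2: the refiner finishes each image from its latent.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9",  # assumed, gated repo id
        torch_dtype=torch.float16,
    ).to("cuda")
    images = [refiner(prompt=p, image=l).images[0] for p, l in zip(prompts, latents)]
    images[0].save("wolf.png")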
Low precision support, almost certainly. SD 1.5 needs almost twice the memory on a 10xx card as on 20xx, because you can't use FP16; a triple bummer, since that makes it even slower (memory bandwidth!) and you don't have as much to begin with.
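To put numbers on the "almost twice the memory" point, a back-of-envelope estimate for the weights alone (the parameter count below is a placeholder, not an official SDXL figure; activations, the text encoders and the VAE add more on top):

    # Only bytes-per-parameter changes between FP32 and FP16 weight storage.
    params = 2.5e9  # placeholder parameter count

    for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
        gib = params * bytes_per_param / 2**30
        print(f"{dtype}: ~{gib:.1f} GiB just for the weights")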
Any speculation why the AMD cards require twice the VRAM that Nvidia cards do? I have an RX 6700 XT and I'm disappointed that my 12 GB won't be enough.
Text will be better due to scale alone, but it will still be limited by the use of CLIP for text encoding (BPEs + contrastive training). So SDXL 0.9 may improve, but it should still be worse at rendering text than models that use T5, like https://github.com/deep-floyd/IF
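A quick way to see the BPE point the parent is making (the tokenizer checkpoint here is the one SD 1.x used and is just for illustration; SDXL's own text encoders may split things differently):

    from transformers import CLIPTokenizer

    # CLIP's BPE vocabulary chops prompts into subword pieces, and its
    # contrastive training objective summarizes the whole prompt into one
    # embedding, which is part of why exact text rendering is hard.
    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    for text in ["Las Vegas at night", "photorealistic", "SDXL 0.9"]:
        print(text, "->", tok.tokenize(text))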
Does anyone have comparisons of how the model does on specific artist styles?
Simple prompts like "By $ARTISTNAME" worked very well in SD v1.5, and less so in v2.x, depending on the artist in question.
Also, they’ve (re-)established a universal law of AI: fuck it, just ensemble it.
Not sure if true, sounds plausible tho.
My guess is AMD users will eventually get low VRAM compatibility through Vulkan ports (like SHARK/Torch MLIR or Apache TVM).
Then again, the existing Vulkan ports were kinda obscure and unused with SD 1.5/2.1.
We had this for SD 1.5, but it always stayed obscure and unpopular for some reason... I hope it's different this time around.