For reference, here's what you can get with a properly tweaked Stable Diffusion, all running locally on my PC. It can be set up on almost any PC with a mid-range GPU in a few minutes if you know what you're doing. I didn't do any cherry-picking; this is the first thing it generated. 4 images per prompt.
Can you elaborate on “properly tweaked”? When I use one of the Stable Diffusion and AUTOMATIC1111 templates on runpod.io, the results are absolutely worthless.
This is with some of the popular prompts you can find on sites like PromptHero that show amazing examples.
It's been a serious expectation-vs.-reality disappointment for me, so I just pay the Midjourney or DALL-E fees.
1. Use a good checkpoint. Vanilla Stable Diffusion is relatively bad; there are plenty of good ones on Civitai. Here's mine: https://civitai.com/models/94176
2. Use a good negative prompt with good textual inversions (e.g. "ng_deepnegative_v1_75t", "verybadimagenegative_v1.3"; you can download those from Civitai too). Even with a good checkpoint, this is essential to get good results.
3. Use a better sampling method than the default (e.g. I like "DPM++ SDE Karras").
There are more tricks to get even better output (e.g. ControlNet is amazing), but these are the basics; a rough code sketch of points 1-3 follows below.
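To make the list above concrete, here is a minimal sketch of the same three points using Hugging Face's diffusers library rather than the AUTOMATIC1111 UI. The checkpoint and embedding filenames, the prompt, and the step/guidance values are placeholders, not my exact settings; substitute whatever you download from Civitai.

```python
# Minimal sketch of points 1-3 with the `diffusers` library (assumptions:
# a Civitai checkpoint saved as my_checkpoint.safetensors and the two
# negative-embedding files downloaded locally; prompt/settings are examples).
import torch
from diffusers import StableDiffusionPipeline, DPMSolverSDEScheduler

# 1. Load a community checkpoint instead of vanilla Stable Diffusion
pipe = StableDiffusionPipeline.from_single_file(
    "my_checkpoint.safetensors", torch_dtype=torch.float16
).to("cuda")

# 2. Load negative textual-inversion embeddings and reference their tokens
#    in the negative prompt
pipe.load_textual_inversion("ng_deepnegative_v1_75t.pt",
                            token="ng_deepnegative_v1_75t")
pipe.load_textual_inversion("verybadimagenegative_v1.3.pt",
                            token="verybadimagenegative_v1.3")

# 3. Swap the default scheduler for DPM++ SDE with Karras sigmas
#    (the "DPM++ SDE Karras" sampler in AUTOMATIC1111)
pipe.scheduler = DPMSolverSDEScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    prompt="portrait photo of a woman in a forest, golden hour, detailed",
    negative_prompt="ng_deepnegative_v1_75t, verybadimagenegative_v1.3, "
                    "lowres, blurry, bad anatomy",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("out.png")
```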
Are you using txt2img with the vanilla model? SD's actual value is in the large array of higher-order input methods and tooling; the tradeoff is that it requires more knowledge. Like 3D CGI, it's a highly technical area. You don't just type a prompt into it.
You can finetune it on your own material, or choose one of the hundreds of public finetuned models. You can guide it precisely with a sketch, or by extracting a pose from a photo using ControlNets or any other method. You can influence the colors. You can explicitly separate prompt parts so the tokens don't leak into each other. You can use it as a photobashing tool via a plugin for popular image-editing software. Tools like ComfyUI enable extremely complicated pipelines as well, and so on; see the ControlNet sketch below for one example.
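As one concrete example of those higher-order inputs, here is a rough sketch of pose-guided generation with ControlNet via the diffusers library. The model IDs are the public lllyasviel/sd-controlnet-openpose and runwayml/stable-diffusion-v1-5 checkpoints, and pose.png is assumed to be an OpenPose skeleton already extracted from a reference photo.

```python
# Rough sketch of ControlNet pose guidance with `diffusers` (assumption:
# pose.png is an OpenPose skeleton image extracted beforehand, e.g. with
# an OpenPose preprocessor).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose = load_image("pose.png")  # the pose the output should follow

image = pipe(
    prompt="a knight in ornate armor standing in a misty courtyard",
    image=pose,                 # ControlNet conditioning image
    num_inference_steps=30,
).images[0]
image.save("knight.png")
```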
You're not going to get anywhere close to Midjourney or even Bing quality on SD without finetuned models. It's that simple. And when you do use a finetune, it will be restricted to that aesthetic, and you won't get the same prompt understanding or adherence.
For all the promise of control and customization SD boasts, Midjourney beats it hands down in sheer quality. There's a reason something like 99% of AI-art comic creators stick to Midjourney despite the control handicap.
First off, are you using a custom model or the default SD model? The default model is not the greatest. Have you tried ControlNet?
But yes, SD can be a bit of a pain to use. Think of it like this: SD = Linux, Midjourney = Windows/macOS. SD is more powerful and more user-controllable, but that also means it has a steeper learning curve.
I am sure you're right, but "if you know what you're doing" does a lot of heavy lifting here.
We could just as easily say "hosting your own email can be set up in a few minutes if you know what you're doing". I could do that, but I couldn't get local SD to generate comparable images if my life depended on it.
Thanks for doing this; I would like to include these in the blog post as well. Can I use them and credit you for them? (Let me know what you'd like linked.)
Those are amazing. Please consider writing a blog post about the steps you took to install and tweak Stable Diffusion to achieve these results; I'm sure many of us would love to read it.
The stagnation has been very curious. They are part of a large and generally competent org, which has otherwise remained far ahead of the competition with things like GPT-4. Except... for DALL-E 2, which did not just stagnate for over a year (on top of its bizarre blind spots, like garbage anime generation) but actually seemed to get worse. They have an experimental model of some sort that some people have access to, but even that is nothing to write home about compared to the best models like Parti or eDiff-I.
I suspect that they consider txt2img to be more of a curiosity now. Sure, it's transformative; it's going to upend whole markets (and make some people a lot of money in the process); however, it's still just producing images. Contrast that with LLMs, which have already proven to be generally applicable in a great many domains, and which, if you squint, are probably capturing the basic mechanisms of thinking. OpenAI lost the lead in txt2img, but GPT-4 is still way ahead of every other LLM. It makes sense for them to focus pretty much 100% on that.
Nobody is able to use Parti or eDiff-I. Among models you can actually use, the experimental DALL-E / Bing Image Creator is second only to Midjourney in my experience.
I don't know; what I saw in there (particularly with the haunted house) was a far broader POTENTIAL RANGE of outputs. I get that they were cheesier outputs, but it seems to me that those outputs were just as capable of coming from the other 'AIs'… if you let them.
It's like each of these has a hidden giant pile of negative prompts, or additional positive prompts, that greatly narrow down the range of output. There are contexts where the Dall-E 'spoopy haunted house ooooo!' imagery would be exactly right… like 'show me halloweeny stock art'.
That haunted house prompt didn't explicitly SAY 'oh, also make it look like it's a photo out of a movie and make it look fantastic'. But something in the more 'competitive' AIs knew to go for that. So if you wanted to go for the spoopy cheesey 'collective unconscious' imagery, would you have to force the more sophisticated AIs to go against their hidden requirements?
Mind you if you added 'halloween postcard from out of a cheesey old store' and suddenly the other ones were doing that vibe six times better, I'd immediately concede they were in fact that much smarter. I've seen that before, too, in different Stable Diffusion models. I'm just saying that the consistency of output in the 'smarter' ones can also represent a thumb on the scale.
They've got to compete by looking sophisticated, so the 'Greg Rutkowskification' effect will kick in: you show off by picking a flashy style to depict rather than going for something equally valid, but less commercial.
It's not just about the haunted house. Just look closely at the DALL-E 2 living room pictures. None of it makes any sense, and we're not even talking about subtle details: each of the first three pictures has a central object, the one the eye should be drawn to, that's just a total mess. (The table being subsumed by a bunch of melting brown chairs in the first one, the I-don't-even-know-what in the second picture, and the whatever-this-is on the blue carpet.)
OpenAI screwed that one up by trying to control it. Stable Diffusion, on the other hand, gives me hope that AI can be high quality and open (not only in name).
Can't wait to have something like Stable Diffusion but for LLMs.
Midjourney is still so far ahead that there's no competition. I did a lot of testing today and Firefly generated so many errors with fingers and such; I haven't seen that since the original Stability release. Does anyone know whether the web Firefly and the Photoshop version are the same model?
It's worth noting the difference in how the training material is sourced, though: Midjourney uses indiscriminate web scrapes, while Firefly takes the conservative approach of only using images that Adobe holds a license for. Midjourney has the Sword of Damocles hanging over its head: depending on how legal precedent shakes out, its output might end up being too tainted for commercial purposes. Adobe is betting on being the safe alternative during the period of uncertainty, and especially if the hammer does come down on web-scraping models.
I'm presuming you're not including Stable Diffusion when you say this; the fact that SD and its variants are de facto extremely "free and open source" presently puts it way ahead of anything else, and is likely to for some time.
As far as I can tell, anyone who's creating images is using Midjourney. This sounds like the same "Linux is open so it's way better" argument; tell that to the trillion-dollar companies that bet against it.
I share the same opinion, but I also dislike these tests because each system benefits from a different approach to prompting. What I use to get a good result in Midjourney won't work in Stable Diffusion, for example. Instead, when making these comparisons, one needs to set an objective and have people who are familiar with each system produce their nicest images, since this is a better reflection of real-world usage. For example, ask each participant to read a chapter or page from a book with a lot of specific imagery and then use AI to create what they think it looks like.
Regarding image generation in Photoshop I can confirm two things:
- It is excellent for in-painting and out-painting, with a few exceptions*
- It remains poor for generating a brand new image
*Photoshop's generative fill is very good at extending landscapes: it will match lighting and, according to the release video, can be smart enough to work out what a reflection should contain even if that is not specifically included in the image (in their launch demo they showed how a reflection pool captured the underside of a vehicle).
Where generative fill falls apart: inserting new objects that are not well defined produces problems. Choosing something like a VW Beetle will produce a good result because it is well defined; choosing something like "boat", "dragon", or even "pirate's chest" will produce a range of images that do not necessarily fit the scene. This is likely because source imagery for such objects is vague and prone to different representations.
A note about Firefly: anything that is likely to produce a spherical-looking shape tends to be blocked, likely because it resembles certain human anatomy. This is problematic when doing small touch-ups such as fixing fingers.
A special note about Photoshop versus other systems: Photoshop has the added problem of needing to match the resolution of the source material. Currently it achieves this by combining upscaling with resizing. This means that if one is extending an area with high detail, that detail cannot be maintained and instead comes out softer and blurrier than the original sections. It also means that if one extends directly from the border of an image, a feathered edge becomes visible that must be corrected by hand.
I currently test the following AI generators; feel free to ask me about any of them: Stable Diffusion (AUTOMATIC1111 and InvokeAI), OpenAI's DALL-E 2, Midjourney, Stability AI's DreamStudio, and Adobe Firefly.
None of these can do text well. There's a model that does do text and composition well, but the name escapes me. And the general quality is much lower overall, so it's a pretty heavy tradeoff.
I had done a similar comparison a couple months back but used Lexica instead of DALL-E.
Seems clear to me that Midjourney has by far the best "vibes" understanding. Most models get the items right but not the lighting. Firefly seems focused on realism which makes sense for a photography audience.
Kind of strange to me that they didn't test any prompts with people in them. In my experience that tends to show the limitations of various models pretty quickly.
Adobe Firefly is actually extremely competent, especially since it doesn't use copyrighted images in its training set. Using MidJourney (which is fantastic) commercially will be a quagmire for the unlucky company that draws a lawsuit.
1st prompt: https://i.postimg.cc/T3nZ9bQy/1st.png
2nd prompt: https://i.postimg.cc/XNFm3dSs/2nd.png
3rd prompt: https://i.postimg.cc/c1bCyqWR/3rd.png
screenshot of the options interface: https://stash.cass.xyz/drawthings-1687292611.png
Here, I've uploaded it to civitai: https://civitai.com/models/94176
There are plenty of other good models too though.
1: https://media.discordapp.net/attachments/1060989219432054835...
2: https://media.discordapp.net/attachments/1060989219432054835...
3: https://media.discordapp.net/attachments/1060989219432054835...
https://imgur.com/a/siQG06O
https://imgur.com/a/vp2oOHu
update: I've edited the post to include these results as well
Which is fair enough: when you are a (relatively) small company competing with the likes of Google and Meta, you really need to focus.
If Stable Diffusion hadn't launched, DALL-E 2 would still have been valuable.
If I create a Mickey Mouse using Photoshop, would Adobe be liable for it?
https://twitter.com/fanahova/status/1639325389955952640?s=46...