drschwabe · 2 years ago
The ControlNet model, specifically the scribble ControlNet (and ComfyUI), was a major gamechanger for me.

Was getting good results with just SD and occasional masking, but it would take hours and hours to home in on and composite a complex scene with specific requirements & shapes (with most of the work spent curating the best outputs and then blending them into a scene with Gimp/Inkscape).

Masking is unintuitive compared to the scribble, which achieves a similar effect; no need to paint masks (which is disruptive to the natural process of 'drawing' IMO), instead just make a general black-and-white outline of your scene. Simply dial the conditioning strength up or down to have it follow that outline more tightly or more loosely.
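
If you want to script that outside a UI, a rough diffusers sketch of the same idea looks something like the below (the checkpoints and the 0.8 strength are just illustrative assumptions on my part):

  # Rough sketch: scribble ControlNet via the diffusers library.
  # Checkpoint names and the 0.8 strength are illustrative assumptions.
  import torch
  from PIL import Image
  from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

  controlnet = ControlNetModel.from_pretrained(
      "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
  )
  pipe = StableDiffusionControlNetPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
  ).to("cuda")

  scribble = Image.open("scene_outline.png")  # your black-and-white outline

  # controlnet_conditioning_scale is the "dial": higher follows the outline
  # more tightly, lower lets the model drift from it.
  image = pipe(
      "a cluttered wizard's workshop, warm lighting",
      image=scribble,
      controlnet_conditioning_scale=0.8,
  ).images[0]
  image.save("result.png")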

You can also use Gimp's Threshold tool or Inkscape's Trace Bitmap to get a decent black-and-white outline from an existing bitmap to expedite the scribble procedure.

fsloth · 2 years ago
ComfyUI is really nice. The fact that the node graph is saved as PNG metadata actually makes node-based workflows super fluent and reproducible, since all you need to do to get the graph for your image is drag and drop the result PNG onto the GUI. This feels like a huge quality-of-life improvement compared to any other lightweight node tools I've used.
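
For the curious, you can also pull the embedded graph back out yourself with a few lines of Pillow; a rough sketch below, assuming the graph lives in the PNG text chunks under the "workflow"/"prompt" keys (treat those key names as an assumption):

  # Rough sketch: read the ComfyUI graph embedded in an output PNG.
  # Assumption: the graph is stored in PNG text chunks named "workflow"/"prompt".
  import json
  from PIL import Image

  img = Image.open("ComfyUI_00001_.png")
  workflow = img.info.get("workflow") or img.info.get("prompt")
  if workflow:
      graph = json.loads(workflow)  # the full node graph as JSON
      print(f"Embedded graph has {len(graph)} top-level entries")
  else:
      print("No embedded workflow found")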
drschwabe · 2 years ago
Yeah, the PNG-embedded 'drag and drop to restore your config' is brilliant.

Reminds me of Fireworks, which Adobe killed off (after putting out a decent update or two, to be fair), and which used PNGs for layers and metadata à la the PSD format.

But it's more analogous to a 3D modelling suite like Blender or Maya with a hypothetical feature where you could take a rendered output image, drag and drop it back into the 3D viewport, and have it instantly restore the exact render settings you used. That would be handy!

l33tman · 2 years ago
You don't need to go through Gimp or Inkscape; this is built into the Auto1111 ControlNet UI. You just dump the existing photo there and select from a bunch of pre-processors like edge detection or 3D depth extraction, whose output is then fed into ControlNet to generate a new image.

This is super powerful, for example for visualizing the renovation of an apartment room or house exterior.
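
If you ever want to script that flow instead of using the UI, it's roughly "preprocess, then condition"; here's a hedged sketch with a Canny edge preprocessor (checkpoints and thresholds are just example assumptions):

  # Sketch of the "photo -> edge map -> ControlNet" flow described above.
  # Checkpoint names and Canny thresholds are illustrative assumptions.
  import cv2
  import numpy as np
  import torch
  from PIL import Image
  from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

  photo = cv2.imread("room_before.jpg")
  edges = cv2.Canny(photo, 100, 200)  # the edge-detection "pre-processor"
  edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

  controlnet = ControlNetModel.from_pretrained(
      "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
  )
  pipe = StableDiffusionControlNetPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
  ).to("cuda")

  # Same room layout, new look - handy for renovation mockups.
  result = pipe("bright renovated scandinavian living room", image=edge_map).images[0]
  result.save("room_after.png")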

drschwabe · 2 years ago
Will have to play with those more, thanks for the heads-up. I do find, however, that for scribble outlines I often like to draw my own lines by hand instead of using an auto-generated one, to emphasize the absolute key areas that would not otherwise be auto-identified. Logo and 2D design, for example, where you may have a very specific text shape that needs to be preserved regardless of contrast or perceivable depth.
gkeechin · 2 years ago
That's for sure - I think we have seen other kinds of edge detectors or filters work better for different use cases, especially around foreground images where you want to retain more information (e.g. images with small nitty-gritty details).

In this post, we just seek to showcase the fastest way to do it - and how augmentation may potentially help vary the position!

moelf · 2 years ago
any tutorial you would recommend? I found https://comfyanonymous.github.io/ComfyUI_examples/controlnet...
drschwabe · 2 years ago
Yeah, that tutorial is decent; it's what I used to get going.

Note that all of the images in those comfy tutorials (except for images of the UI itself) can be dragged and dropped into ComfyUI and you'll get the entire node layout, which you can use to understand how it works.

Another good resource is civitai - specifically, look for images that have ComfyUI metadata embedded. I made a feature request that they create a tag for uploaders to flag ComfyUI PNGs, but I'm not sure if they've added that yet. Or cruise Reddit or Discord for people sharing PNGs with comfy embeds.

Trying out different models (also available from civitai) is a good way to get an understanding of how swapping out models affects performance and the results. I've been abusing AbsoluteReality (v1.81) + the More Details LoRA because it's just so damn fast and the results are great for almost any requirement I throw at it. AI moves so fast, but I don't even bother updating the models anymore; there is just so much potential in the models we already have. There's more payoff in mastering other techniques like the depth-map ControlNet.

I would say that, above all, extensive familiarity with an image editor like Photoshop, Gimp, or Krita will get you the most mileage, particularly if you have specific needs beyond just fun and concepting. AI art makes artists better; people who struggle with image editing will struggle to get the most out of this new tech, just as people who struggle with code will have issues maintaining the code Copilot or ChatGPT spits out (versus a coder who will refactor and fine-tune before integrating it into the rest of their application).

shostack · 2 years ago
Is there any solution for consistency yet that goes beyond form and structure and keeps things like outfits, color, and facial features consistent, in a way that makes it easy to compose scenes with multiple consistent characters?
dragonwriter · 2 years ago
LoRA for specific items/characters + regional prompting covers a lot of that area.
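
For the LoRA half, a minimal diffusers sketch (the LoRA filename and the 0.8 scale are hypothetical; regional prompting usually comes from an extension such as A1111's Regional Prompter or ComfyUI area-conditioning nodes rather than a single API call):

  # Minimal sketch: apply a character LoRA on top of a base SD pipeline.
  # "my_character_lora.safetensors" and the 0.8 scale are hypothetical.
  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")
  pipe.load_lora_weights("./loras", weight_name="my_character_lora.safetensors")

  image = pipe(
      "portrait of mychar in a red coat, city street at night",
      cross_attention_kwargs={"scale": 0.8},  # how strongly the LoRA is applied
  ).images[0]
  image.save("consistent_character.png")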
rvion · 2 years ago
I'm building CushyStudio (https://github.com/rvion/cushystudio#readme) to make Stable Diffusion practical and fun to play with.

It's still a bit rough around the edges, and I haven't properly launched it yet, but if you want to play with ControlNets, pre-processors, IP adapters, and all those various SD technologies, it's a pretty fun tool! I personally use it for real-time scribble-to-image, things like that :)

(I'll post it properly on HN in a few days or a week, I think, once early feedback has been properly addressed.)

bavell · 2 years ago
Looking forward to your launch! I found CushyStudio a while back (maybe from HN?) and cannibalized some of the type-generation code to make my own API wrapper for personal use. Thanks!

I barely got it working in that early alpha, but it was super helpful for me as a reference. I'll give it another go now that it's further along; it seemed very promising and I liked your workflow approach.

imranhou · 2 years ago
The versatility of Stable Diffusion, especially when combined with tools like ControlNet, highlights the advantages of a more controlled image generation process. While DALL-E and others provide ease and speed, the depth of customization and local processing capabilities of SD models cater to those seeking deeper creative control and independence.
ChatGTP · 2 years ago
It is interesting, isn't it? Because we have "AI" generating the image, but we still seem to want to "paint" or have control over the creative process.

Prompts seem to be a new type of camera, lens or paintbrush.

barrkel · 2 years ago
There are at least three "levels" you can consider with image generation: composition, facial likeness, and style. Prompts are pretty weak at composition, which is the strongest point of ControlNets - they do a great deal to make up for the weakness. But there are some compositions SD can't find even when given detailed ControlNets.

Style generality is frequently lost in fine-tuned models. The original DreamBooth tried to get around this by generating lots of images of the class to retain generality, but it's time-intensive to generate all the extra images (and ideally do some QC on them) and train on them too, so it's not often done.

Magi604 · 2 years ago
SD outputs have an "uncanny valley" type of quality to them. You just KNOW when an image is from SD. And I have looked at getting started with SD, but the requirements, the setup, and the positive/negative prompting "language" just kind of turned me off the whole thing.

Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

I guess my point is to ask whether SD is worth bothering with at this time, when DALL-E and Imagen and possibly others are on the brink of becoming mainstream and are just going to get better and better. Clunking together something with SD seems unnecessary when you can generate more results, better results, faster, with fewer requirements, and without the steep learning curve, by using other methods.

jyap · 2 years ago
One major benefit, and the reason I use the Stable Diffusion tools and models, is that I can run them at home on my relatively old NVIDIA 2080 GPU with 8 GB of VRAM. It costs me nothing (besides electricity).

It depends on whether you value this kind of freedom in life.

You can do some things such as colorizing black and white images with the Recolor model.

https://huggingface.co/stabilityai/control-lora

NBJack · 2 years ago
I have to agree about how convenient and (long term) inexpensive this can be. I may not always get the greatest results right away, but it is fun to come up with some ideas, put them into a prompt iterator (or matrix), and run it overnight. I can tweak it to my heart's content.
gkeechin · 2 years ago
Very interesting - thank you for sharing this. Would love to explore this as a team and perhaps put out a blog post on helping others get started with control-lora.
Magi604 · 2 years ago
I mean, I'm running DALL-E 3 in a browser on an old laptop and I've generated probably over 15k images in 2 weeks, spanning the gamut from memes to art to lewds (with jailbreaks). The ability to completely scrap what you're building, start totally fresh at the drop of a hat with a new line of ideas, and get instant results seems pretty freeing to me.
DrSiemer · 2 years ago
> You just KNOW when an image is from SD

No, you know when a beginner generated an image in Stable Diffusion. With enough skill and attention, you will not.

Sure, there is a learning curve and it takes more time to get to a good result. But in turn, it gives you control far beyond what the competition can offer.

smcleod · 2 years ago
I’m assuming you haven’t used SDXL?

Give it a go with InvokeAI - you can create images that I guarantee you wouldn't know were generated. Like anything (photography included), it's a skill.

Examples:

  - https://civitai.com/images/2862100
  - https://civitai.com/images/2339666
  - https://civitai.com/images/2846876

ben_w · 2 years ago
I can see at least three finger issues with the couple in the cinema.

More than that though: I use SDXL quite a bit for fun, and while I like it and it can be very good, it's still prone to getting stuck in a David Cronenberg mode for reasons I can't solve.

moritonal · 2 years ago
At a glance I get uncanny valley from two of them. After looking closer, it's likely because in the photo of the couple at the cinema, the woman's arm around him is wearing the wrong clothes. Then in the photo of the guy with a hat, his neckpiece is asymmetric.
vinckr · 2 years ago
I can run Stable Diffusion on my local machine. It is open source and the weights are public, giving me, in theory, access to anything I want to modify.

I can't change anything on DALL-E; I can just take the input or change the prompt.

Also, it is a centralized service that can be shut down, modified, censored, or made very expensive at any time.

NBJack · 2 years ago
Try SDXL. Find a good negative prompt, then just put a short sentence (starting with the kind of image, such as photograph, render, etc.) describing what you want in the positive prompt. It is much simpler and has fantastic results. Tweak to your heart's desire from there.

If you see a part of the scene that looks weird (and you know what it should be), add it to your prompt. For example, if you want "photo of a jungle in South America" and the foliage looks weird, add something like "with lush trees and ferns".
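
In script form the whole recipe is only a few lines; a sketch with made-up prompts (the negative prompt below is just a common example, not a magic one):

  # Sketch of the "short positive prompt + reusable negative prompt" recipe.
  # The prompts themselves are made-up examples.
  import torch
  from diffusers import StableDiffusionXLPipeline

  pipe = StableDiffusionXLPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
  ).to("cuda")

  negative = "blurry, low quality, deformed, watermark, text"

  image = pipe(
      prompt="photo of a jungle in South America, with lush trees and ferns",
      negative_prompt=negative,
  ).images[0]
  image.save("jungle.png")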

brucethemoose2 · 2 years ago
Try: https://github.com/lllyasviel/Fooocus

I also recommend a good photorealistic base model, like RealVis XL.

In my experience it's like DALL-E but straight up better, more customizable, and local. And that's before you start trying fine-tunes and LoRAs.

Other UIs will do SDXL, but every one I tried is terrible without all those default fooocus augmentations.

AuryGlenz · 2 years ago
SDXL is great, but it's in no way better than DALL-E as far as straight text-to-image goes, apart from the lack of censorship.

It has plenty of other advantages, but you can't tell it "make me a cute illustration of a 2 year old girl with Blaze from Blaze and the Monster Machines on a birthday cake with a large 2 candle on it."

DALL-E will nail that, more or less. SDXL very much won't.

raincole · 2 years ago
> You just KNOW when an image is from SD.

You don't. People think they do, but they don't.

fassssst · 2 years ago
DALL-E within ChatGPT uses GPT-4 to rewrite what you ask for into a good text-to-image prompt. You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.
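
A rough sketch of what that could look like, assuming you have an OpenAI key and a local SDXL pipeline (the system-prompt wording and model names here are placeholders to tune):

  # Rough sketch: use an LLM to "upsample" a short prompt before handing it to SD.
  # The system prompt, model names, and example prompt are placeholder assumptions.
  import torch
  from openai import OpenAI
  from diffusers import StableDiffusionXLPipeline

  client = OpenAI()  # expects OPENAI_API_KEY in the environment
  short_prompt = "a cozy cabin in the woods in winter"

  resp = client.chat.completions.create(
      model="gpt-4",
      messages=[
          {"role": "system", "content": "Rewrite the user's idea as a detailed, "
           "comma-separated Stable Diffusion prompt covering subject, setting, "
           "lighting, camera, and style. Output only the prompt."},
          {"role": "user", "content": short_prompt},
      ],
  )
  detailed_prompt = resp.choices[0].message.content

  pipe = StableDiffusionXLPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
  ).to("cuda")
  pipe(detailed_prompt).images[0].save("cabin.png")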
IanCal · 2 years ago
Somewhat, but DALL-E 3 is hugely better at understanding a description and relationships.
dragonwriter · 2 years ago
> You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.

And, indeed, someone has:

https://github.com/sayakpaul/caption-upsampling

fsloth · 2 years ago
Depends what you want.

DALL-E 3 is super good, but it lacks the creative control that ControlNets and IP-Adapter provide. So, for instance, AFAIK there is no way to perform style transfers, or 'paint a Van Gogh portrait over my pencil sketch'.
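
For reference, that kind of reference-image steering on the SD side looks roughly like this in diffusers; a hedged sketch (the IP-Adapter weight names and the 0.6 scale are assumptions):

  # Hedged sketch: steer generation with a reference/style image via IP-Adapter.
  # Checkpoint/weight names and the 0.6 scale are assumptions for illustration.
  import torch
  from PIL import Image
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")
  pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                       weight_name="ip-adapter_sd15.bin")
  pipe.set_ip_adapter_scale(0.6)  # how strongly the reference image steers the output

  style_ref = Image.open("van_gogh_style.jpg")
  image = pipe(
      "portrait of a man, oil painting",
      ip_adapter_image=style_ref,
  ).images[0]
  image.save("stylized_portrait.png")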

Both are good currently but at different things.

"Prompt engineering" is and will be total BS. DALL-E 3/ChatGPT provides the actual workflow we want, where we describe to the intelligent agent (ChatGPT) what we want and it worries about the accidental-complexity intricacies of the CLIP model itself.

zirgs · 2 years ago
Dall-E has the same problems as other models. Try generating a clockwork mechanism with it, for example.

SD is worth bothering with because it's open; you can run and extend it yourself.

vidarh · 2 years ago
You know when they're bad enough that you know.
orbital-decay · 2 years ago
That's funny to hear, because DALL-E 3 mainly improves prompt understanding; it hallucinates like mad with faces and hands, and doesn't seem to do anything to improve them the way Midjourney does, for example.

>Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

Hyper-realistic, but is it what you want from it? Are you able to guide it into doing exactly what you want? If your requirements are such that just a natural-language prompt is enough and is somehow faster than sketching and providing references, of course use it. I'm not so lucky: I don't get what I want from it, and no amount of prompt understanding will make it easier. SD/SDXL doesn't pass the quality bar either, though, not because it's not "detailed" or "hyper-realistic" enough, but because it doesn't pay attention to the things that should be prioritized, like linework or lighting. Neither does any other model. ControlNets and LoRAs alone aren't sufficient for controllability either, mostly because the model is too small to understand high-level concepts. So I don't use anything.

ChildOfChaos · 2 years ago
Love all this AI stuff and would love to play with it more, but sadly I'm on a 2015 iMac: great for everything else I do, but it can't do this stuff.

It's pricey to get a Windows machine + GPU, and the cloud options seem a bit more limited and add up quickly too, but it is amazing tech.

newswasboring · 2 years ago
I have done a bunch of Stable Diffusion stuff on Colab. The free version works if you are lucky enough to get a GPU; that used to happen more often. But the premium Colab isn't badly priced either.

Here is a Colab link to open ComfyUI:

https://github.com/FurkanGozukara/Stable-Diffusion/blob/main...

ChildOfChaos · 2 years ago
They've blocked this on the free version of Colab now, sadly.
imranq · 2 years ago
While SD is pretty interesting, I'm curious what people use it for. Outside of custom greeting cards and backgrounds, it's not really precise enough for conceptual art, nor is it consistent enough for animation.
Filligree · 2 years ago
Illustrating my fanfiction.
tayo42 · 2 years ago
With the luggage example it seems to only generate backgrounds where the lighting makes sense? That's kind of interesting. I was wondering how it would handle the highlight on the right.
minimaxir · 2 years ago
Giving Stable Diffusion constraints forces it to get creative.

It’s the best argument against “AI generated images are just collages”.

Der_Einzige · 2 years ago
This is a general result. For example, ChatGPT struggles hard with following lexical, syntactic, or phonetic constraints in prompts due to the tokenization scheme - see https://paperswithcode.com/paper/most-language-models-can-be...

LLMs + diffusion models are supercharged when using constraints, ControlNet, regional prompting, and related techniques.

zamalek · 2 years ago
In ComfyUI you could run the image through a style-to-style model (SDXL refinement might even pull it off) to change the lighting without changing the content. Or use another ControlNet. Your workflow can get arbitrarily complex.
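
In raw diffusers terms that second pass is just an image-to-image step; a sketch using the SDXL refiner (the 0.3 strength is an arbitrary "keep most of the content" value):

  # Sketch of a second img2img pass over an existing render to tweak lighting
  # while keeping the content. The 0.3 strength is an arbitrary assumption.
  import torch
  from PIL import Image
  from diffusers import StableDiffusionXLImg2ImgPipeline

  refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
  ).to("cuda")

  base_image = Image.open("first_pass.png").convert("RGB")

  relit = refiner(
      prompt="same scene, warm golden-hour lighting",
      image=base_image,
      strength=0.3,  # low strength = small changes, content mostly preserved
  ).images[0]
  relit.save("relit.png")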
Alifatisk · 2 years ago
If I have a large dataset of photos of my face, can I generate images of myself in different places and environments using this?
gbrits · 2 years ago
Yep. LoRAs are the easiest way to go. Loads of tutorials on YouTube. This is a good one: https://www.youtube.com/watch?v=70H03cv57-o
Alifatisk · 2 years ago
Looks like it requires good hardware to run this? My GPU is too old for this.