https://github.com/hlky/stable-diffusion
It supports both txt2img and img2img. (Not affiliated.)
Edit: Incidentally, I tried running it on a CPU. It is possible, but it took 3 minutes instead of 10 seconds to produce an image. It also required me to hack up the script in a really gross way. Perhaps there is a script somewhere that properly supports this.
https://github.com/magnusviri/stable-diffusion
I do runs at 384px by 384px, with a batch size of 1. The sampling method has almost no impact on memory. Using k_euler with 30 steps renders an image in 10 to 20 seconds. The biggest things that affect rendering speed are the step count and the resolution, so 512x512 with C 50 using ddim is much slower than 256x256 with C 25 using k_euler.
The sampling methods mostly run in similar times, but k_euler can produce viable output at lower C values, which effectively makes it faster than the rest.
Don't add gfpgan in the same pipeline, as it takes more VRAM.
I'm running it on Windows 10 with the latest drivers. I set the Python process to Realtime priority in Task Manager (makes a slight difference!). Have not tried it on Linux.
I'm thinking about getting a 3090 so that I can make higher resolution images.
Gfpgan runs much faster for me, 5 seconds per picture.
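For comparison, a minimal sketch of the same knobs (resolution, steps, batch size) through Hugging Face's diffusers library rather than the fork's CLI; the model id, fp16 setting and guidance value here are assumptions, not anything from the comment above:

    import torch
    from diffusers import StableDiffusionPipeline

    # Assumed model id; the gated repo requires accepting the license on huggingface.co first.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # 384x384, 30 steps, one image per prompt -- roughly the settings described above.
    result = pipe(
        "a watercolor fox in a snowy forest",
        height=384, width=384,
        num_inference_steps=30,
        guidance_scale=7.5,        # classifier-free guidance scale
        num_images_per_prompt=1,   # batch size
    )
    result.images[0].save("fox.png")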
Nah, that's normal. It's why GPUs are the usual thing for AI. Any crap, old, weak GPU with 4 GB of memory would run circles around a CPU.
It's often easier to actually get models to run on CPU, due to simpler install configs and more available memory. It's just painful to get a result out of it. Which might be part of why the install stays simple: it's not even worth optimizing for.
How do you set up the model? The instructions only say "Download the model checkpoint. (e.g. from huggingface)", but I can't find instructions there on how to find a ckpt file, nor exactly which file I should look for.
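For anyone else stuck here: the original weights live in the CompVis/stable-diffusion-v-1-4-original repo on Hugging Face, and the file most forks want is sd-v1-4.ckpt. A rough sketch of fetching it with the huggingface_hub library (the repo is gated, so this assumes you've accepted the model license on the website and logged in, or pass a token):

    from huggingface_hub import hf_hub_download

    # Downloads the checkpoint into the local HF cache and returns its path.
    # Requires having accepted the license and run `huggingface-cli login`
    # (or passing token="hf_...").
    ckpt_path = hf_hub_download(
        repo_id="CompVis/stable-diffusion-v-1-4-original",
        filename="sd-v1-4.ckpt",
    )
    print(ckpt_path)  # point the fork's --ckpt flag / config at this file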
Takes 3 minutes (for a prompt resulting in a set of 4 images) on my 1080 as well. Really astonished that it takes GP about the same time using just a CPU. Seems like the older generation of GPUs isn't much better than CPUs in regards to ML stuff.
To add another data point, my GTX 1080 takes ~60 sec to generate a pair of 500x500 images using txt2img. Haven't tried img2img yet, as the UI package I went with is a bit buggy with it.
Not sure of exact pricing, but look for a used Maxwell (GeForce 900 series) Nvidia GPU, I'd bet. A Quadro M2000 with 4 GB of RAM was about 100 on eBay a short bit ago.
https://old.reddit.com/r/StableDiffusion/comments/wy7oa5/img...
https://old.reddit.com/r/StableDiffusion/comments/wyq04v/usi...
https://old.reddit.com/r/StableDiffusion/comments/wzlmty/its...
You can find the announcement tweet here: https://twitter.com/mishig25/status/1563226161924407298?s=20...
old.reddit is truly horrible on mobile. Once you click on an image you can't go back. Off topic, but what is the other alternative UI called that people sometimes use?
I've been playing with this for a few hours. It's slow going -- you really need a fast GPU with a lot of RAM to make this very usable.
I ended up paying the $10 for Google Colab Pro and that's how I've been using this. Maybe I'll figure out how to get this working on my old 1080 Ti to see if it's faster.
Anyway, for the one that I'm using which has a web UI, you can use this Colab link. It's pretty great! https://colab.research.google.com/drive/1KeNq05lji7p-WDS2BL-...
What I really wish is that the img2img tool could be used to take a txt2img output and then "refine" it further. As it is, the img2img tool doesn't seem particularly great.
People on Reddit are talking about "I just generate 100 images and pick the best one"... but this is incredibly slow on the P100 GPU that Google has me on. Does this just require a monster GPU like a 3080/3090 in order to get any decent results?
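On the "refine a txt2img output" idea: that is roughly what img2img with a low-to-medium strength does. A rough diffusers sketch, not the Colab UI's actual code; it assumes a diffusers version with the .components helper, and older releases name the input init_image instead of image:

    from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

    prompt = "a castle on a cliff, detailed oil painting"
    txt2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
    draft = txt2img(prompt, num_inference_steps=30).images[0]

    # Reuse the same weights for img2img and feed the draft back in.
    # strength controls how much gets re-noised: low keeps the layout,
    # high behaves almost like a fresh txt2img run.
    img2img = StableDiffusionImg2ImgPipeline(**txt2img.components)
    refined = img2img(prompt=prompt, image=draft, strength=0.4, guidance_scale=7.5).images[0]
    refined.save("refined.png")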
Also how slow is your p100? I'm usually getting around 3 it/s. Maybe it's just because I'm used to disco diffusion where a single image took over an hour, but this is ungodly fast to me
FWIW I'm using an old GTX 1080 Ti to play around; it takes about 21 seconds per image. You can make it go even faster by lowering the timesteps from the default 50 (--ddim_steps), though both lowering and raising that value can result in quite different first-iteration images (though they tend to be similar), and it seems to guarantee totally different further-iteration images (as counted by --n_iter). I'm with you on the feeling that it's hard to control, whether in refinement or in other ways, but I suspect that'll get a lot better in the next couple of years (if not weeks, or dare I say days).
You're probably using the default PLMS sampler with 50 steps. There are better samplers; the best seem to be Euler (more predictable with regard to the number of steps) and Euler ancestral (gives more variation). Both typically need far fewer steps to converge, which speeds up generation.
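If you're using diffusers rather than the fork scripts, swapping samplers looks roughly like this; the scheduler class name assumes a diffusers release that ships the Euler schedulers:

    from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

    # Swap the default scheduler for Euler ancestral; it usually converges
    # in noticeably fewer steps than the stock 50.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

    image = pipe("a lighthouse at dawn", num_inference_steps=20).images[0]
    image.save("lighthouse.png")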
HuggingFace is a company that mainly builds open-source libraries and platforms to support open-source ML projects. They started out with their famous Transformers library and have many other libraries, including the diffusers one that this application is actually using. They also have a model/dataset hub and an interactive application platform known as "Spaces". Their goal is to be the "GitHub of machine learning".
Their business model is basically supporting enterprise and private use-cases. For example, getting expert support for using these libraries, or hosting models and datasets privately. You can see more information about the pricing here:
https://huggingface.co/pricing
They reached a $2 billion valuation after a recent round of funding so overall they're probably pretty flush with cash lol
https://news.ycombinator.com/item?id=32634139
In that thread people are complaining about the price of getting their images done. So part of it may actually be exchanging money for a service. I bet a good chunk of it is investor cash though.
I've been trying to get some sensible images out of my descriptions, but I fail miserably.
In this case I had the prompt "cow chewing bone" and got 4 squares representing the two pairs of feet, the body and the head. None of them cared about chewing on a bone.
With DALL·E 2 I tried to get an image of a little girl building sandcastles and a monster threatening her:
"little scared girl building a sandcastle and a big angry monster is looking at her."
"little scared girl building a sandcastle six damaged sandcastles are to her side. a big angry monster is threatening her. it is dark." https://imgur.com/a/f5FFKOi
"little scared girl building a sandcastle with six damaged sandcastles to her side and a big angry monster threatening her"
Is there some kind of structure the sentences should follow?
Yes, check out examples on lexica or use a prompt builder to help, like promptmania.
Also, most of the good ones you see online are cherry-picked from hundreds of runs, so set your batch size to 1000 and go to bed! After that, people tend to run some of the good results through img2img, also with a lot of variations produced from a single image. Finally, some people run them at higher resolutions if they have enough VRAM, as smaller resolutions can distort or generate rubbish. For the messed-up faces, they run the results through gfpgan a few times to get prettier faces. Other than that, it is pure luck (using random seeds) to figure out what works and what doesn't. Use the 2 sites above to help you improve your prompts.
(meant in the context of stable diffusion)
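A minimal sketch of that "big batch overnight, keep the seeds" workflow, written against diffusers (the fork CLIs expose the same idea through seed and iteration-count flags); the prompt and counts are just placeholders:

    import random
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a little girl building a sandcastle, a huge monster looming behind her"

    for i in range(200):  # or 1000, if you really are going to bed
        seed = random.randrange(2**32)
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
        # Keep the seed in the filename so the good ones can be reproduced
        # or pushed through img2img / gfpgan later.
        image.save(f"out_{i:04d}_seed_{seed}.png")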
Just know that if you let it run overnight often, you will see it on your electricity bill. My GTX 1660 runs at max while rendering, which is 125 W. Leaving it running overnight can easily eat 2 to 6 kWh, depending on your system (e.g. a system drawing 300 W for 10 hours is 3 kWh).
I managed to get one that was correct with "A little scared girl is building a sandcastle, while a monster is looking at her. Award-winning photograph.", but I couldn't figure out a phrasing where it wouldn't, most of the time, get confused into thinking that the sandcastle is the monster, or that the girl is the monster.
DALL·E is bad at being instructed to have an exact count of items in the picture. Ask for 6 kittens and you get 7, and each kitten will be much more "wrong" than a picture of a single kitten.
DALL·E is also bad at positional prompts. Ask for something to be in the top right-hand corner and it will appear bottom centre.