I've also been using this one (I wasn't sure about it at first, since they just migrated from the /hlky/ namespace on GitHub), but at first glance I have no idea what the differences are.
I will say that this one has had REALLY active development as new features have been coming out, and it's pretty polished at this point (although I'm using it more as a toy than anything, it's awesome to have a quick way to try the new features that have been shipping).
I like this one but had some trouble using img2img. Maybe my image was too small (it was smaller than 512x512). It failed with the same error signature as an issue that had already been closed with a fix.
So, and this is an ELI5 kind of question I suppose: there must be something going on like "processing a kazillion images," and I'm trying to wrap my head around how (or what part of) that work is "offloaded" to your home computer/graphics card. I just can't make sense of how you can do it at home if you're not somehow in direct contact with "all the data." E.g., must you be connected to the internet, or to "Stable Diffusion's servers," for this to work?
You can think of it more like this:
If I do 100 experiments of dropping stones from variable heights and measuring the time each stone takes to reach the ground, I have enough datapoints to estimate gravity: since t² is linear in h, a linear regression of t² against h recovers it. So based on my data I create a model: the time it takes a stone to fall from height h is sqrt(2h/9.81). Now if you want to figure out how long your stones take to fall, you don’t need to redo all the experiments; you can instead rely on the parameter I give you (9.81 in this case) and calculate it yourself.
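As a tiny sketch of that fitting step (with made-up heights and noise-free simulated drop times, so the recovered value is exact):

```python
import math

# Simulated "experiments": drop times for stones from various heights,
# generated with g = 9.81 m/s^2 via t = sqrt(2h/g).
heights = [1.0, 2.0, 5.0, 10.0, 20.0]          # metres
times = [math.sqrt(2 * h / 9.81) for h in heights]

# t^2 = (2/g) * h is linear in h, so a least-squares fit of t^2 against h
# gives the slope 2/g, and therefore g, without redoing any experiments.
ys = [t * t for t in times]
slope = sum(h * y for h, y in zip(heights, ys)) / sum(h * h for h in heights)
g_estimate = 2 / slope

print(round(g_estimate, 2))  # -> 9.81
```

The person receiving just the single fitted parameter can now predict any drop time without ever seeing the raw data, which is the analogy being made.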
With these models it works exactly the same way. Someone dropped millions of rocks and created a formula of unbelievable complexity, and then released that formula, with all its calculated parameters, into the world. When you use Stable Diffusion, you are just calculating the result of that formula, and the result is your image. You never have to process those images yourself.
That’s the interesting part: all the images generated are derived from a less than 4gb model (the trained weights of the neural network).
So in a way, hundreds of billions of possible images are all stored in the model (each a vector in a multidimensional latent space) and turned into pixels on demand (driven by the language model, which knows how to turn words into a vector in that space).
As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image), it’s a form of compression (or at least encoding/decoding) too: I could send you the parameters for 1 million images as a relatively small text file, and you would be able to recreate them on your side.
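A rough sense of the sizes involved (the field names here are illustrative, not any particular tool's actual API; the point is only that a full "recipe" for an image is on the order of 100 bytes of text):

```python
import json

# Hypothetical "recipe" for one image: everything needed to regenerate it
# deterministically. Field names are made up for illustration.
recipe = {"prompt": "a watercolor landscape at dusk", "seed": 42,
          "steps": 50, "cfg_scale": 7.5, "width": 512, "height": 512}

per_image = len(json.dumps(recipe))        # roughly 100 bytes of text
million_mib = per_image * 1_000_000 / 2**20  # a million recipes, in MiB

print(per_image, round(million_mib))
```

A million such recipes come to a bit over 100 MiB uncompressed (far less gzipped, since prompts repeat heavily), versus hundreds of gigabytes for a million PNGs.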
> As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image), it’s a form of compression (or at least encoding/decoding) too: I could send you the parameters for 1 million images as a relatively small text file, and you would be able to recreate them on your side.
For any input image? Or do you mean an image generated by the model?
This is the main reason why claims that these sorts of AI are just glorified lookup tables, or even that they simply mash a kazillion images together, are so misleading.
A kazillion images are used in training, but training consists of using those images to tune roughly 5 GB of weights, and that is the entire size of the final model. Those images are never stored anywhere else; they are discarded immediately after being used to tune the model. Those 5 GB generate all the images we see.
All those 'kazillion' images are processed into a single 'model'. Similar to how our brain cannot remember 100% of our experiences, this model does not store precise copies of all the images it was trained on. However, it does understand concepts, such as what a unicorn looks like.
For StableDiffusion, the current model is ~4GB, which is downloaded the first time you run the model. These 4GB encode all the information that the model requires to derive your images.
SD has 860M weights for the main workhorse part. At 16-bit precision that is only 1.6 GB of data, which in some very real sense has condensed the world's total knowledge of art and photography and styles and objects.
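The back-of-the-envelope arithmetic behind that figure:

```python
# 860 million parameters at 16-bit (2-byte) precision.
params = 860_000_000
bytes_fp16 = params * 2

print(round(bytes_fp16 / 2**30, 2))  # -> 1.6 (GiB)
```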
It's not a search engine; it's self-contained, and the closest analogy is that it's a very, very knowledgeable and skilled artist.
What you interact with as the user is the model and its weights.
The model (presumably some kind of convolutional neural network) has many layers, every layer has some set of nodes, and every connection between nodes has a weight, which is just some coefficient. The weights are 'learned' during model training, where the model takes in the data you mention and evaluates its output. This typically happens on a super beefy computer and can take a long time for a model like this. As images are evaluated, the weights get adjusted accordingly and the output gets better.
Now we as the user just need the model and the weights!
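To make "the model is just its weights" concrete, here is a toy two-layer network in plain Python. The weights are made up; in a real network they are the numbers produced by training, and anyone holding them can run the model:

```python
import math

# A "model" is just an architecture plus learned numbers. These weights
# are invented for illustration; training is what would produce real ones.
W1 = [[0.5, -0.3], [0.8, 0.1]]   # layer 1: 2 inputs -> 2 hidden nodes
W2 = [0.7, -0.2]                 # layer 2: 2 hidden nodes -> 1 output

def forward(x):
    # One hidden layer with a sigmoid nonlinearity, then a weighted sum.
    hidden = [1 / (1 + math.exp(-(row[0] * x[0] + row[1] * x[1])))
              for row in W1]
    return W2[0] * hidden[0] + W2[1] * hidden[1]

print(round(forward([1.0, 2.0]), 3))  # -> 0.186
```

Stable Diffusion is the same idea scaled up to hundreds of millions of weights: downloading the ~4 GB file gives you every number needed to evaluate it locally.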
It’s all offline, in a ~4 GB file on your local computer. It’s like a mini brain trained to do just one (or a few) specific tasks. Just as your own brain doesn’t need Wi-Fi to connect to a global memory store of everything you’ve experienced since birth, this 4 GB file doesn’t need anything extra.
A kazillion images are used to create/optimize a neural network (basically). What you're working with is the result of that training: the "weights".
As someone with ~0 knowledge in this field, I think this has to do with a concept called "transfer learning", in which you train once on that kazillion of images, then use those same "coefficients" for further runs of the NN.
Nah, transfer learning is when you take a trained model, and train it a little more to better fit your (potentially very different) problem domain. Such as training a cat/dog/etc recognition model on MRI scans.
The goal is usually to have the more fundamental parts of your model already working and you thus need way less domain specific data.
Here, you're not training anything, you're running the models (both the CLIP language model and the unet) in feedforward. That's just deploying your model, not transfer learning.
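The distinction can be shown with a one-parameter toy model (made-up numbers, purely illustrative): a training step writes to the weights, while feedforward inference only reads them.

```python
# Toy model y = w * x with a single weight.
w = 0.5  # current weight

def forward(x):
    return w * x  # inference: weights are read, never written

def train_step(x, target, lr=0.1):
    global w
    pred = forward(x)
    grad = 2 * (pred - target) * x   # d/dw of the squared error (pred-target)^2
    w -= lr * grad                   # training: the weights change

# One gradient update on the example (x=2, target=3) moves w from 0.5
# toward the ideal value 1.5 (where w * 2.0 == 3.0).
train_step(2.0, 3.0)
print(round(w, 2))  # -> 1.3
```

Deploying Stable Diffusion corresponds to calling only `forward`: the downloaded weights stay frozen and nothing is learned on your machine.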
Looks great, but I use Linux and the README is fairly Windows-centric without warning. It'd be nice if there were clearly delineated sections for Windows vs *nix.
There's a (very ironically named) "Manual installation" section which might seem to be the answer for Linux, but it's not immediately obvious which of the preceding sections are for Linux and which are for Windows without some detective work.
I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.
Stable Diffusion is built on PyTorch. PyTorch has mainly been designed to work with Nvidia cards. However, PyTorch added support for something called ROCm about a year ago that adds compatibility with newer AMD cards.
Unfortunately, ROCm doesn't support slightly older AMD cards in conjunction with Intel processors.
So my pretty powerful 32 GB 2020 16-inch MacBook Pro isn't capable of running Stable Diffusion.
Any native app will likely have to rely on a remote cloud gpu. And boy, those are fucking expensive. Been researching what I need to stand up a service the last few days and it isn't cost friendly.
> I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.
And not just any old M1 Mac. Last week I got it running on my 2021 8GB M1 MacBook Air and it's slow. Images at 512x512 with 10 steps take between 7 and 10 minutes to generate.
It's the only thing I do that hits performance limitations on the 8GB machine so there's no regrets on that score, but with the way this stuff is progressing 16GB+ is a realistic minimum for comfortable use.
I've been running the Intel CPU version [0] for a while now on a 2013 MacMini. Works fine; it takes several minutes per image but I can live with that.
Great stuff, and works on an 8GB M1 Air taking between 5 and 10 minutes for between 5 and 15 steps. As a suggestion, perhaps add the option for setting the CFG too (I know, it's open source etc, but it's just a suggestion).
Funny, it probably does a whole lotta things, but it can't create the `~/Desktop/charl-e/samples/` directory? That seems like it should be relatively trivial...
Same! I wrote a public web app so that I could access the model from my phone [0]. This is how I found Replicate [1]. Their SD model is very cheap to use. While we wait for a native Mac app, I recommend accessing the model straight from their web UI.
People recently figured out how to export Stable Diffusion to ONNX, so it'll be exciting to see some actual web UIs for it soon (via quantized models and tfjs/onnxruntime for the web).
Very cool! Can you link to where this is taking place?
A commenter mentioned today that it might be possible to pre-download the model and load it into the browser from the local filesystem, rather than ship such a gigantic blob as an accompanying dependency and fight different caching RFCs, security/usage restrictions, and anything else that might inadvertently trigger a re-download.
MidJourney needs a lot of prompt engineering too. And Dall-E also. If you look at the prompt as an opportunity to describe what you want to see, the results are often disappointing. It works better to think backwards about how the model was trained, and what sorts of web caption words it likely saw in training examples that used the sorts of features you’re hoping it will generate. This is more of a process of learning to ask the model to produce things it’s able to produce, using its special image language.
The metadata and file names of the images in the source data set are also inputs for the model training. These keywords are common tags across images that have these characteristics, so in the same way it knows what a unicorn looks like, it also knows what a 4k unicorn looks like compared to a hyper rez unicorn.
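One practical consequence is that prompts often work better as a pile of caption-style tags than as natural prose. A trivial sketch of that habit (subject and tags invented for illustration):

```python
# Compose a prompt the way training captions/alt-text tend to read:
# a subject followed by comma-separated quality and style tags.
subject = "a unicorn in a forest"
tags = ["4k", "highly detailed", "digital art", "watercolor"]

prompt = ", ".join([subject] + tags)
print(prompt)  # -> a unicorn in a forest, 4k, highly detailed, digital art, watercolor
```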
The results in Midjourney are significantly better than SD's. I find it much easier to get to a good result in MJ, and I've been trying to understand why. Any more insight you could share?
To run it elsewhere in the cloud, grab a GPU (spot) instance and SSH in.
[0] https://news.ycombinator.com/item?id=32642255
https://github.com/huggingface/diffusers/issues/443
Unless you want to train the model, Lambda Labs is somewhat cheap:
https://lambdalabs.com/service/gpu-cloud
I plan to wrap things up and put out the source this weekend.
There are a few bugs to iron out before it's ready for prime time. For now, create the folder `~/Desktop/charl-e/samples/` manually before you run it.
[0] https://www.drawanything.app/
[1] https://replicate.com/
https://news.ycombinator.com/item?id=32777909#32779093
https://github.com/huggingface/diffusers
- activate advanced: create prompt matrix and use
@a painting of a (forest|desert|swamp|island|plains) painted by (claude monet|greg rutkowski|thomas kinkade)
- add different relative weights for words in a prompt:
watercolor :0.5 painting :0.2 by picasso :0.3
- Generate much larger images with your limited VRAM by using the optimized versions of attention.py and model.py
https://github.com/sd-webui/stable-diffusion-webui/discussio...
- Generate "Loab the AI haunting woman" if you can (Try using textual inversion with negatively weighted prompts)
https://www.cnet.com/science/what-is-loab-the-haunting-ai-ar...
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
- add RealESRGAN for better upscaling
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
- add LDSR for crazy good upscaling (for 10x the processing time)
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
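The prompt-matrix syntax in the first tip expands a prompt with `(a|b|c)` option groups into the cross product of all choices. A sketch of plausible expansion semantics, inferred from the example above (ignoring the leading `@` trigger; the real webui may differ in details):

```python
import itertools
import re

def expand(prompt):
    """Expand "(a|b|c)" option groups into the cross product of prompts."""
    groups = re.findall(r"\(([^)]*)\)", prompt)    # capture each option group
    template = re.sub(r"\([^)]*\)", "{}", prompt)  # replace groups with slots
    choices = [g.split("|") for g in groups]
    return [template.format(*combo) for combo in itertools.product(*choices)]

prompts = expand("a painting of a (forest|desert|swamp|island|plains) "
                 "painted by (claude monet|greg rutkowski|thomas kinkade)")
print(len(prompts))  # -> 15 (5 terrains x 3 artists)
```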
While we're at it, what's with the "hyper resolution" and "4K, detailed" adjectives that get thrown around left and right?
https://moritz.pm/posts/parameters