I've also been using this one (I wasn't sure about it at first, since they just migrated from the /hlky/ namespace on GitHub), but at first glance I have no idea what the differences are.
I will say that this one has had REALLY active development as new features have been coming out, and it's pretty polished at this point (although I'm using it more as a toy than anything, it's awesome to have a quick way to try the new features that have been shipping).
I like this one but had some trouble using img2img. Maybe my image was too small (it was smaller than 512x512). It failed with the same error signature as an issue that had already been closed with a fix.
So, and this is an ELI5 kind of question I suppose: there must be something going on like "processing a kazillion images," and I'm trying to wrap my head around how (or what part of) that work is "offloaded" to your home computer/graphics card. I just can't make sense of how you can do it at home if you're not somehow in direct contact with "all the data." E.g., must you be connected to the internet, or to "Stable Diffusion's servers," for this to work?
You can think of it more like this:
If I do 100 experiments of dropping stones from variable heights and measuring the time each stone takes to reach the ground, I have enough datapoints to estimate gravity: since t² is linear in h, a linear regression of t² against h recovers it. So based on my data I create a model: the time it takes a stone to fall from height h is sqrt(2h/9.81). Now if you want to figure out how long your stones take to fall, you don’t need to redo all the experiments; you can instead rely on the parameter I give you (9.81 in this case) and calculate it yourself.
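As a tiny sketch of that fitting step (with made-up heights and noise-free simulated drop times, so the recovered value is exact):

```python
import math

# Simulated "experiments": drop times for stones from various heights,
# generated with g = 9.81 m/s^2 via t = sqrt(2h/g).
heights = [1.0, 2.0, 5.0, 10.0, 20.0]          # metres
times = [math.sqrt(2 * h / 9.81) for h in heights]

# t^2 = (2/g) * h is linear in h, so a least-squares fit of t^2 against h
# gives the slope 2/g, and therefore g, without redoing any experiments.
ys = [t * t for t in times]
slope = sum(h * y for h, y in zip(heights, ys)) / sum(h * h for h in heights)
g_estimate = 2 / slope

print(round(g_estimate, 2))  # -> 9.81
```

The person receiving just the single fitted parameter can now predict any drop time without ever seeing the raw data, which is the analogy being made.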
With these models it works exactly the same way. Someone dropped millions of rocks and created a formula of unbelievable complexity, and then released that formula, with all its calculated parameters, into the world. When you use Stable Diffusion, you are just calculating the result of that formula, and the result is your image. You never have to process those images yourself.
That’s the interesting part: all the images generated are derived from a less than 4gb model (the trained weights of the neural network).
So in a way, hundreds of billions of possible images are all stored in the model (each a vector in a multidimensional latent space) and turned into pixels on demand (driven by the language model, which knows how to turn words into a vector in that space).
As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image), it’s a form of compression (or at least encoding/decoding) too: I could send you the parameters for 1 million images as a relatively small text file, and you would be able to recreate them on your side.
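A rough sense of the sizes involved (the field names here are illustrative, not any particular tool's actual API; the point is only that a full "recipe" for an image is on the order of 100 bytes of text):

```python
import json

# Hypothetical "recipe" for one image: everything needed to regenerate it
# deterministically. Field names are made up for illustration.
recipe = {"prompt": "a watercolor landscape at dusk", "seed": 42,
          "steps": 50, "cfg_scale": 7.5, "width": 512, "height": 512}

per_image = len(json.dumps(recipe))        # roughly 100 bytes of text
million_mib = per_image * 1_000_000 / 2**20  # a million recipes, in MiB

print(per_image, round(million_mib))
```

A million such recipes come to a bit over 100 MiB uncompressed (far less gzipped, since prompts repeat heavily), versus hundreds of gigabytes for a million PNGs.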
> As it’s deterministic (given the exact same request parameters, random seed included, you get the exact same image), it’s a form of compression (or at least encoding/decoding) too: I could send you the parameters for 1 million images as a relatively small text file, and you would be able to recreate them on your side.
For any input image? Or do you mean an image generated by the model?
This is the main reason why claims that these sorts of AI are just glorified lookup tables, or even that they simply mash a kazillion images together, are so misleading.
A kazillion images are used in training, but training consists of using those images to tune roughly 5 GB of weights, and that is the entire size of the final model. Those images are never stored anywhere else; they are discarded immediately after being used to tune the model. Those 5 GB generate all the images we see.
All those 'kazillion' images are processed into a single 'model'. Similar to how our brain cannot remember 100% of our experiences, this model does not store precise copies of all the images it was trained on. However, it does understand concepts, such as what a unicorn looks like.
For StableDiffusion, the current model is ~4GB, which is downloaded the first time you run the model. These 4GB encode all the information that the model requires to derive your images.
SD has 860M weights for the main workhorse part. At 16-bit precision that is only 1.6 GB of data, which in some very real sense has condensed the world's total knowledge of art and photography and styles and objects.
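The back-of-the-envelope arithmetic behind that figure:

```python
# 860 million parameters at 16-bit (2-byte) precision.
params = 860_000_000
bytes_fp16 = params * 2

print(round(bytes_fp16 / 2**30, 2))  # -> 1.6 (GiB)
```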
It's not a search engine; it's self-contained, and the closest analogy is that it's a very, very knowledgeable and skilled artist.
What you interact with as the user is the model and its weights.
The model (presumably some kind of convolutional neural network) has many layers, every layer has some set of nodes, and every connection between nodes has a weight, which is just some coefficient. The weights are 'learned' during model training, where the model takes in the data you mention and evaluates its output. This typically happens on a super beefy computer and can take a long time for a model like this. As images are evaluated, the weights get adjusted accordingly and the output gets better.
Now we as the user just need the model and the weights!
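To make "the model is just its weights" concrete, here is a toy two-layer network in plain Python. The weights are made up; in a real network they are the numbers produced by training, and anyone holding them can run the model:

```python
import math

# A "model" is just an architecture plus learned numbers. These weights
# are invented for illustration; training is what would produce real ones.
W1 = [[0.5, -0.3], [0.8, 0.1]]   # layer 1: 2 inputs -> 2 hidden nodes
W2 = [0.7, -0.2]                 # layer 2: 2 hidden nodes -> 1 output

def forward(x):
    # One hidden layer with a sigmoid nonlinearity, then a weighted sum.
    hidden = [1 / (1 + math.exp(-(row[0] * x[0] + row[1] * x[1])))
              for row in W1]
    return W2[0] * hidden[0] + W2[1] * hidden[1]

print(round(forward([1.0, 2.0]), 3))  # -> 0.186
```

Stable Diffusion is the same idea scaled up to hundreds of millions of weights: downloading the ~4 GB file gives you every number needed to evaluate it locally.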
It’s all offline, in a ~4 GB file on your local computer. It’s like a mini brain trained to do just one (or a few) specific tasks. Just as your own brain doesn’t need Wi-Fi to connect to a global memory store of everything you’ve experienced since birth, this 4 GB file doesn’t need anything extra.
A kazillion images are used to create/optimize a neural network (basically). What you're working with is the result of that training: the "weights".
As someone with ~0 knowledge in this field, I think this has to do with a concept called "transfer learning", in which you train once on that kazillion of images, then use those same "coefficients" for further runs of the NN.
Nah, transfer learning is when you take a trained model, and train it a little more to better fit your (potentially very different) problem domain. Such as training a cat/dog/etc recognition model on MRI scans.
The goal is usually to have the more fundamental parts of your model already working and you thus need way less domain specific data.
Here, you're not training anything, you're running the models (both the CLIP language model and the unet) in feedforward. That's just deploying your model, not transfer learning.
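The distinction can be shown with a one-parameter toy model (made-up numbers, purely illustrative): a training step writes to the weights, while feedforward inference only reads them.

```python
# Toy model y = w * x with a single weight.
w = 0.5  # current weight

def forward(x):
    return w * x  # inference: weights are read, never written

def train_step(x, target, lr=0.1):
    global w
    pred = forward(x)
    grad = 2 * (pred - target) * x   # d/dw of the squared error (pred-target)^2
    w -= lr * grad                   # training: the weights change

# One gradient update on the example (x=2, target=3) moves w from 0.5
# toward the ideal value 1.5 (where w * 2.0 == 3.0).
train_step(2.0, 3.0)
print(round(w, 2))  # -> 1.3
```

Deploying Stable Diffusion corresponds to calling only `forward`: the downloaded weights stay frozen and nothing is learned on your machine.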
Looks great, but I use Linux and the README is fairly Windows-centric without warning. It'd be nice if there were clearly delineated sections for Windows vs *nix.
There's a (very ironically named) "Manual installation" section which might seem to be the answer for Linux, but it's not immediately obvious which of the preceding sections are for Linux and which are for Windows without some detective work.
I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.
Stable Diffusion is built on PyTorch. PyTorch has mainly been designed to work with Nvidia cards. However, PyTorch added support for something called ROCm about a year ago that adds compatibility with newer AMD cards.
Unfortunately, ROCm doesn't support slightly older AMD cards in conjunction with Intel processors.
So my pretty powerful 32 GB 2020 16-inch MacBook Pro isn't capable of running Stable Diffusion.
Any native app will likely have to rely on a remote cloud gpu. And boy, those are fucking expensive. Been researching what I need to stand up a service the last few days and it isn't cost friendly.
> I've been looking into this for the last 2 days. Unless you're running an M1 Mac or newer, you're SOL.
And not just any old M1 Mac. Last week I got it running on my 2021 8GB M1 MacBook Air and it's slow. Images at 512x512 with 10 steps take between 7 and 10 minutes to generate.
It's the only thing I do that hits performance limitations on the 8GB machine so there's no regrets on that score, but with the way this stuff is progressing 16GB+ is a realistic minimum for comfortable use.
I've been running the Intel CPU version [0] for a while now on a 2013 MacMini. Works fine; it takes several minutes per image but I can live with that.
Great stuff, and works on an 8GB M1 Air taking between 5 and 10 minutes for between 5 and 15 steps. As a suggestion, perhaps add the option for setting the CFG too (I know, it's open source etc, but it's just a suggestion).
Funny, it probably does a whole lotta things, but it can't create the `~/Desktop/charl-e/samples/` directory? That seems like it should be relatively trivial...
Same! I wrote a public web app so that I could access the model from my phone [0]. This is how I found Replicate [1]. Their SD model is very cheap to use. While we wait for a native Mac app, I recommend accessing the model straight from their web UI.
People recently figured out how to export Stable Diffusion to ONNX, so it'll be exciting to see some actual web UIs for it soon (via quantized models and tfjs/onnxruntime for the web).
Very cool! Can you link to where this is taking place?
A commenter mentioned today that it might be possible to pre-download the model and load it into the browser from the local filesystem, rather than ship such a gigantic blob as an accompanying dependency and fight different caching RFCs, security/usage restrictions, and anything else that might inadvertently trigger a re-download.
MidJourney needs a lot of prompt engineering too. And Dall-E also. If you look at the prompt as an opportunity to describe what you want to see, the results are often disappointing. It works better to think backwards about how the model was trained, and what sorts of web caption words it likely saw in training examples that used the sorts of features you’re hoping it will generate. This is more of a process of learning to ask the model to produce things it’s able to produce, using its special image language.
The metadata and file names of the images in the source data set are also inputs for the model training. These keywords are common tags across images that have these characteristics, so in the same way it knows what a unicorn looks like, it also knows what a 4k unicorn looks like compared to a hyper rez unicorn.
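One practical consequence is that prompts often work better as a pile of caption-style tags than as natural prose. A trivial sketch of that habit (subject and tags invented for illustration):

```python
# Compose a prompt the way training captions/alt-text tend to read:
# a subject followed by comma-separated quality and style tags.
subject = "a unicorn in a forest"
tags = ["4k", "highly detailed", "digital art", "watercolor"]

prompt = ", ".join([subject] + tags)
print(prompt)  # -> a unicorn in a forest, 4k, highly detailed, digital art, watercolor
```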
The results in Midjourney are significantly better than SD's. I find it much easier to get to a good result in MJ, and I've been trying to understand why. Any more insight you could share?
To run it elsewhere in the cloud, grab a GPU (spot) instance and SSH in.
[0] https://news.ycombinator.com/item?id=32642255
https://github.com/huggingface/diffusers/issues/443
Unless you want to train the model, Lambda Labs is somewhat cheap:
https://lambdalabs.com/service/gpu-cloud
I plan to wrap things up and put out the source this weekend.
There are a few bugs to iron out before it's ready for prime time. For now, create the folder `~/Desktop/charl-e/samples/` manually before you run it.
[0] https://www.drawanything.app/
[1] https://replicate.com/
https://news.ycombinator.com/item?id=32777909#32779093
https://github.com/huggingface/diffusers
- activate advanced: create prompt matrix and use
@a painting of a (forest|desert|swamp|island|plains) painted by (claude monet|greg rutkowski|thomas kinkade)
- add different relative weights for words in a prompt:
watercolor :0.5 painting :0.2 by picasso :0.3
- Generate much larger images with your limited VRAM by using the optimized versions of attention.py and model.py
https://github.com/sd-webui/stable-diffusion-webui/discussio...
- Generate "Loab the AI haunting woman" if you can (Try using textual inversion with negatively weighted prompts)
https://www.cnet.com/science/what-is-loab-the-haunting-ai-ar...
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
- add RealESRGAN for better upscaling
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
- add LDSR for crazy good upscaling (for 10x the processing time)
https://github.com/sd-webui/stable-diffusion-webui/wiki/Inst...
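The prompt-matrix syntax in the first tip expands a prompt with `(a|b|c)` option groups into the cross product of all choices. A sketch of plausible expansion semantics, inferred from the example above (ignoring the leading `@` trigger; the real webui may differ in details):

```python
import itertools
import re

def expand(prompt):
    """Expand "(a|b|c)" option groups into the cross product of prompts."""
    groups = re.findall(r"\(([^)]*)\)", prompt)    # capture each option group
    template = re.sub(r"\([^)]*\)", "{}", prompt)  # replace groups with slots
    choices = [g.split("|") for g in groups]
    return [template.format(*combo) for combo in itertools.product(*choices)]

prompts = expand("a painting of a (forest|desert|swamp|island|plains) "
                 "painted by (claude monet|greg rutkowski|thomas kinkade)")
print(len(prompts))  # -> 15 (5 terrains x 3 artists)
```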
While we're at it, what's with the "hyper resolution" and "4K, detailed" adjectives that get thrown around left and right?
https://moritz.pm/posts/parameters