> more than a single model and a lot of finetunes/high rank LoRAs
I can imagine a way might be found to host a base model and a bunch of LoRAs while using barely more RAM than the base model alone.
The fine-tuning could be done so that only a tiny fraction of the weights change (say 0.1%), and at inference time the difference is applied not to the weights themselves but to each layer's output activations.
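This is roughly how multi-adapter LoRA serving works: the base weight matrix is shared across all requests, and each fine-tune is kept only as a low-rank pair (A, B) whose correction is added in activation space, y = Wx + B(Ax). A minimal NumPy sketch (all names and sizes here are hypothetical, chosen just to illustrate the memory math):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

# One shared base weight matrix serves every request.
W = rng.standard_normal((d_out, d_in))

# Each fine-tune is stored only as a low-rank (A, B) pair:
# rank * (d_in + d_out) floats instead of a full d_out * d_in copy of W.
adapters = {
    "finetune_a": (rng.standard_normal((rank, d_in)) * 0.01,
                   rng.standard_normal((d_out, rank)) * 0.01),
    "finetune_b": (rng.standard_normal((rank, d_in)) * 0.01,
                   rng.standard_normal((d_out, rank)) * 0.01),
}

def forward(x, adapter=None):
    y = W @ x                        # shared matmul over the base weights
    if adapter is not None:
        A, B = adapters[adapter]
        y = y + B @ (A @ x)          # cheap correction in activation space
    return y

x = rng.standard_normal(d_in)
base = forward(x)
tuned = forward(x, "finetune_a")

# Per-adapter memory: full weight copy vs low-rank pair.
full_copy = d_out * d_in
lora_pair = rank * (d_in + d_out)
print(full_copy / lora_pair)  # 32.0, i.e. ~32x smaller per adapter
```

The key point is that W is never duplicated or modified; each adapter only adds two small matmuls per layer, so dozens of fine-tunes can share one copy of the base model's RAM.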
Disclaimer: I'm one of the authors.