The best VRAM calculator I have found is https://apxml.com/tools/vram-calculator. It is much more thorough than this one. For example, it understands different models' attention schemes, so it calculates KV cache size correctly, and it supports quantization of both the model and the KV cache. It also covers fine-tuning. It has its own limitations, such as only supporting specific models. In practice, though, generic calculators are not very useful: model architectures vary (mainly in how the KV cache is sized), so generic estimates end up way off. (Not sure whether it would be better to discuss it separately, but I submitted it at https://news.ycombinator.com/item?id=44677409)
It gives weird results for me. I’m using Qwen3-32B with 32K context length at Q4_K_M, with 8 bit KV cache fully offloaded to 24GB VRAM. According to this calculator this should be impossible by a large margin, yet it’s working for me.
Edit: this might be because I’ve got flash attention enabled in Ollama.
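A rough back-of-the-envelope check of the setup above, as a sketch under stated assumptions rather than a statement of what either calculator does. The model shape (64 layers, 64 query heads, 8 KV heads, head dimension 128), the 32.8B parameter count, and the ~4.85 bits per weight for Q4_K_M are assumptions filled in for illustration; an 8-bit KV cache is treated as exactly 1 byte per element, ignoring any per-block scale overhead.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8

# Assumed model shape (not taken from the thread).
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 64, 64, 8, 128
CONTEXT = 32 * 1024

# Grouped-query attention: only the 8 KV heads are cached.
kv_gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=1)
# What a generic calculator would get if it assumed one KV head per query head.
kv_mha = kv_cache_bytes(N_LAYERS, N_Q_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=1)
# Assumed 32.8B parameters at ~4.85 bits per weight.
weights = weight_bytes(32.8e9, 4.85)

print(f"KV cache (GQA): {kv_gqa / GIB:.1f} GiB")
print(f"KV cache (MHA): {kv_mha / GIB:.1f} GiB")
print(f"Weights:        {weights / GIB:.1f} GiB")
print(f"Weights + GQA KV cache: {(weights + kv_gqa) / GIB:.1f} GiB")
```

Under these assumptions the weights plus KV cache come in just under 24 GiB, while a calculator that ignores grouped-query attention would overestimate the KV cache by 8x. Compute buffers and runtime overhead are not counted here.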
This one is indeed much better, and it immediately addresses the feedback I wanted to leave on the one originally posted: instead of calculating an artificial scenario, I would like to ask what I can run on the hardware I actually have at hand.
Thanks!
The training memory breakdown is wildly inaccurate.
- No one trains big models in FP32 anymore.
- Gradients can also often be in BF16, and they don't actually have to be stored if you're not using gradient accumulation or if you're accumulating them directly in the optimizer's state.
- 32-bit Adam is silly; if you don't have infinite VRAM there's no reason why you wouldn't want to use 8-bit Adam (or you can go even lower with quantized Muon)
- Activations? They take up memory too, but are not mentioned.
It shows that to train a 3.77B parameter model I need 62GB of VRAM. To give you some perspective on how overestimated this is: a few weeks back I was training (full fine-tuning, not LoRA) a 14B parameter model on 24GB of VRAM, using every trick in the book to lower VRAM usage. To be fair, not all of those tricks are available in publicly available training harnesses, but the point still stands: even with an off-the-shelf training harness you can do a lot better than what this calculator suggests.
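The per-parameter arithmetic behind the points above can be sketched as follows. This is an illustration under assumptions, not a reconstruction of the calculator's formula: the "naive" row assumes FP32 weights, FP32 gradients, and two FP32 Adam states, and the "lean" row assumes pure BF16 weights with no FP32 master copy, BF16 gradients, and two 8-bit Adam states. Activations and quantization-scale overhead are left out of both rows.

```python
def training_state_gb(n_params, weight_bytes, grad_bytes, optim_bytes):
    """Memory for weights, gradients, and optimizer state in GB (activations excluded)."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

N_PARAMS = 3.77e9  # the model size quoted in the comment above

# Assumed naive setup: FP32 weights (4) + FP32 grads (4) + two FP32 Adam states (8).
naive = training_state_gb(N_PARAMS, weight_bytes=4, grad_bytes=4, optim_bytes=8)

# Assumed lean setup: BF16 weights (2) + BF16 grads (2) + two 8-bit Adam states (2).
lean = training_state_gb(N_PARAMS, weight_bytes=2, grad_bytes=2, optim_bytes=2)

print(f"naive: {naive:.1f} GB")
print(f"lean:  {lean:.1f} GB")
```

The naive row works out to 16 bytes per parameter, which lands near the 62GB figure quoted above; the lean row is 6 bytes per parameter, and drops further if gradients are never materialized as a full buffer.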
Great points about training optimizations. For inference, similarly dramatic memory reductions are possible through quantization: compared to FP16, INT8 roughly halves weight memory and INT4 roughly quarters it, allowing much larger models on consumer GPUs.
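For weights alone, the savings relative to FP16 follow directly from the bit width. A minimal sketch, assuming a hypothetical 32B-parameter model and ignoring the per-block scales that real quantization formats add:

```python
def weight_gb(n_params, bits_per_weight):
    """Weight memory in GB for a given bit width (no scale/zero-point overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 32e9  # hypothetical model size, chosen for illustration

fp16 = weight_gb(N_PARAMS, 16)
int8 = weight_gb(N_PARAMS, 8)
int4 = weight_gb(N_PARAMS, 4)

for name, gb in (("FP16", fp16), ("INT8", int8), ("INT4", int4)):
    print(f"{name}: {gb:.0f} GB ({fp16 / gb:.0f}x smaller than FP16)")
```

The KV cache and activations are separate costs and shrink only if they are quantized too.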
No they're not? The process is essentially exactly the same, just with a much lower total FLOPs budget, since if you're not training from scratch then you don't need to train for as long. I can use *exactly* the same code that I used to fine-tune a model to train a new model from scratch; literally the only differences are whether I initialize the weights randomly or from an existing model, a couple of hyperparameters (e.g. for training from scratch you want to start at a higher LR), and training for longer.
Hate those ads. "Inference isn't just a buzzword". Who thought it was? (No comment on whether the linked post is a useful tool, I haven't played with it enough to know)
Who in the world is expected to populate 11 select/text fields with their favorite model data points they just happen to have lying around, only to see an absolutely meaningless "295% Inference" outcome?
I would have liked to see the RTX 5060 Ti with 16GB mentioned. I can't tell if it's omitted because it won't work, or if it's excluded for some other reason?
Yeah, weird miss, but maybe just because it came out more recently. It can be used for ~anything a 5070 could be used for, no? Maybe slower, but still.
What a dumpster
The default should be open and portable APIs, not needlessly furthering a hegemony that is detrimental to us all.