Right-sizes LLM models to your system's RAM, CPU, and GPU

BloondAndDoom · 15 days ago

This pretty cool, and useful but I only wish this was a website. I don’t like the idea of running an executable for something that can perfectly be done as a website. (Other than some minor features, tbh even you can enable Corsair and still check the installed models from a web browser).

Sounds like a fun personal project though.

jasode · 15 days ago

>I only wish this was a website. I don’t like the idea of running an executable for something that can perfectly be done as a website.

The tool depends on hardware detection. From https://github.com/AlexsJones/llmfit?tab=readme-ov-file#how-... :

  How it works
  Hardware detection -- Reads total/available RAM via sysinfo, counts CPU cores, and probes for GPUs:

  NVIDIA -- Multi-GPU support via nvidia-smi. Aggregates VRAM across all detected GPUs. Falls   back to VRAM estimation from GPU model name if reporting fails.
  AMD -- Detected via rocm-smi.
  Intel Arc -- Discrete VRAM via sysfs, integrated via lspci.
  Apple Silicon -- Unified memory via system_profiler. VRAM = system RAM.
  Ascend -- Detected via npu-smi.
  Backend detection -- Automatically identifies the acceleration backend (CUDA, Metal, ROCm, SYCL, CPU ARM, CPU x86, Ascend) for speed estimation.

Therefore, a website running Javascript is restricted by the browser sandbox so can't see the same low-level details such as total system RAM, exact count of GPUs, etc,

To implement your idea so it's only a website and also workaround the Javascript limitations, a different kind of workflow would be needed. E.g. run macOS system report to generate a .spx file, or run Linux inxi to generate a hardware devices report... and then upload those to the website for analysis to derive a "LLM best fit". But those os report files may still be missing some details that the github tool gathers.

Another way is to have the website with a bunch of hardware options where the user has to manually select the combination. Less convenient but then again, it has the advantage of doing "what-if" scenarios for hardware the user doesn't actually have and is thinking of buying.

(To be clear, I'm not endorsing this particular github tool. Just pointing out that a LLMfit website has technical limitations.)

CoolGuySteve · 15 days ago

That’s like like 4 or 5 fields to fill in on a form. Way less intrusive than installing this thing

seemaze · 15 days ago

I just discovered the other day the hugging face allows you to do exactly this.

With the caveat that you enter your hardware manually. But are we really at the point yet where people are running local models without knowing what they are running them on..?

mongrelion · 14 days ago

> But are we really at the point yet where people are running local models without knowing what they are running them on..?

I can only speak for myself: it can be daunting for a beginner to figure out which model fits your GPU, as the model size in GB doesn't directly translate to your GPU's VRAM capacity.

There is value in learning what fits and runs on your system, but that's a different discussion.

roxolotl · 14 days ago

The other nice part of huggingface’s setup is you can add theoretical hardware and search that way too.

mmmlinux · 14 days ago

People out there are probably vibecoding their username / passwords for websites. Don't under estimate dumb people.

Trigg3r · 15 days ago

Came across a website for this recently that may be worth a look https://whatmodelscanirun.com

Tepix · 15 days ago

It's wildly inaccurate for me.

hhh · 15 days ago

Huggingface has it built in.

azinman2 · 15 days ago

Where?

binsquare · 15 days ago

here's an website for a community-ran db on LLM models with details on configs for their token/s: https://inferbench.com/

mongrelion · 14 days ago

Great idea of inferbench (similar to geekbench, etc.) but as of the time of writing, it's got only 83 submissions, which is underwhelming.

Natfan · 14 days ago

i wouldn't mind a set of well-known unix commands that produce a text output of your machine stats to paste into this hypothetical website of yours (think: neofetch?)

hidelooktropic · 14 days ago

The whole point is to measure your hardware capability. How would you do that as a website?

kristopolous · 15 days ago

always liked this website that kinda does something similar https://apxml.com/tools/vram-calculator

omneity · 15 days ago

This is a great project. FYI all you need is the size of an LLM and the memory amount & bandwidth to know if it fits and the tok/s

It’s a simple formula:

llm_size = number of params * size_of_param

So a 32B model in 4bit needs a minimum of 16GB ram to load.

Then you calculate

tok_per_s = memory_bandwidth / llm_size

An RTX3090 has 960GB/s, so a 32B model (16GB vram) will produce 960/16 = 60 tok/s

For an MoE the speed is mostly determined by the amount of active params not the total LLM size.

Add a 10% margin to those figures to account for a number of details, but that’s roughly it. RAM use also increases with context window size.

zozbot234 · 15 days ago

> RAM use also increases with context window size.

KV cache is very swappable since it has limited writes per generated token (whereas inference would have to write out as much as llm_active_size per token, which is way too much at scale!), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.

Make sure also that you're using mmap to load model parameters, especially for MoE experts. It has no detrimental effect on performance given that you have enough RAM to begin with, but it allows you to scale up gradually beyond that, at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with much lower storage_bandwidth).

0xbadcafebee · 14 days ago

Well mmap can still cause issues if you run short on RAM, and the disk access can cause latency and overall performance issues. It's better than nothing though.

Agree that k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Then the Ollama default quantization for k/v cache is fp16, you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)

kittikitti · 14 days ago

This is a good rule of thumb. I would also include that in most cases, RAM use exponentially increases with context window size.

namibj · 14 days ago

There's zero exponential scaling involved. There is quadratic compute and reasonably log-linear storage, though.

escapeteam · 14 days ago

Thanks for the formula, I wasn't aware of it.

kamranjon · 15 days ago

This is a great idea, but the models seem pretty outdated - it's recommending things like qwen 2.5 and starcoder 2 as perfect matches for my m4 macbook pro with 128gb of memory.

mittermayr · 15 days ago

this is visually fantastic, but while trying this out, it says I can't run Qwen 3.5 on my machine, while it is running in the background currently, coding. So, not sure what the true value of a tool like this is other than getting a first glimpse, perhaps. Also, with unsloth providing custom adjustments, some models that are listed as undoable become doable, and they're not in the tool. Again, not trying to be harsh, it's just a really hard thing to do properly. And like many other similar tools, the maintainer here will also eventually struggle with the fact that models are popping up left and right faster than they can keep up with it.

kittikitti · 14 days ago

You might be swapping out neural weights between disk and RAM. I think people in a year or two will realize why their disks have been failing prematurely, or perhaps you too.

est · 15 days ago

Why do I need to download & run to checkout?

Can I just submit my gear spec in some dropdowns to find out?

7777777phil · 13 days ago

The "biggest model that fits" instinct is just wrong now. Compact models routinely beat massive predecessors from 12 months ago. Scaling laws only reliably predict pre-training loss anyway, not how the model actually performs on your task. Dug into the research behind this: https://philippdubach.com/posts/the-most-expensive-assumptio...

andsoitis · 15 days ago

Claude is pretty good at among recommendations if you input your system specs.

codazoda · 15 days ago

I used this prompt and it suggested a model I already have installed and one other. I'm not sure if it's the "newest" answer.

> What is the best local LLM that I could run on this computer? I have Ollama (and prefer it) and I have LM Studio. I'm willing to install others, if it gives me better bang for my buck. Use bash commands to inspect the RAM and such. I prefer a model with tool calling.

0xbadcafebee · 15 days ago

This is probably catching ~85% of cases and you can possibly do better. For example, some AMD iGPUs are not covered by ROCm, so instead you rely on Vulkan support. In that case you can sometimes pass driver arguments to allow the driver to use system RAM to expand VRAM, or to specify the "correct" VRAM amount. (on iGPUs the system RAM and VRAM are physically the same thing) In this case you carefully choose how much system RAM to give up, and balance the two carefully (to avoid either OOM on one hand, or too little VRAM on the other). But do this and you can pick models that wouldn't otherwise load. Especially useful with layer offload and quantized MoE weights.