behohippy commented on Benchmark Framework Desktop Mainboard and 4-node cluster   github.com/geerlingguy/ol... · Posted by u/geerlingguy
KronisLV · 7 months ago
For comparison, at work we got a pair of Nvidia L4 GPUs: https://www.techpowerup.com/gpu-specs/l4.c4091

That gives us a total TDP of around 150W and 48 GB of VRAM, and we can run Qwen 3 Coder 30B A3B at 4-bit quantization with up to 32k context at around 60-70 t/s with Ollama. I also tried out vLLM, but the performance surprisingly wasn't much better (maybe it would be under bigger concurrent load). Felt like sharing the data point, since the setups are comparable.

Honestly it's a really good model, even good enough for some basic agentic use (e.g. with Aider, RooCode and so on). MoE seems the way to go for somewhat limited hardware setups.

Obviously not recommending L4 cards, because they have a pretty steep price tag. Most consumer cards feel a bit power hungry and you'll probably need more than one to fit decent models, though being able to game on the same hardware sounds pretty nice. But speaking of getting more VRAM, the Intel Arc Pro B60 can't come soon enough (if they don't insanely overprice it), especially the 48 GB variety: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t...
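For anyone wanting to reproduce the 32k-context setup mentioned above, here's a minimal sketch of asking Ollama's HTTP API for a larger context window via `options.num_ctx` (Ollama's default is much smaller). The model tag and prompt are just illustrative assumptions; adjust to whatever you've pulled locally.

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a request against Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # raise the context window per-request
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("qwen3-coder:30b", "write fizzbuzz in Go", 32768)
# Uncomment with an Ollama server running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```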

behohippy · 7 months ago
Yeah, 48 GB at sub-200W seems like the sweet spot for a single-card setup. Then you can stack as many as you want to reach the model size you need, for whatever you're willing to pay on the power bill.
behohippy commented on Ask HN: Do you think differently about working on open source these days?    · Posted by u/gillyb
behohippy · 7 months ago
Sure, all the slop code projects I produce get MIT licensed on public repos. It wasn't mine to begin with, so I wouldn't prevent anyone from using it.
behohippy commented on Benchmark Framework Desktop Mainboard and 4-node cluster   github.com/geerlingguy/ol... · Posted by u/geerlingguy
loudmax · 7 months ago
The short answer is that the best value is a used RTX 3090 (the long answer being, naturally, it depends). Most of the time, the bottleneck for running LLMs on consumer grade equipment is memory and memory bandwidth. A 3090 has 24GB of VRAM, while a 5080 only has 16GB of VRAM. For models that can fit inside 16GB of VRAM, the 5080 will certainly be faster than the 3090, but the 3090 can run models that simply won't fit on a 5080. You can offload part of the model onto the CPU and system RAM, but running a model on a desktop CPU is an enormous drag, even when only partially offloaded.

Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.

What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
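The bandwidth-bottleneck point above has a simple back-of-envelope form: each generated token has to stream the (active) weights through memory, so memory bandwidth divided by bytes-per-token gives a hard ceiling on decode speed. The numbers below are rough published specs, used purely for illustration.

```python
def max_tokens_per_s(active_params_b: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Upper bound on decode t/s: bandwidth / bytes read per token."""
    bytes_per_token_gb = active_params_b * bytes_per_param  # GB touched per token
    return bandwidth_gb_s / bytes_per_token_gb

# Illustrative: a dense 70B model at 4-bit (~0.5 bytes/param) on a
# 3090 (~936 GB/s) vs. Strix Halo's unified memory (~256 GB/s).
print(round(max_tokens_per_s(70, 0.5, 936), 1))  # ~27 t/s ceiling
print(round(max_tokens_per_s(70, 0.5, 256), 1))  # ~7 t/s ceiling
```

The Strix Halo fits the model where the 3090 can't, but the same arithmetic shows why it decodes several times slower once it does.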

behohippy · 7 months ago
Used 3090s have been getting expensive in some markets. Another option is dual 5060 Ti 16 GB. Mine are the lower-power, single 8-pin variants, so they max out around 180W. With that I'm getting 80 t/s on the new Qwen 3 30B A3B models, and around 21 t/s on Gemma 27B with vision. Cheap and cheerful setup if you can find the cards at MSRP.
behohippy commented on Deepseek R1-0528   huggingface.co/deepseek-a... · Posted by u/error404x
transcriptase · 10 months ago
Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?
behohippy · 10 months ago
About 768 gigs of DDR5 RAM in a dual-socket server board with 12-channel memory, plus a 16 gig or better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s.
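The 768 GB / 8-10 t/s figures roughly check out on paper, assuming DeepSeek R1's widely cited ~671B total / ~37B active (MoE) parameter counts and a ballpark bandwidth figure for 12-channel DDR5; all numbers here are assumptions for illustration, not measurements.

```python
total_params_b = 671    # assumed total parameters, billions
active_params_b = 37    # assumed active parameters per token (MoE)
bytes_per_param = 1.0   # ~Q8 quantization

# Weights alone at Q8 need roughly one byte per parameter.
weights_gb = total_params_b * bytes_per_param

# Decode is bounded by streaming the active experts from RAM each token.
ddr5_12ch_gb_s = 460    # assumed aggregate 12-channel DDR5 bandwidth
tps_ceiling = ddr5_12ch_gb_s / (active_params_b * bytes_per_param)

print(f"weights: ~{weights_gb:.0f} GB")           # fits in 768 GB with headroom
print(f"decode ceiling: ~{tps_ceiling:.0f} t/s")  # same ballpark as 8-10 t/s
```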
behohippy commented on DeepSeek-V3 Technical Report   arxiv.org/abs/2412.19437... · Posted by u/signa11
danielhanchen · a year ago
Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...
behohippy · a year ago
These articles are gold, thank you. I used your gemma one from a few weeks back to get gemma 3 performing properly. I know you guys are all GPU but do you do any testing on CPU/GPU mixes? I'd like to see the pp and t/s on pure 12 channel epyc and the same with using a 24 gig gpu to accelerate the pp.
behohippy commented on Building a personal, private AI computer on a budget   ewintr.nl/posts/2025/buil... · Posted by u/thm
Eisenstein · a year ago
We are talking about dynamically quantizing KV cache, not the model weights.
behohippy · a year ago
I run the KV cache at Q8 even on that model. Is it not working well for you?
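The KV cache being discussed here has a predictable size: two tensors (K and V) per layer, one vector per KV head per position, so quantizing it from FP16 to Q8 halves the footprint. The model config below is an invented GQA-style example, not any specific model's real numbers.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int) -> float:
    """KV cache size: K and V each store one vector per layer, KV head, position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical GQA config: 48 layers, 8 KV heads, head_dim 128, 32k context.
print(kv_cache_gb(48, 8, 128, 32768, 2))  # FP16 cache, ~6.4 GB
print(kv_cache_gb(48, 8, 128, 32768, 1))  # Q8 cache, ~3.2 GB
```

A few GB saved here often matters more than squeezing the weights another quant level down, since it's what buys you longer context.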
behohippy commented on Building a personal, private AI computer on a budget   ewintr.nl/posts/2025/buil... · Posted by u/thm
Eisenstein · a year ago
It highly depends on the model and the context use. A model like command-r, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks highly dependent on context, like translation or evaluation, will be more impacted than, say, code generation or creative output.
behohippy · a year ago
Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping top_p a bit. I think Qwen likes lower temps too.
behohippy commented on Building a personal, private AI computer on a budget   ewintr.nl/posts/2025/buil... · Posted by u/thm
hexomancer · a year ago
Isn't the fact the P40 has horrible fp16 performance a deal breaker for local setups?
behohippy · a year ago
You probably won't be running fp16 anything locally. We typically run Q5 or Q6 quants to maximize the model size and context length we can fit in the VRAM we have available. The quality loss is negligible at Q6.
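The VRAM math behind that choice is straightforward: weight size scales linearly with bits per weight. The bits-per-weight figures below are rough approximations for GGUF-style K-quants (they carry some scale/metadata overhead), used only to show the scaling.

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, ignoring KV cache and overhead."""
    return params_b * bits_per_weight / 8

# Approximate effective bits-per-weight (assumed, includes quant overhead):
for name, bpw in [("FP16", 16), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name:7s} 24B model: ~{model_size_gb(24, bpw):.1f} GB")
```

A 24B model that's hopeless at FP16 on a 24 GB card fits comfortably at Q6, which is exactly why the P40's poor fp16 throughput matters less than it sounds.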
behohippy commented on Ask HN: Is anyone doing anything cool with tiny language models?    · Posted by u/prettyblocks
Dansvidania · a year ago
this sounds pretty cool, do you have any video/media of it?
behohippy · a year ago
I don't have a video but here's a pic of the output: https://imgur.com/ip8GWIh
