Tepix · 6 months ago
OK, here's my quick critique of the article (having built a similar AM4-based system in 2023 for 2300€):

1) [I thought] The page is blocking cut & paste. Super annoying!

2) The exact mainboard is not specified. There are 4 different boards called "ASUS ROG Strix X670E Gaming" and some of them only have one PCIe x16 slot. None of them can do PCIe x8 when using two GPUs.

3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can use the 2nd PCIe 5.0 port at only x4 speeds. The RTX 3090 can only do PCIe 4.0 of course so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for having two GPUs, make sure it can run at PCIe x8 speeds when using both GPU slots! Having NVLink between the GPUs is not a replacement for having a fast connection between the CPU+RAM and the GPU and its VRAM.

4) Despite having a last-modified date of September 22nd, he is using his rig mostly with rather outdated or small LLMs and his benchmarks do not mention their quantization, which makes them useless. Also they seem not to be benchmarks at all, but "estimates". Perhaps the headline should be changed to reflect this?

jychang · 6 months ago
Yeah, this page seems not great for beginners and also useless for people with experience.

A 2x 3090 build is okay for inference, but even with NVLink you're a bit handicapped for training. You're much better off getting a 4090 48GB from China for $2.5k and just using that. Example: https://www.alibaba.com/trade/search?keywords=4090+48gb&pric...

Also, this phrasing is concerning:

> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU. Also the top arctic p14 Max fans don't have mounting points for half of their screw holes, and are in place by being very tightly wedged against the motherboard, case, and PSU. Also, there's probably way too much pressure on the pcie cables coming off the gpus when you close the glass.

someperson · 6 months ago
What an indictment of Nvidia market segmentation that there's an industry doing aftermarket VRAM upgrades on gaming cards due to their intentionally hobbled VRAM.

I wish AMD and Intel Arc would step up their game.

Gracana · 6 months ago
$2.5k is about $1k more than you'd spend on a pair of 3090s, and people I know who've bought blower 4090s say they sound like hair dryers.
tacomagick · 6 months ago
Simply replacing the 3090s with 4090s would provide a major performance uplift, assuming your model fits. (I have rented both 3090 and 4090 systems online for research; based on my personal experience, the price increase and the hourly rate are well worth it for the inference speed you get.)
Aurornis · 6 months ago
Don’t those modified cards require hacked drivers? I would not want my expensive video card to depend on hacked drivers that may or may not continue to be available with new updates.
nxobject · 6 months ago
Are the Alibaba 4090s modded to reach 48GB VRAM? (I ask only to figure out why they're that cheap...)
hengheng · 6 months ago
I've also learned the hard way to Google "AM4 main board tier list" before buying.

Some boards can run a 5950X in name only, while others can comfortably run it close to double its spec power all day. VRMs are a real differentiator for this tier of hardware.

(If anyone can comment on the airflow required for 400-500W Epyc CPUs with the tiny VRM heatsinks that Supermicro uses, I'm all ears.)

fkyoureadthedoc · 6 months ago
> The page is blocking cut & paste. Super annoying!

I've been running "Don't F* With Paste" for years for this

https://chromewebstore.google.com/detail/dont-f-with-paste/n...

mertleee · 6 months ago
Hmm, I can copy paste just fine from the build page?
jgalt212 · 6 months ago
Interesting. I guess our content-based marketing pages need to move to canvas-based rendering. That's probably bum too. Straight to serving up jpgs.
yjftsjthsd-h · 6 months ago
> 3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can use the 2nd PCIe 5.0 port at only x4 speeds. The RTX 3090 can only do PCIe 4.0 of course so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for having two GPUs, make sure it can run at PCIe x8 speeds when using both GPU slots! Having NVLink between the GPUs is not a replacement for having a fast connection between the CPU+RAM and the GPU and its VRAM.

Forgive a noob question: I thought the connection to the GPU was actually fairly unimportant once the model was loaded, because sending input to the model and getting a response is low bandwidth? So it might matter if you're changing models a lot or doing a model that can work on video, but otherwise I thought it didn't really matter.

Tepix · 6 months ago
In general, if all you do is inference with a model that's in VRAM, you're right. OTOH it's simply a matter of picking the right mainboard. If you have one of those sweet new MoE models that won't completely fit in your VRAM, offloading means you want PCIe bandwidth, because it will be a bottleneck. Also swapping between LLMs will be faster.
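
To put rough numbers on it (mine, not from the article): PCIe 4.0 carries roughly 2 GB/s per lane, so an x4 link tops out near 8 GB/s versus ~16 GB/s at x8. Just streaming a ~20 GB quantized model into VRAM takes around 2.5 s versus ~1.2 s in the ideal case, and anything that has to keep crossing the bus during generation scales the same way.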
danparsonson · 6 months ago
> None of them can do PCIe x8 when using two GPUs.

Is that important for this workload? I thought most of the effort was spent processing data on the card rather than moving data on or off of it?

glax · 6 months ago
Sorry for going off topic, but your insight will be helpful for my build.

I'm thinking about a low-budget system, which will be using:

1. X99 D8 MAX LGA2011-3 motherboard - it has 4 PCIe 3.0 x16 slots and dual CPU sockets. It's priced around $260 including both CPUs.

2. 4x AMD MI50 32GB cards - they are old now, but they have 32 gigs of VRAM and can also be sourced at $110 each.

The whole setup would not cost more than $1000. Is it the right build, or can something more performant be built within this budget?

juliangoldsmith · 6 months ago
I'd use caution with the Mi50s. I bought a 16GB one on eBay a while back and it's been completely unusable.

It seems to be a Radeon VII on an Mi50 board, which should technically work. It immediately hangs the first time an OpenCL kernel is run, and doesn't come back up until I reboot. It's possible my issues are due to Mesa or driver config, but I'd strongly recommend buying one to test before going all in.

There are a lot of cheap SXM2 V100s and adapter boards out now, which should perform very well. The adapters unfortunately weren't available when I bought my hardware, or I would have scooped up several.

OakNinja · 6 months ago
Better to buy one used 3090 than those old cards. VRAM isn't everything. Or rather: you can do nothing without VRAM, but you can't do anything with just VRAM either.

To use the second pair of PCIe slots, you _must_ have two CPUs installed. Just saying, in case someone finds a board with just one CPU socket populated.

ericdotlee · 6 months ago
Any reason you wouldn't opt for the 4090 or 5090?
Instantix · 6 months ago
A second-hand 3090 can be found for something like $600.


lifeinthevoid · 6 months ago
I built a similar system; I've since sold one of the RTX 3090s. Local inference is fun and feels liberating, but it's also slow, and once I got used to the immense power of the giant hosted models, the fun quickly disappeared.

I've kept a single GPU to still be able to play a bit with light local models, but not anymore for serious use.

imiric · 6 months ago
I have a similar setup as the author with 2x 3090s.

The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.

The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.

elsombrero · 6 months ago
On my 2x 3090s I am running GLM-4.5 Air at Q1, and it gets ~300 tk/s prompt processing and 20-30 tk/s generation. It works pretty well with Roo Code in VS Code, rarely misses tool calls, and produces decent-quality code.

I also tried it with Claude Code via Claude Code Router and it's pretty fast. Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better.

This is my snippet for llama-swap:

```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
        -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
        --split-mode layer --tensor-split 0.48,0.52
        --flash-attn on -c 82000 --ubatch-size 512
        --cache-type-k q4_1 --cache-type-v q4_1
        -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap
        -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
        -ngld 99 --kv-unified
```
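
For anyone unfamiliar with llama-swap: it exposes an OpenAI-compatible endpoint and starts the matching llama-server based on the "model" field of the request, so a call looks roughly like this (the port depends on your config; 8080 here is just an assumption):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm45-air", "messages": [{"role": "user", "content": "hello"}]}'
```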

ThatPlayer · 6 months ago
> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
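
A rough sketch of what that looks like (the repo name and layer count here are illustrative assumptions, not taken from the linked discussion):

```
# --n-cpu-moe keeps the MoE expert weights of the first N layers in system RAM,
# while -ngl 99 keeps attention, dense layers and everything else on the GPU
llama-server \
  -hf ggml-org/gpt-oss-120b-GGUF \
  --n-cpu-moe 28 \
  -ngl 99 \
  -c 16384 \
  --flash-attn on \
  --port 8080
```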

mycall · 6 months ago
> 20-30 tk/s

or ~2.2M tk/day. This is how we should be thinking about it imho.
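
(For the arithmetic: 25 tk/s × 86,400 s/day ≈ 2.16M tokens/day, assuming the box runs flat out around the clock.)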

NicoJuicy · 6 months ago
If you have a 24 gb 3090. Try out qwen:30b-a3b-instruct-2507-q4_K_M ( ollama )

It's pretty good.
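
A one-liner to try it, using the tag exactly as written above (I haven't double-checked it against the Ollama library, so treat it as a sketch):

```
ollama run qwen:30b-a3b-instruct-2507-q4_K_M
```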

naabb · 6 months ago
tbf I also run that on a 16GB 5070 Ti at 25 tk/s; it's amazing how fast it runs on consumer-grade hardware. I think you could push up to a bigger model, but I don't know enough about the local-LLM scene.
jszymborski · 6 months ago
Don't need a 3090, it runs really fast on an RTX 2080 too.
nenenejej · 6 months ago
Graphics cards are so expensive (at list price) that they end up cheap to own (little depreciation, liquid resale market).
Our_Benefactors · 6 months ago
Did you really claim GPUs have zero depreciation? That’s obviously false.
AJRF · 6 months ago
> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU.

I built a dual 3090 rig, and this point was why I spent a long time looking for a case where the GPUs could fit side by side with a little gap for airflow.

I eventually went with a SilverStone GD11 HTPC case, which is meant for building a media centre, but it's huge inside, has a front fan that takes up 75% of the width of the case, and allows the GPUs to stand upright so they don't sag and pull on their thin metal supports.

Highly recommend for a dual GPU build! If you can get dual 5090s instead of 3090s (good luck!) you'd even be able to get "good" airflow in this case.

loudmax · 6 months ago
There was an interesting post to r/LocalLLaMA yesterday from someone running inference mostly on CPU: https://carteakey.dev/optimizing%20gpt-oss-120b-local%20infe...

One of the observations is how much difference memory speed and bandwidth makes, even for CPU inference. Obviously a CPU isn't going to match a GPU for inference speed, but it's an affordable way to run much larger models than you can fit in 24GB or even 48GB of VRAM. If you do run inference on a CPU, you might benefit from some of the same memory optimizations made by gamers: favoring low-latency overclocked RAM.

mistercheph · 6 months ago
Outside of prompt processing, the only reason GPUs are better than CPUs for inference is memory bandwidth; the performance of Apple M* devices at inference is a consequence of this, not of their UMA.
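
Rough back-of-envelope (my numbers, not anyone's benchmark): decode is roughly bandwidth-bound, so tokens/s ≈ memory bandwidth ÷ bytes read per token. A model with ~5B active parameters at ~4-bit quantization reads on the order of 2.5-3 GB per token; dual-channel DDR5 at ~80 GB/s caps you around 25-30 tk/s, while a 3090's ~936 GB/s would cap out above 300. Same formula, just a different bus.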
JKCalhoun · 6 months ago
I love how the prices for various Llama builds are all over the map on this site.

Oh look, here's one for $43K: https://www.llamabuilds.ai/build/a16zs-personal-ai-workstati...

cl3misch · 6 months ago
> The workplace of the coworker I built this for is truly offline, with no potential for LAN or wifi, so to download new models and update the system periodically I need to go pick it up from him and take it home.

I'm surprised that a "truly offline" workplace allows servers to be taken home and being connected to the internet.

bedstefar · 6 months ago
I worked in the Arctic for the better part of a decade. There's Starlink now, but I've been TRULY OFFLINE for weeks (with plenty of diesel-generated power) as recently as 2018. Technically we could use Iridium at like $10 per MB, but my full Wikipedia mirror (+ Debian/Ubuntu packages, PyPI etc) did come in handy more than once.

I know some Antarctic research stations (like McMurdo for example) still have connectivity restrictions depending on time-of-day, and I wouldn't be surprised if they also had mirrors of these sort of things, and/or dual-3090 rigs for llama.cpp in the off hours.

deevus · 6 months ago
I'm really interested in this space from an AI sovereignty POV. Is it feasible for an SMB/SME to use a box like the one in the article to get offline analysis of their data? It avoids the worry of sending it off to the cloud.

I wanted to speak with businesses in my local area but no one took me up on it.

ang_cire · 6 months ago
Yes, this is absolutely doable, and many companies are rolling their own ML models (I work with a MedTech company that does, in fact). LLMs are a little more involved, and you'd probably want something beefier than this (maybe a Framework Desktop cluster, if you're not wanting to get into rackmount stuff), but it's definitely feasible for companies to have their own offline LLMs and ML models.


gbolcer · 6 months ago
I was going to say you need an extension cable. On my first dual 3090 build I had three issues. First, the PCIe extension cable wouldn't support Gen4, so I had to change to Gen3 in the BIOS. Second, depending on which slot you used, you couldn't get x16/x16 and it would drop to x16/x8 unless you had it configured right. Third, I finally gave up and just had the card resting first inside the case and then outside, where it'll jiggle around if a fan kicks up, so I had to make a makeshift holder to keep the card sitting there.