I also tried using it with Claude Code via Claude Code Router, and it's pretty fast. Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better.
this is my snippet for llama-swap
```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
      -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
      --split-mode layer --tensor-split 0.48,0.52
      --flash-attn on
      -c 82000 --ubatch-size 512
      --cache-type-k q4_1 --cache-type-v q4_1
      -ngl 99 --threads -1
      --port ${PORT} --host 0.0.0.0
      --no-mmap
      -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
      -ngld 99
      --kv-unified
```
Been looking for more details about software configs on https://llamabuilds.ai
I used:
Very high quality and manageable prices.
1) [I thought] The page is blocking cut & paste. Super annoying!
2) The mainboard is not specified exactly. There are 4 different boards called "ASUS ROG Strix X670E Gaming" and some of them only have one PCIe x16 slot. None of them can do PCIe x8 when using two GPUs.
3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can use the 2nd PCIe 5.0 port at only x4 speeds. The RTX 3090 can only do PCIe 4.0 of course so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for having two GPUs, make sure it can run at PCIe x8 speeds when using both GPU slots! Having NVLink between the GPUs is not a replacement for having a fast connection between the CPU+RAM and the GPU and its VRAM.
4) Despite having a last-modified date of September 22nd, he is using his rig mostly with rather outdated or small LLMs and his benchmarks do not mention their quantization, which makes them useless. Also they seem not to be benchmarks at all, but "estimates". Perhaps the headline should be changed to reflect this?
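To put numbers on point 3, here is a rough back-of-the-envelope sketch of per-direction PCIe link bandwidth. It assumes the standard per-lane transfer rates (8, 16 and 32 GT/s for PCIe 3.0, 4.0 and 5.0) and 128b/130b line encoding, and it ignores packet and protocol overhead, so real throughput will be somewhat lower. The function names are made up for illustration.

```python
# Rough per-direction PCIe bandwidth estimate.
# Assumptions: standard per-lane rates, 128b/130b encoding, no protocol overhead.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # raw transfer rate per lane in GT/s
ENCODING = 128 / 130                   # line-encoding efficiency for PCIe 3.0 to 5.0


def negotiated_gen(slot_gen: int, card_gen: int) -> int:
    """A link runs at the highest generation both ends support."""
    return min(slot_gen, card_gen)


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Approximate usable bandwidth in GB/s for one direction of the link."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


if __name__ == "__main__":
    # RTX 3090 (PCIe 4.0 card) in a PCIe 5.0 slot wired for x4, x8 or x16.
    gen = negotiated_gen(slot_gen=5, card_gen=4)
    for lanes in (4, 8, 16):
        print(f"PCIe {gen}.0 x{lanes}: {pcie_gb_per_s(gen, lanes):.1f} GB/s")
```

Under these assumptions a PCIe 4.0 x4 link tops out at roughly 7.9 GB/s, half of what an x8 link offers (about 15.8 GB/s) and a quarter of a full x16 link.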
Curious if anyone here has a similar config of 1-4 RTX 3060s? Trying to decide if picking up a few of these is a good value or if I should just continue renting cloud GPUs.
However, the internet seems littered with "clever" local AI monstrosities that gang together 4-6 ancient NVIDIA GPUs (priced today to seem like overpriced e-waste) to get lackluster performance from piles of NVIDIA M60s and P100s. In 2025 this kind of seems like a waste, or just bad advice, to use hardware this old?
Curious if this find seems like a good source of info about staying away from Intel and AMD GPUs for local inference? Might do some training, but right now I'm more interested in light RAG and maybe some local coding.
Hoping to build something before the holiday season to keep my office warm with GPUs :).
Thanks!