Readit News logoReadit News
elsombrero commented on MacBook Pro with M5 Pro and M5 Max   apple.com/newsroom/2026/0... · Posted by u/scrlk
efxhoy · 12 days ago
How is the output quality of the smaller models?
elsombrero · 12 days ago
not good enough for coding anything more than simple scripts.

generally, the less parameters, the less knowledge they have.

elsombrero commented on Ask HN: What Are You Working On? (Nov 2025)    · Posted by u/david927
elsombrero · 4 months ago
A custom provider for kubernetes cluster autoscaler for homelabs that lets you turn on and off the nodes without reprovisioning them.

https://github.com/homecluster-dev/homelab-autoscaler

https://autoscaler.homecluster.dev

Works with any mechanism to turn on and off nodes(IPMI, WoL...) I have some nodes that I turn on and off via a curl to homeassistant to the power plug.

Deleted Comment

elsombrero commented on 25L Portable NV-linked Dual 3090 LLM Rig   reddit.com/r/LocalLLaMA/c... · Posted by u/tensorlibb
ericdotlee · 6 months ago
What is llama-swap?

Been looking for more details about software configs on https://llamabuilds.ai

elsombrero · 6 months ago
https://github.com/mostlygeek/llama-swap

it's a transparent proxy that automatically launches your selected model with your preferred inference server so that you don't need to manually start/stop the server when you want to switch model

so, let's say I have configured roo code to use qwen3 30ba3b as the orchestrator and glm4.5 air as coder, roo code would call the proxy server with model "qwen3" when using orchestrator mode and then kill llama.cpp with qwen3 and restart it with "glm4.5air"

elsombrero commented on 25L Portable NV-linked Dual 3090 LLM Rig   reddit.com/r/LocalLLaMA/c... · Posted by u/tensorlibb
imiric · 6 months ago
Thanks, but I find it hard to believe that a Q1 model would produce decent results.

I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?

elsombrero · 6 months ago
well, I tried it and it works for me. llm output is hard to properly evaluate without actually using it.

I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.

As for using Q2, it will fit in vram, but with very small context or spill over to RAM, but with quite an impact on speed depending on your setup. I have slow ddr4 ram and going for Q1 has been a good compromise for me, but YMMV.

elsombrero commented on 25L Portable NV-linked Dual 3090 LLM Rig   reddit.com/r/LocalLLaMA/c... · Posted by u/tensorlibb
imiric · 6 months ago
I have a similar setup as the author with 2x 3090s.

The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.

The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.

elsombrero · 6 months ago
On my 2x 3090s I am running glm4.5 air q1 and it runs at ~300pp and 20/30 tk/s works pretty well with roo code on vscode, rarely misses tool calls and produces decent quality code.

I also tried to use it with claude code with claude code router and it's pretty fast. Roo code uses bigger contexts, so it's quite slower than claude code in general, but I like the workflow better.

this is my snippet for llama-swap

``` models: "glm45-air": healthCheckTimeout: 300 cmd: | llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M --split-mode layer --tensor-split 0.48,0.52 --flash-attn on -c 82000 --ubatch-size 512 --cache-type-k q4_1 --cache-type-v q4_1 -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K -ngld 99 --kv-unified ```

elsombrero commented on Guess my RGB   susam.net/myrgb.html... · Posted by u/talonx
mg · 2 years ago
I wonder what the optimal strategy would be if you can't see the color and only the answers when you click submit.

Hillclimbing is already somewhat efficient:

    For each slider:
        - Start at 0
        - Move to the right until the score drops
        - Move one to the left
That should result in something like 9 tries per slider on average, so 27 tries per color.

One signal that could be used to improve it: The difference in score between 0 to 1 gives you the approximate length you have to move to the right.

Due to rounding, you don't get the exact length.

So My guess is that with an optimal strategy, on average you would need something like 4 tries per slider.

That comes down to and average of 12 tries per color.

elsombrero · 2 years ago
you could apply a binary search for each slider and improve the number of tries by moving the slider by half of the shortest distance to the edge

u/elsombrero

KarmaCake day560April 9, 2017View Original