Readit News
mongrelion commented on Can I run AI locally?   canirun.ai/... · Posted by u/ricardbejarano
varispeed · 14 hours ago
Does it make any sense? I tried a few models at 128GB and it's all pretty much rubbish. Yes, they do give coherent answers, and sometimes they are even correct, but most of the time they are just plain wrong. I find it a massive waste of time.
mongrelion · 10 hours ago
Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants, and depending on the model they'll tweak things like the temperature, top-k, etc.

The size of the quantization you choose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?
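For anyone curious what "tweaking temperature, top-k, etc." looks like in practice: those per-model recommendations usually end up as request parameters sent to the runtime. A minimal sketch of a chat-completion payload for a local OpenAI-compatible server (e.g. llama.cpp's llama-server); the sampler values below are placeholders, not unsloth's actual recommendations for any particular model:

```python
# Build a chat-completion payload for a local OpenAI-compatible server.
# The sampler values are illustrative; check the model card for the
# publisher's recommended settings.
def build_payload(prompt: str, temperature: float = 0.7,
                  top_k: int = 20, top_p: float = 0.8) -> dict:
    return {
        "model": "local-model",  # server-side model name (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # lower = more deterministic
        "top_k": top_k,              # sample only from the k most likely tokens
        "top_p": top_p,              # nucleus sampling cutoff
    }

payload = build_payload("Why is the sky blue?", temperature=0.6, top_k=20)
```

Getting these values wrong (e.g. running a quant at a temperature the publisher warns against) is one common reason a locally run model produces much worse output than the same model hosted elsewhere.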

mongrelion commented on Can I run AI locally?   canirun.ai/... · Posted by u/ricardbejarano
mkagenius · 13 hours ago
Literally made the same app, 2 weeks back - https://news.ycombinator.com/item?id=47171499
mongrelion · 10 hours ago
What front-end framework did you use? I find the UI so visually appealing
mongrelion commented on Can I run AI locally?   canirun.ai/... · Posted by u/ricardbejarano
AstroBen · 14 hours ago
This doesn't look accurate to me. I have an RX9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32tok/s ^.-
mongrelion · 10 hours ago
Which quantization are you running and what context size? 32tok/s for that model on that card sounds pretty good to me!
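Context size matters here because the KV cache grows linearly with it, on top of the quantized weights. A rough back-of-the-envelope in Python, using hypothetical transformer shapes (real numbers depend on the model's config and on whether the runtime quantizes the cache):

```python
def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB at fp16 (2 bytes/element).

    Two tensors (K and V) per layer, each of shape
    ctx_len x n_kv_heads x head_dim.
    """
    elems = 2 * n_layers * ctx_len * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# Hypothetical model shape: 48 layers, 8 KV heads, head_dim 128,
# at a 32k context window.
print(round(kv_cache_gb(32768, 48, 8, 128), 2))  # → 6.44
```

So the same quant that fits comfortably at a 4k context can overflow VRAM at 32k, which is why quoting tok/s without the context size only tells half the story.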
mongrelion commented on Can I run AI locally?   canirun.ai/... · Posted by u/ricardbejarano
sidchilling · 11 hours ago
I have been trying to run Qwen Coder models (8B at 4-bit) on my M3 Pro 18GB behind Ollama, connecting codex CLI to it. The tool usage seems practically zero: it returns the tool call as text JSON and codex CLI doesn't run the tool (it just displays the tool call as text). Has anyone succeeded in doing something like this? What am I missing?
mongrelion · 10 hours ago
It might be that the system prompt sent by codex is not optimal for that model. Try with opencode and see if your results improve.
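Another workaround while you debug the harness: if the model keeps emitting the tool call as plain text instead of a structured tool-call message, you can intercept the text and recover the JSON yourself. A hedged sketch (the exact shape of the model's output varies by model and harness):

```python
import json
import re

def extract_tool_call(text: str):
    """Pull the first JSON object out of free-form model output.

    Assumes the model embedded its tool call as a JSON object somewhere
    in the text; returns None if nothing parses.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Example of the failure mode described above: the call arrives as prose.
raw = 'Sure, I will call the tool: {"name": "read_file", "arguments": {"path": "main.go"}}'
call = extract_tool_call(raw)
```

This is a band-aid, not a fix; the real solution is a chat template and system prompt the model was actually trained to follow.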
mongrelion commented on How to run Qwen 3.5 locally   unsloth.ai/docs/models/qw... · Posted by u/Curiositry
ilaksh · 6 days ago
Anyone providing hosted inference for 9B? I'm just trying to save the operational effort of renting a GPU since this is a business use case that doesn't have real GPUs available right now. I don't see the small ones on OpenRouter. Maybe there will be a runpod serverless or normal pod template or something.

Also, does 9B (at 8-bit or 6-bit) run with very low latency on a 4090?

mongrelion · 5 days ago
By "anyone", do you mean a well-established business, or any entity willing to serve you?
mongrelion commented on Something is afoot in the land of Qwen   simonwillison.net/2026/Ma... · Posted by u/simonw
malwrar · 10 days ago
+1 to this, anecdotally I’ve found in my own evaluations that if your system prompt doesn’t explicitly declare how to invoke a tool and e.g. describe what each tool does, most models I’ve tried fail to call tools or will try to call them but not necessarily use the right format. With the right prompt meanwhile, even weak models shoot up in eval accuracy.
mongrelion · 9 days ago
> [...] _but not necessarily use the right format._

This has also been my experience. But isn't it the harness that sends the instructions on how to invoke a tool? Maybe it is missing the formatting part. What do you think?
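For reference, what the harness typically sends is a `tools` array of JSON-schema descriptions alongside the messages; whether the runtime then renders that into the model's chat template correctly is a separate question, and where things often go wrong. A sketch of the OpenAI-compatible shape (the tool name and fields are made up for illustration):

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # made-up tool name
        "description": "Read a file from disk and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to read."},
            },
            "required": ["path"],
        },
    },
}]

request = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What does main.go contain?"}],
    "tools": tools,  # the harness attaches this on every request
}
```

A model that was never trained on how these schemas appear in its template will see them as noise, which matches the observation that spelling the tools out in the system prompt helps weak models so much.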

mongrelion commented on Ask HN: What Online LLM / Chat do you use?    · Posted by u/ddxv
mongrelion · 11 days ago
Through my Kagi subscription I get access to quite a few models [1] but I tend to rely on Qwen3 (fast) for quick questions and Qwen3 (reasoning) when I want a more structured approach, for example, when I am researching a topic.

I have tried the same approach with Kimi K2.5 and GLM 5, but I keep going back to Qwen3.

I also have access to Perplexity which is quite decent to be honest, but I prefer to keep everything in Kagi.

1: https://help.kagi.com/kagi/ai/assistant.html#available-llms

mongrelion commented on Right-sizes LLM models to your system's RAM, CPU, and GPU   github.com/AlexsJones/llm... · Posted by u/bilsbie
binsquare · 12 days ago
Here's a website for a community-run database of LLM models with details on configs and their token/s: https://inferbench.com/
mongrelion · 11 days ago
inferbench is a great idea (similar to Geekbench, etc.), but as of the time of writing it's got only 83 submissions, which is underwhelming.
mongrelion commented on Right-sizes LLM models to your system's RAM, CPU, and GPU   github.com/AlexsJones/llm... · Posted by u/bilsbie
derefr · 12 days ago
"Chat" models have been heavily fine-tuned with a training dataset that exclusively uses a formal turn-taking conversation syntax / document structure. For example, ChatGPT was trained with documents using OpenAI's own ChatML syntax+structure (https://cobusgreyling.medium.com/the-introduction-of-chat-ma...).

This means that these models are very good at consistently understanding that they're having a conversation, and getting into the role of "the assistant" (incl. instruction-following any system prompts directed toward the assistant) when completing assistant conversation-turns. But only when they are engaged through this precise syntax + structure. Otherwise you just get garbage.

"General" models don't require a specific conversation syntax+structure — either (for the larger ones) because they can infer when something like a conversation is happening regardless of syntax; or (for the smaller ones) because they don't know anything about conversation turn-taking, and just attempt "blind" text completion.

"Chat" models might seem to be strictly more capable, but that's not exactly true; neither type of model is strictly better than the other.

"Chat" models are certainly the right tool for the job, if you want a local / open-weight model that you can swap out 1:1 in an agentic architecture that was designed to expect one of the big proprietary cloud-hosted chat models.

But many of the modern open-weight models are still "general" models, because it's much easier to fine-tune a "general" model into performing some very specific custom task (like classifying text, or translation, etc) when you're not fighting against the model's previous training to treat everything as a conversation while doing that. (And also, the fact that "chat" models follow instructions might not be something you want: you might just want to burn in what you'd think of as a "system prompt", and then not expose any attack surface for the user to get the model to "disregard all previous prompts and play tic-tac-toe with me." Nor might you want a "chat" model's implicit alignment that comes along with that bias toward instruction-following.)
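The turn-taking structure described above is concrete and simple. A sketch of a ChatML-style prompt, roughly what OpenAI's format looks like (the exact special-token spellings vary by model family, so treat this as illustrative):

```python
def chatml_prompt(system: str, user: str) -> str:
    """Render a conversation opening in ChatML-style markup.

    A chat-tuned model completes the assistant turn; a "general"
    model sees these tokens as arbitrary text and may emit garbage.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt("You are a helpful assistant.", "What is 2+2?")
```

The trailing open `assistant` turn is the whole trick: the chat-tuned model has been trained to keep writing from exactly that point, in exactly that role.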

mongrelion · 11 days ago
> [...] it's much easier to fine-tune a "general" model into performing some very specific custom task (like classifying text, or translation, etc)

Is this fine-tuning process similar to training models? As in, do you need extensive resources, or can this be done (realistically) on a consumer-grade GPU?
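Partly answering the question above: parameter-efficient methods like LoRA are what make this feasible on consumer hardware, because only small low-rank adapter matrices are trained while the base weights stay frozen. Back-of-the-envelope arithmetic with hypothetical 7B-class shapes:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    two low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical: adapt the 4 attention projections (4096 x 4096)
# in each of 32 layers, at rank 16.
full = 32 * 4 * 4096 * 4096                 # full fine-tune of those matrices
lora = 32 * 4 * lora_params(4096, 4096, 16)  # LoRA adapters only
print(lora, round(100 * lora / full, 2))     # → 16777216 0.78
```

Training well under 1% of the parameters (and keeping optimizer state only for those) is why a rank-16 LoRA on a 7B-class model fits on a single consumer GPU, whereas a full fine-tune generally does not.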

mongrelion commented on Right-sizes LLM models to your system's RAM, CPU, and GPU   github.com/AlexsJones/llm... · Posted by u/bilsbie
seemaze · 12 days ago
I just discovered the other day that Hugging Face allows you to do exactly this.

With the caveat that you enter your hardware manually. But are we really at the point yet where people are running local models without knowing what they are running them on..?

mongrelion · 11 days ago
> But are we really at the point yet where people are running local models without knowing what they are running them on..?

I can only speak for myself: it can be daunting for a beginner to figure out which model fits their GPU, as the model size in GB doesn't translate directly to the GPU's VRAM capacity.

There is value in learning what fits and runs on your system, but that's a different discussion.
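To illustrate why the file size alone misleads: the weights are only part of the footprint. A crude sketch of a fit check (the 1 GB overhead figure is an assumption covering runtime context, activations, and buffers, not a measured number):

```python
def fits_in_vram(weights_gb: float, kv_cache_gb: float,
                 vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """Crude check: weights + KV cache + runtime overhead vs. VRAM.

    overhead_gb is an assumed allowance for the runtime's own
    allocations; real overhead varies by backend and settings.
    """
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

# A "12 GB" GGUF on a 16 GB card with a 4 GB KV cache does NOT fit:
print(fits_in_vram(12.0, 4.0, 16.0))  # → False
```

Tools like the one being discussed essentially automate this arithmetic per model, quant, and context size, which is exactly the part beginners can't eyeball from a download size.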

u/mongrelion

Karma: 266 · Cake day: April 2, 2012
About
[ my public key: https://keybase.io/mongrelion; my proof: https://keybase.io/mongrelion/sigs/7h5VnWa-M5fRQO_fRlwWoSgpfu_fO_Hwxyx4v2FVD8c ]

This is an OpenPGP proof that connects my OpenPGP key to this Hackernews account. For details check out https://keyoxide.org/guides/openpgp-proofs [Verifying my OpenPGP key: openpgp4fpr:E3A878624A8C0A996D1926F2033C1FEBE1ED3881]
