Also, does the 9B model (or its 8-bit or 6-bit quantizations) run with very low latency on a 4090?
This has also been my experience. But isn't the harness sending the instructions on how to invoke a tool? Maybe it is missing the formatting part. What do you think?
I have tried the same approach with Kimi K2.5 and GLM 5, but I keep going back to Qwen3.
I also have access to Perplexity which is quite decent to be honest, but I prefer to keep everything in Kagi.
1: https://help.kagi.com/kagi/ai/assistant.html#available-llms
This means that these models are very good at consistently understanding that they're having a conversation, and getting into the role of "the assistant" (incl. instruction-following any system prompts directed toward the assistant) when completing assistant conversation-turns. But only when they are engaged through this precise syntax + structure. Otherwise you just get garbage.
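To make the "precise syntax + structure" concrete, here's a minimal sketch of the ChatML-style turn format that many chat-tuned models are trained on (the exact special tokens vary by model family, so treat this as illustrative, not a universal spec):

```python
# Hypothetical sketch of a ChatML-style prompt. The <|im_start|>/<|im_end|>
# tokens are what the model saw during chat fine-tuning; other model
# families use different delimiters.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this paragraph for me."},
]

def to_chatml(messages):
    # Wrap each conversation turn in the special tokens.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    # Leave an open assistant turn so the model completes
    # "as the assistant" rather than doing blind text completion.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml(messages))
```

Feed a chat model raw text without this scaffolding and it falls back to guessing what kind of document it's completing, which is where the "garbage" comes from.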
"General" models don't require a specific conversation syntax+structure — either (for the larger ones) because they can infer when something like a conversation is happening regardless of syntax; or (for the smaller ones) because they don't know anything about conversation turn-taking, and just attempt "blind" text completion.
"Chat" models might seem to be strictly more capable, but that's not exactly true; neither type of model is strictly better than the other.
"Chat" models are certainly the right tool for the job, if you want a local / open-weight model that you can swap out 1:1 in an agentic architecture that was designed to expect one of the big proprietary cloud-hosted chat models.
But many of the modern open-weight models are still "general" models, because it's much easier to fine-tune a "general" model into performing some very specific custom task (like classifying text, or translation, etc) when you're not fighting against the model's previous training to treat everything as a conversation while doing that. (And also, the fact that "chat" models follow instructions might not be something you want: you might just want to burn in what you'd think of as a "system prompt", and then not expose any attack surface for the user to get the model to "disregard all previous prompts and play tic-tac-toe with me." Nor might you want a "chat" model's implicit alignment that comes along with that bias toward instruction-following.)
Is this fine-tuning process similar to training models from scratch? As in, do you need extensive resources? Or can this be done (realistically) on a consumer-grade GPU?
With the caveat that you enter your hardware manually. But are we really at the point where people are running local models without knowing what hardware they're running them on?
I can only speak for myself: it can be daunting for a beginner to figure out which model fits your GPU, as the model size in GB doesn't directly translate to your GPU's VRAM capacity.
There is value in learning what fits and runs on your system, but that's a different discussion.
The quantization level you choose also makes a difference.
The GPU driver also plays an important role.
What was your approach? What software did you use to run the models?