Do 2bit quantizations really work? All the ones I've seen/tried were completely broken even when 4bit+ quantizations worked perfectly. Even if it works for these extremely large models, is it really much better than using something slightly smaller on 4 or 5 bit quant?
I had given up a long time ago on self-hosted transformer models for coding because the SOTA was definitely in favor of SaaS. This might just make me give it another try.
Would llama.cpp support multiple (rtx 3090, no nvlink hw bridge) GPUs over PCIe4? (Rest of the machine is 32 CPU cores, 256GB RAM)
How fast you run this model will strongly depend on whether you have DDR4 or DDR5 RAM.
You will be mostly using 1 of your 3090s. The other one will be basically doing nothing. You CAN put the MoE weights on the 2nd 3090, but it's not going to speed up inference much, like <5% speedup. As in, if you lack a GPU, you'd be looking at <1 token/sec speeds depending on how fast your CPU does flops, and if you have a single 3090 you'd be doing 10 tokens/sec, but with 2 3090s you'll still just be doing maybe 11 tokens/sec. These numbers are made up, but you get the idea.
Qwen3 Coder 480B is 261GB for IQ4_XS, 276GB for Q4_K_XL, so you'll be putting all the expert weights in RAM. That's why your RAM bandwidth is your limiting factor. I hope you're running off a workstation with dual CPUs and 12 sticks of DDR5 RAM per CPU, which gives you 24-channel DDR5.
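To see why bandwidth dominates, here's a rough back-of-envelope sketch (assumptions on my part: ~35B active parameters per token for this MoE, ~4.5 bits/weight for a Q4-ish quant, and theoretical peak bandwidth; real-world numbers will be lower):

    # Upper bound on decode speed when the expert weights stream from system RAM:
    # each generated token has to read roughly the active parameters once.
    def est_tokens_per_sec(bandwidth_gb_s, active_params_b=35.0, bits_per_weight=4.5):
        bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    for label, bw in [("dual-channel DDR4-3200", 51.2),
                      ("8-channel DDR5-4800", 307.2),
                      ("24-channel DDR5-4800", 921.6)]:
        print(f"{label}: ~{est_tokens_per_sec(bw):.1f} tok/s ceiling")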
Thank you for your work. Does Qwen3-Coder offer a significant advantage over Qwen2.5-Coder for non-agentic tasks like plain autocomplete and chat?
Yes! 3-bit and maybe even 4-bit can also fit! llama.cpp has MoE offloading, so your GPU holds the active experts and non-MoE layers; thus you only need 16GB to 24GB of VRAM! I wrote about how to do it in this section: https://docs.unsloth.ai/basics/qwen3-coder#improving-generat...
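A minimal sketch of what that launch can look like, borrowing the --override-tensor trick from the llama-server command further down this thread (the GGUF filename and context size here are placeholders, not the exact files Unsloth ships):

    llama-server --model Qwen3-Coder-480B-A35B-Q2_K_XL.gguf --n-gpu-layers 99 --ctx-size 16384 --flash-attn --override-tensor "blk\.([0-9]+)\.ffn_.*_exps\.=CPU"

The regex keeps every layer's routed-expert tensors in system RAM while everything else stays on the GPU.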
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since for the foreseeable future, I'll probably sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own computer, I love having the option of high-quality open-weight models for this, and I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
> I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close.
Likewise, I found that the regular Qwen3-30B-A3B worked pretty well on a pair of L4 GPUs (60 tokens/second, 48 GB of memory) which is good enough for on-prem use where cloud options aren't allowed, but I'd very much like a similar code specific model, because the tool calling in something like RooCode just didn't work with the regular model.
In those circumstances, it isn't really a comparison between cloud and on-prem, it's on-prem vs nothing.
30B-A3B works extremely well as a generalist chat model when you pair it with scaffolding such as web search. It's fast (for me) using my workstation at home running a 5070 + 128GB of DDR4-3200 RAM @ ~28 tok/s. Love MoE models.
Sadly it falls short during real world coding usage, but fingers crossed that a similarly sized coder variant of Qwen 3 can fill in that gap for me.
This is my script for the Q4_K_XL version from unsloth at 45k context:
llama-server.exe --host 0.0.0.0 --no-webui --alias "Qwen3-30B-A3B-Q4_K_XL" --model "F:\models\unsloth\Qwen3-30B-A3B-128K-GGUF\Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf" --ctx-size 45000 --n-gpu-layers 99 --slots --metrics --batch-size 2048 --ubatch-size 2048 --temp 0.6 --top-p 0.95 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.1 --jinja --reasoning-format deepseek --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn --no-mmap --threads 8 --cache-reuse 256 --override-tensor "blk\.([0-9][02468])\.ffn_.*_exps\.=CPU"
I love Qwen3-30B-A3B for translation and fixing up transcripts generated by automatic speech recognition models. It's not the most stylish translator (a bit literal), but it's generally better than the automatic translation features built into most apps, and it's much faster since there's no network latency.
It has also been helpful (when run locally, of course) for addressing questions-- good faith questions, not censorship tests to which I already know the answers-- about Chinese history and culture that the DeepSeek app's censorship is a little too conservative for. This is a really fun use case actually, asking models from different parts of the world to summarize and describe historical events and comparing the quality of their answers, their biases, etc. Qwen3-30B-A3B is fast enough that this can be as fun as playing with the big, commercial, online models, even if its answers are not equally detailed or accurate.
Give Devstral a try; fp8 should fit in 48GB. It was surprisingly good for a 24B local model with Cline/Roo. It handles itself well, doesn't get stuck much, and most things work OK (considering the size, of course).
Currently, everyone's goal is to create one master model to rule them all, so we haven't seen much specialization. I wonder how much more efficient smaller models could be if we created language-specialized models.
It feels intuitively obvious (so maybe wrong?) that a 32B Java Coder would be far better at coding Java than a generalist 32B Coder.
I’ll take the role of pushing back on your Java coder idea!
First, Java code tends to be written a certain way, and for certain goals and business domains.
Let’s say 90% of modern Java is a mix of:
* students learning to program and writing algorithms
* corporate legacy software from non-tech focused companies
If you want to build something that is uncommon in that subset, it will likely struggle due to a lack of training data. And if you wanted to build something like a game, the majority of your training data is going to be based on ancient versions of Java, back when game development was more common in Java.
Comparatively, including C in your training data gives you exposure to a whole separate set of domain data for training, like IoT devices, kernels, etc.
Including Go will likely include a lot more networking and infrastructure code than Java would have had, which means there is also more context to pull from in what networking services expect.
Code for those domains follows different patterns, but the concepts can still be useful in writing Java code.
Now, there may be a middle ground where you could have a model that is still general for many coding languages, but given extra data and fine-tuning focused on domain-specific Java things — like more of a “32B CorporateJava Coder” model — based around the very specific architecture of Spring. And you’d be willing to accept that model to fail at writing games in Java.
It’s interesting to think about for sure - but I do feel like domain-specific might be more useful than language-specific
Small models can never match bigger models; the bigger models just know more and are smarter. The smaller models can get smarter, but as they do, the bigger models get smarter too. HN is weird because at one point this was the location where I found the most technical folks, and now for LLMs I find them at Reddit. Tons of folks are running huge models; get to researching and you will find out you can realistically host your own.
> Small models can never match bigger models; the bigger models just know more and are smarter.
They don't need to match bigger models, though. They just need to be good enough for a specific task!
This is more obvious when you look at the things language models are best at, like translation. You just don't need a super huge model for translation, and in fact you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
I'll also say that due to the hallucination problem, beyond whatever knowledge is required for being more or less coherent and "knowing" what to write in web search queries, I'm not sure I find more "knowledgeable" LLMs very valuable. Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.) And if an LLM I use is always going to be consulting documentation at runtime, maybe that knowledge difference isn't quite so vital— summarization is one of those things that seems much, much easier for language models than writing code or "reasoning".
All of that is to say:
Sure, bigger is better! But for some tasks, my needs are still below the ceiling of the capabilities of a smaller model, and that's where I'm focusing on local usage. For now that's mostly language-focused tasks entirely apart from coding (translation, transcription, TTS, maybe summarization). It may also include simple coding tasks today (e.g., fancy auto-complete, "ghost-text" style). I think it's reasonable to hope that it will eventually include more substantial programming tasks— even if larger models are still preferable for more sophisticated tasks (like "vibe coding", maybe).
If I end up having a lot of fun, in a year or two I'll probably try to put together a machine that can indeed run larger models. :)
> HN is weird because at one point this was the location where I found the most technical folks, and now for LLMs I find them at Reddit.
Is this an effort to chastise the viewpoint advanced? Because his viewpoint makes sense to me: I can run biggish models on my 128GB MacBook but not huge ones -- even 2-bit quantized ones suck up too many resources.
So I run a combination of local stuff and remote stuff depending upon various factors (cost, sensitivity of information, convenience/whether I'm at home, amount of battery left, etc ;)
Yes, bigger models are better, but often smaller is good enough.
The large models are using tools/functions to make them useful. Sooner or later open source will provide a good set of tools/functions for coding as well.
I'd be interested in smaller models that were less general, with a training corpus more concentrated. A bash scripting model, or a clojure model, or a zig model, etc.
Well, yes, tons of people are running them, but they're all pretty well off.
I don't have $10-20k to spend on this stuff, which is about the minimum to run a 480B model, even with heavy quantisation. And it would be pretty slow, because for that price all you get is an old Xeon with a lot of memory or some old Nvidia datacenter cards. If you want a good setup it will cost a lot more.
So small models it is. Sure, the bigger models are better but because the improvements come so fast it means I'm only 6 months to a year behind the big ones at any time. Is that worth 20k? For me no.
Not really true. Gemma from Google with quantization-aware training does an amazing job.
Under the hood, the way it works is that when you have the final probabilities, it really doesn't matter if the most likely token is selected with 59% or 75% -- in either case it gets selected. If the 59% case gets there with a smaller amount of compute, and that holds across the board for the training set, the model will have similar performance.
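A toy sketch of that argmax point (greedy decoding; the numbers are made up):

    import math

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    big   = softmax([2.3, 1.0, 0.0])   # more confident about the top token
    small = softmax([1.8, 1.0, 0.3])   # less confident, but same ranking
    print(big, small)
    # Greedy decoding picks the same token either way.
    assert big.index(max(big)) == small.index(max(small))

(This only holds exactly for greedy decoding; with temperature sampling, the gap in confidence does start to matter.)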
In theory, it should be possible to narrow down models even smaller to match the performance of big models, because I really doubt that you do need transformers for every single forward pass. There are probably plenty of shortcuts you can take in terms of compute for sets of tokens in the context. For example, coding structure is much more deterministic than natural text, so you probably don't need as much compute to generate accurate code.
You do need a big model first to train a small model though.
As for running huge models locally, it's not enough to run them, you need good throughput as well. If you spend $2k on a graphics card, that is way more expensive than realistic usage with a paid API, and you get slower output as well.
What seems to be typical these days is that big companies ship the first tool very fast, in poor condition (this applies to Gemini CLI as well), and then let the OSS ecosystem fix the issues. The backend is closed, so the app is their best shot. Then after some time the company gets most of the credit, not the contributors.
I currently use claude-code as the director basically, but outsource heavy thinking to openai and gemini pro via zen mcp. I could instead use gemini-cli as it's also supported by zen. I would imagine it's trivial to add qwen-coder support if it's based on gemini-cli.
I’ve instead used Gemini via plain ol’ chat, first building a competitive, larger context than Claude can hold, then manually bringing detailed plans and patches to Gemini for feedback, with excellent results.
I presumed mcp wouldn’t give me the focused results I get from completely controlling Gemini.
And that making CC interface via the MCP would also use up context on that side.
We shipped RA.Aid, an agentic evolution of what aider started, back in late '24, well before CC shipped.
Our main focuses were to be 1) CLI-first and 2) truly an open source community. We have 5 independent maintainers with full commit access --they aren't from the same org or entity (disclaimer: one has joined me at my startup Gobii where we're working on web browsing agents.)
I'd love someone to do a comparison with CC, but IME we hold our own against Cursor, Windsurf, and other agentic coding solutions.
But yes, there really needs to be a canonical FOSS solution that is not tied to any specific large company or model.
> I hope these OSS CC clones converge at some point.
Imo, the point of custom CLIs is that each model is trained to handle tool calls differently. In my experience, the tool call performance is wildly different (although they have started converging recently). Convergence is meaningful only when the models and their performance are commoditized and we haven't reached that stage yet.
I’ll throw out a mention for my project Plandex[1], which predates Claude Code and combines models from multiple providers (Anthropic, Google, and OpenAI by default). It can also use open source and local models.
It focuses especially on large context and longer tasks with many steps.
1 - https://github.com/plandex-ai/plandex
Have you measured and compared your agent's efficiency and success rate against anything? I am curious. It would help people decide; there are many coding agents now.
Does Plandex have an equivalent to sub-agents/swarm or whatever you want to call it?
I’ve found getting CC to farm out to subagents to be the only way to keep context under control, but would love to bring in a different model as another subagent to review the work of the others.
At my work, here is a typical breakdown of time spent by work areas for a software engineer. Which of these areas can be sped up by using agentic coding?
05%: Making code changes
10%: Running build pipelines
20%: Learning about changed process and people via zoom calls, teams chat and emails
15%: Raising incident tickets for issues outside of my control
20%: Submitting forms, attending reviews and chasing approvals
20%: Reaching out to people for dependencies, following up
10%: Finding and reading up some obscure and conflicting internal wiki page, which is likely to be outdated
Really though? That’s only 2 hours per week writing code.
It’s true to say that time writing code is usually a minority of a developer’s work time, and so an AI that makes coding 20% faster may only translate to a modest dev productivity boost. But 5% time spent coding is a sign of serious organizational dysfunction.
This is what software engineers need to be more productive:
- Agentic DevOps: provisions infra and solves platform issues as soon as a support ticket is created.
- Agentic Technical Writer: one GenAI agent writes the docs and keeps the wiki up to date, while another 100 agents review it all and flag hallucinations.
- Agentic Manager: attends meetings, parses emails and logs 24x7 and creates daily reports, shares these reports with other teams, and manages the calendar of the developers to shield them from distractions.
- Agentic Director: spots patterns in the data and approves things faster, without the fear of getting fired.
- Agentic CEO: helps with decision-making, gives motivational speeches, and aligns vision with strategy.
- Agentic Pet: a virtual mascot you have to feed four times a day, Monday to Friday, from your office's IP address. Miss a meal and it dies, and HR gets notified. (This was my boss's idea)
You're not wrong, but it's a "dysfunction" that many successful tech companies have learned to leverage.
The reality is, most engineers spend far less than half their time writing new code. This is where the 80/20 principle comes into play. It's common for 80% of a company's revenue to come from 20% of its features. That core, revenue-generating code is often mature and requires more maintenance than new code. Its stability allows the company to afford what you call "dysfunction": having a large portion of engineers work on speculative features and "big bets" that might never see the light of day.
So, while it looks like a bug from a pure "coding hours" perspective, for many businesses, it's a strategic feature!
IMHO, the biggest impact LLMs have had in my day to day has not been agentic coding. For example, meeting summarisers are great, it means I sometimes can skip a call or join while doing other things and I still get a list of bullet points afterwards.
I can point at a huge doc for some API and get the important things right away, or ask questions of it. I can get it to review PRs so I can quickly get the gist of the changes before digging into the code myself.
For coding, I don't find agents boost my productivity that much where I was already productive. However, they definitely allow me to do things I was unable to before (or would have taken very long as I wasn't an expert) – for example my type signatures have improved massively, in places where normally I would have been lazy and typed as any I now ask claude to come up with some proper types.
I've had it write code for things that I'm not great at, like geometry, or dataviz. But these are not necessarily increasing my productivity, they reduce my reliance on libraries and such, but they might actually make me less productive.
I've been on embedded projects where several weeks of work were spent on changing one line of code. It's not necessarily organizational dysfunction. Sometimes it's getting the right data and the right deep understanding of a system, hardware/software interaction, etc, before you can make an informed change that affects thousands of people.
Unfortunately it is true with any org that is rapidly reducing its risk appetite. It is not dysfunctional; it is about balancing priorities at the org level. Risk is distributed very thinly across many people. Heard of the reinsurance business? A sort of similar thing happens in software development as well.
It doesn't if you have to manually check all that code. (Or even worse, you dump the code into a pull request and force someone else to manually check it - please do not do that.)
5% is pretty low but similar to what I have seen on low-performing teams at 10K+ employee multinationals. This would also be why the vast majority of software today is bug-ridden garbage that runs slower than the software we were using 20 years ago.
Agentic coding will not fix these systemic issues caused by organizational dysfunction. Agentic coding will allow the software created by these companies to be rewritten from scratch for 1/100th the cost with better reliability and performance, though.
The resistance to AI adoption inside corporations that operate like this is intense and will probably intensify.
It takes a combination of external competitive pressure, investor pressure, attrition, PE takeovers, etc., to grind down internal resistance, which takes years or decades depending on the situation.
Cheaper yes. More reliable? Absolutely not. Not with today’s models at least.
"10% running build pipelines + 20% submitting forms" vs 5% making code changes?
Are you in a heavily regulated industry or a dysfunctional organization?
Most big tech companies optimize their build pipelines a lot to reduce commit-to-deploy time (or the validation/test process), which keeps engineers focused on the same task while the problem/solution is fresh.
How about you find out for yourself? Keep a chat window or an agent open and ask it how it could help with your tasks. My git commit messages and GitLab tickets have been written by AI for a year now, way better than anything I would half-heartedly do on my side -- really good commit messages too. Claude even reminds me to create/update the ticket.
I find the commit messages written by AI often inadequate, as they mostly just describe what is already in the diff, but miss the background on why the change was needed, why this approach was chosen, etc. -- the important stuff...
Do you feed the LLM additional context for the commit message, or is it just summarising what’s in the commit? In the latter case, what’s the point? The reader can just get _their_ LLM to do a better job.
In the former case… I’m interested to hear how they’re better? Do you choose an agent with the full context of the changes to write the message, so it knows where you started, why certain things didn’t work? Or are you prompting a fresh context with your summary and asking it to make it into a commit message? Or something else?
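For the former case, a hypothetical prepare-commit-msg hook might look something like the sketch below. The endpoint URL, model name, and the .git/WHY note file are all illustrative assumptions, not anyone's actual setup; the only point is that the prompt carries the author's intent, not just the diff.

    #!/usr/bin/env python3
    # prepare-commit-msg hook: git passes the commit-message file path as argv[1].
    import pathlib, subprocess, sys
    import requests

    diff = subprocess.run(["git", "diff", "--cached"], capture_output=True, text=True).stdout
    why_file = pathlib.Path(".git/WHY")  # hypothetical free-form "why I'm doing this" note
    why = why_file.read_text() if why_file.exists() else ""

    prompt = (
        "Write a concise commit message. Lead with WHY the change was made, "
        "not what the diff already shows.\n\n"
        f"Author's note on intent:\n{why}\n\nStaged diff:\n{diff}"
    )
    # Any OpenAI-compatible chat endpoint works here (e.g. a local llama-server).
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    pathlib.Path(sys.argv[1]).write_text(resp.json()["choices"][0]["message"]["content"])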
We must have the same job! Generating code is a minuscule part of my job. We have the same level of organizational dysfunction. Mostly the work involves long investigations of customer bugs and long face-to-face calls with customers -- I'm only getting the stuff that stumped level 1 and level 2 support.
I actually tried to use Qwen3[1] to analyse customer cases and it was worse than useless at it.
[1] We can't use any online model as these bug reports contain large amounts of PII, customer data, etc.
Many of those things could be improved today without AI, but e.g. raising incident tickets for issues outside of your control could already come down to a suggestion that you just have to tick off.
Not saying we are there yet, but it's hard to imagine it's not possible.
It's probably messier than you think.
Raising incidents is not about suggestions. Things like build pipelines run into issues, someone from Ops needs to investigate, and maybe bump up some pods or apply some config changes on their end. Or some wiki page has conflicting information, and someone needs to update it with correct information after checking with the relevant other people, policies, and standards. The other people might be on vacation and their delegate misguides you, as they are not aware of the recently changed process.
Also, you're not making an argument against agentic coding, you're actually making an argument for it - you don't have time to code, so you need someone or something to code for you.
You should automate this, like I did. You're an engineer, no? Work around the digital bureaucracy.
- Running build pipelines: make a CLI tool to initiate them, monitor them, and notify you on completion/error (audio). This lets you chain multiple things; run it in a background terminal (see the sketch after this list).
- Learning about changed process and people via Zoom calls, Teams chat, and emails: pass logs of chats and emails to an LLM with a particular focus. Demand that call transcripts be published for that purpose (we use Meet).
- Raising incident tickets for issues outside of my control: automate this with an agent: allow it to access as much as needed, and guide it with short guidance -- all doable via Claude Code + a custom MCP.
- Submitting forms, attending reviews, and chasing approvals: the best thing to automate. They want forms? They will have forms. Chasing approvals: fire and forget + queue management, same.
- Reaching out to people for dependencies, following up: LLM as personal assistant is a classic job. Code this away.
- Finding and reading up on some obscure and conflicting internal wiki page, which is likely to be outdated: index all the data and put it into RAG, and let an agent dig deeper.
Most of the time you spend is on scheduling micro-tasks, switching between them, and maintaining an unspoken queue of checks across various SaaS frontends. Formalize micro-task management, automate the endpoints, and delegate it to your own selfware (an ad-hoc tool chain you vibe-coded for yourself only, tailored to your particular working environment).
I do (almost) all of this to automate away non-coding tasks. Life is fun again.
Hope this helps.
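As an example of the first item, a pipeline monitor can be as small as the sketch below (the status URL, token, and JSON field are placeholders for whatever your CI actually exposes):

    import time
    import requests

    def watch_pipeline(status_url, token, poll_secs=30):
        # Poll the CI status endpoint until the pipeline reaches a terminal state,
        # then ring the terminal bell so you hear it from another window.
        while True:
            status = requests.get(
                status_url, headers={"Authorization": f"Bearer {token}"}, timeout=10
            ).json()["status"]
            if status in ("success", "failed", "canceled"):
                print(f"\apipeline finished: {status}")
                return status
            time.sleep(poll_secs)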
In the short term, I think humans will be doing more of technical / product alignment, f2f calls (especially with non-technical folks), digesting illegible requirements, etc.
Coding, debugging builds, paperwork, doc chasing are all tasks that AI is improving on rapidly.
If 95% of employee time is work coordination, then executive leadership needs to downsize aggressively. This is a comical example of Brooks Law. Likewise, your clients or customers should be outraged and demand proof that pricing reflects business value and $0.95 of every dollar they give your company isn’t wasted.
There are so many problems in the world we need to stop cramming into the same bus.
I've been using it all day; it rips. Had to bump up the tool-calling limit in Cline to 100 and it just went through the app with no issues: got the mobile app built, worked through the linter errors... I wasn't even hosting it with the tool-call template on with the vLLM nightly -- just stock vLLM, and it understood the tool-call instructions just fine.
I'm interested in more info. Where do you host it? What's the hardware and the exact model? What t/s do you get? What is the codebase size? Etc., please. Thank you!
This suggests adding a `QWEN.md` in the repo for agent instructions.
Where are we with `AGENTS.md`? In a team repo it's getting ridiculous to have a duplicate markdown file for every agent out there.
Can't these hyper-advanced-super-duper tools discover what UNIX tools since circa 1970 knew, and just have a flag/configuration setting pointing them to the config file location? Excuse me if they already do :-)
In which case you'd have 1 markdown file and at least for the ones that are invoked via the CLI, just set up a Makefile entry point that leads them to the correct location.
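One low-tech way to get down to a single file, assuming AGENTS.md as the canonical copy and symlinks for whichever per-tool filenames your agents actually look for:

    ln -sf AGENTS.md CLAUDE.md
    ln -sf AGENTS.md GEMINI.md
    ln -sf AGENTS.md QWEN.md

Wrap those in a Makefile target or a setup script so fresh clones stay in sync.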
It would be funny to write conflicting instructions on these, and then unleash different coding agents on the same repo in parallel, and see which one of them first identifies the interference from the others and rewrites their instructions to align with its own.
Saw a repo recently with probably 80% of those.
> This node.js CLI tool processes CLAUDE.md files with hierarchical collection and recursive @-import resolution. Walks directory tree from current to ~/.claude/, collecting all CLAUDE.md files and processing them with file import resolution. Saves processed context files with resolved imports next to the original CLAUDE.md files or in a specific location (configurable).
I mostly use Claude Code, but every now and then go with Gemini, and having to maintain two sets of (hierarchical) instructions was annoying. And then opencode showed up, which meant yet another tool I wanted to try out and… well.
Now I have a git repo I add as a submodule and tell each tool to read through and create their own WHATEVER.md
Maybe there could be an agent that is in charge of this and is trained to automatically create a file for any new agent. It could even temporarily delete local copies of the MD files that no agents are using at the moment, to reduce the visual clutter when navigating the repo.
How does one keep up with all this change? I wish we could fast-forward like 2-3 years to see if an actual winner has landed by then. I feel like at that point there will be THE tool, with no one thinking twice about using anything else.
One keeps up with it, by keeping up with it. Folks keep up with latest social media gossip, the news, TV shows, or whatever interests them. You just stay on it.
Over the weekend I got to running Kimi K2; the last 2 days I have been driving Ernie 4.5-300B. I just finished downloading the latest Qwen3-235B this morning and started using it this evening. Tonight I'll start downloading this 480B; it might take 2-3 days with my crappy internet, and then I'll get to it.
Yeah, second this. I find model updates mildly interesting, but besides Grok 4 I haven’t even tried a new model all year.
It’s a bit like the media cycle. The more jacked in you are, the more behind you feel. I’m less certain there will be winners as much as losers, but for sure the time investment in staying up to date on these things will not pay dividends to the average HN reader.
I'm using claude code and making stuff. I'm keeping an eye and being aware of these new tools but I wait for the dust to settle and see if people switch or are still hyped after the hype dies down. X / HackerNews are good for keeping plugged in.
The underlying models are apparently profitable. Inference costs are in an exponential fall that makes Gordon Moore faint. OpenRouter shows Anthropic, AWS, and Google hosting Claude at the same rates; apparently nobody is price dumping.
That said, code+git+agent is the only acceptable way for technical staff to interact with AI. Tools with a sparkles button can go to hell.
https://a16z.com/llmflation-llm-inference-cost/
https://openrouter.ai/anthropic/claude-sonnet-4
We don't actually need a winner, we need 2-3-4 big, mature commercial contenders for the state of the art stuff, and 2-3-4 big, mature Open Source/open weights models that can be run on decent consumer hardware at near real-time speeds, and we're all set.
Sure, there will probably be a long tail, but the average programmer probably won't care much about those, just like they don't care about Erlang, D, MoonScript, etc.
Things will be moving faster in 2-3 years most likely. (The recursive self-improvement flywheel is only just starting to pick up momentum, and we’ll have much more LLM inference compute available.)
Figuring out how to stay sane while staying abreast of developments will be a key skill to cultivate.
I’m pretty skeptical there will be a single model with a defensible moat TBH. Like cloud compute, there is both economy of scale and room for multiple vendors (not least because bigco’s want multiple competing bids).
I'm actually waiting for something different - a "good enough" level for programming LLMs:
1. Where they can be used as autocompletion in an IDE at speeds comparable with Intellisense
2. And where they're good enough to generate most code reliably, while using a local LLM
3. While running on hardware costing in total max 2000€
4. And definitely with just a few "standard" pre-configured Open Source/open weights LLMs where I don't have to become an LLM engineer to figure out the million knobs
I have no clue how Intellisense works behind the scenes, yet I use it every day. Same story here.
It depends on the level of 'keeping up'. I follow the news, but it's impossible to dip your toe in every new model. Some stick around, but the majority pass through.
> Mass adoption is rarely a quality indicator. I wouldn't want to pay for the mainstream VHS model(s) when I could use Betamax (perhaps even cheaper).
Oh, but it is.
Imagine you were there, back in those days. A few years after VHS won, you couldn't find your favorite movies on Betamax. There was a lot more hardware available for VHS, and it was cheaper.
Mass adoption largely wins out over almost everything.
Case in point from software: Visual Basic, PHP, Javascript, Python (though Python is slightly more technically sound than the other ones), early MySQL, MongoDB, early Windows, early Android.
Why do you believe so? The leaderboard is highly unstable right now and there are no signs of that subsiding. I would expect the same situation 2-3 years forward, just possibly with somewhat different players.
I tried using the "fp8" model through Hyperbolic, but I question if it was even that model. It was basically useless through Hyperbolic.
I downloaded the 4bit quant to my mac studio 512GB. 7-8 minutes until first tokens with a big Cline prompt for it to chew on. Performance is exceptional. It nailed all the tool calls, loaded my memory bank, and reasoned about a golang code base well enough to write a blog post on the topic: https://convergence.ninja/post/blogs/000016-ForeverFantasyFr...
Writing blog posts is one of the tests I use for these models. It is a very involved process including a Q&A phase, drafting phase, approval, and deployment. The filenames follow a certain pattern. The file has to be uploaded to s3 in a certain location to trigger the deployment. It's a complex custom task that I automated.
Even the 4-bit model was capable of this, but it was incapable of actually working on my code, preferring to hallucinate methods that would be convenient rather than admitting it didn't know what it was doing. This is the 4-bit "lobotomized" model though. I'm excited to see how it performs at full power.
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
Important layers are in 8bit, 6bit. Less important ones are left in 2bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
What would be a reasonable throughput level to expect from running 8-bit or 16-bit versions on 8x H200 DGX systems?
You should be able to get 40 to 50 tokens/s at a minimum. High-throughput mode + a small draft model might get you 100 tokens/s generation.
If you don't have enough RAM, then < 1 token / s
Fine-tuned rather than created from scratch, though.
Is this the one? https://github.com/ggml-org/llama.vscode It seems to be built for code completion rather than outright agent mode.
I was surprised in the AlphaEvolve paper how much they relied on the flash model because they were optimizing for speed of generating ideas.
Untrue. The big important issue for LLMs is hallucination, and making your model bigger does little to solve it.
Increasing model size is a technological dead end. The future of advanced LLMs is not that.
Very interesting. Any subs or threads you could recommend/link to?
Thanks
https://github.com/QwenLM/qwen-code
https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
Actually it is mentioned in the page:
It would be great if it started supporting other models natively too. That wouldn't require people to fork.
You set the environment variable ANTHROPIC_BASE_URL to an OpenAI-compatible endpoint and ANTHROPIC_AUTH_TOKEN to the API token for the service.
I used Kimi-K2 on Moonshot [1] with Claude Code with no issues.
There's also Claude Code Router and similar apps for routing CC to a bunch of different models [2].
[1]: https://platform.moonshot.ai/
[2]: https://github.com/musistudio/claude-code-router
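For example (the base URL is a placeholder; use whatever endpoint your provider documents for Claude Code):

    export ANTHROPIC_BASE_URL="https://api.example.com/anthropic"
    export ANTHROPIC_AUTH_TOKEN="sk-your-key-here"
    claude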
Library to help with this. Not great that a library is necessary, but useful until this converges to a standard (if it ever does).
Obsession?
This should be written on the coffin of full stack development.
Assuming it doesn't all implode due to a lack of profitability, it should be obvious.
As Heraclitus said "The only constant in life is change"
(and maybe Emacs)
A look at fandom wikis is humbling. People will persist and go very deep into stuff they care about.
In this case: Read a lot, try to build a lot, learn, learn from mistakes, compare.