What stood out to me is how much of gpt-oss’s “newness” isn’t about radical architectural departures, but about a careful layering of well-understood optimizations—RoPE, SwiGLU, GQA, MoE—with some slightly unusual choices (tiny sliding-window sizes, few large experts instead of many small ones, per-head attention sinks).
The MXFP4 quantization detail might be the sleeper feature here. Getting 20B running on a 16 GB consumer card, or 120B on a single H100/MI300X without multi-GPU orchestration headaches, could be a bigger enabler for indie devs and researchers than raw benchmark deltas. A lot of experimentation never happens simply because the friction of getting the model loaded is too high.
One open question I’m curious about: given gpt-oss’s design bias toward reasoning (and away from encyclopedic recall), will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work? That separation could change how we architect systems that wrap these models.
> will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work?
My bet's on the former winning outright. It's very hard to outrun a good search engine, LLMs are inherently lossy so internal recall will never be perfect, and if you don't have to spend your parameter budget encoding information then you get to either spend that budget on being a much better reasoner, or you shrink the model and make it cheaper to run for the same capability. The trade-off is a more complex architecture, but that's happening anyway.
> that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work
I would say this isn't exclusive to the smaller OSS models. But rather a trait of Openai's models all together now.
This becomes especially apparent with the introduction of GPT-5 in ChatGPT. Their focus on routing your request to different modes and searching the web automatically (relying on an Agentic workflows in the background) is probably key to the overall quality of the output.
So far, it's quite easy to get their OSS models to follow instructions reliably. Qwen models has been pretty decent at this too for some time now.
I think if we give it another generation or two, we're at the point of having compotent enough models to start running more advanced agentic workflows. On modest hardware. We're almost there now, but not quite yet
Maybe not a architectural innovation, but both the Harmony format and splitting things into system/developer/user messages instead of just system/user messages, are both novel (in the released weights world) and different enough that I'm still in the process of updating my libraries so I can run fair benchmarks...
MXFP4's mixed precision approach (4-bit for weights, higher precision for KV cache) actually offers better accuracy/size tradeoffs than competing quantization methods like GPTQ or AWQ, which is why it enables these impressive resource profiles without the typical 4-bit degradation.
You seem to be conflating when you first heard about those techniques and when they first appeared. None of those techniques were first seen in Qwen, nor this specific combination of techniques.
Qwen3 is substantially better in my local testing. As in, adheres to the prompt better (pretty much exactly for the 32B parameter variant, very impressive) and is more organic sounding.
In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.
So presumably, this comes down to...
- training technique or data
- dimension
- lower number of large experts vs higher number of small experts
If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.
That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If oss was actually as good as o3 mini and in some cases o4 mini outside benchmarks, that would undermine openai's api offer for gpt 5 nano and maybe mini too.
Edit: found this analysis, it's on the HN frontpage right now
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
Yes. I tried to ask oss-gpt to ask me a riddle. The response was absurd. Came up with a nonsensical question, then told me the answer. The answer was a four letter “word” that wasn’t actually a real word.
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I said that’s not a word and you gave me the answer already, no fun, it said
Qwen3 32B is a dense model, it uses all its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one.
In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
When people talk about sparse or dense models, are they spare or dense matrices in the conventional numerical linear algebra sense? (Something like a csr matrix?)
> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.
yesterday, i signed up for qwen3-coder-plus. It fails 4/10 "diff" edit format in various code editing tools i use.
Gemini Pro 2.5 with diff fenced edit format, rarely fails. So i don't see this Qwen3 hype unless i am using wrong edit format, can anyone tell me which edit format will work better with Qwen3?
I'm running a30-a3b-instruct q6 quant on exllamav2 and checked few simple tasks in roo and cline. Prompt adherence, tool calling and file changing worked flawlessly
Wow, Sebastian Raschk's blog articles are jewels - much appreciated.
I use the get-oss and qwen3 models a lot (smaller models locally using Ollama and LM Studio) and commercial APIs for the full size models.
For local model use, I get very good results with get-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.
Until about three years ago, I have always understood neural network models (starting in the 1980s), GAN, Recurrent, LSTM, etc. well enough to write implementations. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschk's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).
For me it is the opposite. I'm shocked by how simple transformer based models and how small the architectural differences are between the latest models. Almost nothing has changed since late 2023.
From my experience, qwen3-coder is way better. I only have gpt-oss:20b installed to make a few more tests but I give it a program to make a summary of what it does and qwen3 just works in a few seconds, while gpt-oss was cancelled after 5 minuts... doing nothing.
So I just use qwen3. Fast and great ouput. If for some reason I don't get what I need, I might use search engines or Perplexity.
I have a 10GB 3080 and Ryzen 3600x with 32gb of RAM.
Qwen3 coder 480B is quite good and on par with Sonnet 4. It’s the first time I realized the Chinese models are probably going to eclipse US-based models pretty soon, at least for coding.
Where do you use qwen3 480b from, I'm not even seeing it on Openrouter. EDIT nm, openrouter is just calling it qwen3-coder-- when I click for more info it shows it's Qwen3-Coder-480B-A35B-Instruct. And it's one of their free models. Nice
I've been using lightly gpt-oss-20b but what I've found is that for smaller (single sentence) prompts it was easy enough to have it loop infinitely. Since I'm running it with llama.cpp I've set a small repetition penalty and haven't encountered those issues since (I'm using it a couple of times a day to analyze diffs, so I might have just gotten lucky since)
I had the same issue with other models where they would loop repeating the same character, sentence or paragraph indefinitely. Turns out the context size some tools set by default is 2k and this is way too small.
I’ve been using the ollama version (uses about 13 Gb RAM on macOS) and haven’t had that issue yet. I wonder if that’s maybe an issue of the llama.cpp port?
I'm still in awe that a local 3090 gpu was able to run the qwen3 coder instruct 30b-a3b exl3 q6 and...
Was able to create a sample page, tried starting a server, recognising a leftover server was running, killing it (and forced a prompt for my permission), retrying and finding out it's ip for me to open in the browser.
This isn't a demo anymore. That's actually very useful help for interns/juniors already.
I find it interesting that the architectures of modern open weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
My guess is that at LLM scale, you really can't try to hyperparameter tune — it's just too expensive. You probably have to do some basic testing of different architectures, settle on one, and then figure out how to make best use of it (data and RL).
Good point. LLMs lower the barrier to entry if someone has enough resources because those architectures are more robust to tweaks given one throws enough compute and data at them. You can even violate scaling laws and still get a good model (like Llama 3 showed back then)
> At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96.)
Is it bad? It was trained on synthetic data with emphasis on coding and scientific thinking. Good on my opinion, that's what it can be used for. Not as universal do it all model.
The MXFP4 quantization detail might be the sleeper feature here. Getting 20B running on a 16 GB consumer card, or 120B on a single H100/MI300X without multi-GPU orchestration headaches, could be a bigger enabler for indie devs and researchers than raw benchmark deltas. A lot of experimentation never happens simply because the friction of getting the model loaded is too high.
One open question I’m curious about: given gpt-oss’s design bias toward reasoning (and away from encyclopedic recall), will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work? That separation could change how we architect systems that wrap these models.
My bet's on the former winning outright. It's very hard to outrun a good search engine, LLMs are inherently lossy so internal recall will never be perfect, and if you don't have to spend your parameter budget encoding information then you get to either spend that budget on being a much better reasoner, or you shrink the model and make it cheaper to run for the same capability. The trade-off is a more complex architecture, but that's happening anyway.
I would say this isn't exclusive to the smaller OSS models. But rather a trait of Openai's models all together now.
This becomes especially apparent with the introduction of GPT-5 in ChatGPT. Their focus on routing your request to different modes and searching the web automatically (relying on an Agentic workflows in the background) is probably key to the overall quality of the output.
So far, it's quite easy to get their OSS models to follow instructions reliably. Qwen models has been pretty decent at this too for some time now.
I think if we give it another generation or two, we're at the point of having compotent enough models to start running more advanced agentic workflows. On modest hardware. We're almost there now, but not quite yet
They basically cloned Qwen3 on that, before adding the few tweaks you mention afterwards.
Oh, come on! GPT4 was rumoured to be an MoE well before Qwen even started releasing models. oAI didn't have to "clone" anything.
In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.
So presumably, this comes down to...
- training technique or data
- dimension
- lower number of large experts vs higher number of small experts
Edit: found this analysis, it's on the HN frontpage right now
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
https://x.com/jxmnop/status/1953899426075816164
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I said that’s not a word and you gave me the answer already, no fun, it said
“I do not have access to confirm that word.”
for example, i was using deep seek webui and getting decent on point answers but it simply does not have latest data.
So, while Deep Seek R1 might be better model than Grok3 or even Grok4, it not having access to "twitter data" basically puts it behind.
Same is case with OpenAI, if OpenAI has access to fast data from github, it can help with bugfixs which claude/gemini2.5 pro can't.
model can be smarter but if it does not have the data to base its inference upon it's useless.
In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
When people talk about sparse or dense models, are they spare or dense matrices in the conventional numerical linear algebra sense? (Something like a csr matrix?)
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.
sqrt(120*5) ~= 24
GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model
Gemini Pro 2.5 with diff fenced edit format, rarely fails. So i don't see this Qwen3 hype unless i am using wrong edit format, can anyone tell me which edit format will work better with Qwen3?
https://aider.chat/docs/more/edit-formats.html
gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)
Qwen3 32B - 65 tok/sec
Qwen3 30B-A3B - 150 tok/sec
(all at 4-bit)
Which model, inference software and hardware are you running it on?
The 30BA3B variant flies on any GPU.
I use the get-oss and qwen3 models a lot (smaller models locally using Ollama and LM Studio) and commercial APIs for the full size models.
For local model use, I get very good results with get-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.
Until about three years ago, I have always understood neural network models (starting in the 1980s), GAN, Recurrent, LSTM, etc. well enough to write implementations. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschk's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).
So I just use qwen3. Fast and great ouput. If for some reason I don't get what I need, I might use search engines or Perplexity.
I have a 10GB 3080 and Ryzen 3600x with 32gb of RAM.
Qwen3-coder is amazing. Best I used so far.
I’d like to know how far the frontier models are from the local for agentic coding.
Asking because I'm looking for a good model that fits in 12GB VRAM.
Was able to create a sample page, tried starting a server, recognising a leftover server was running, killing it (and forced a prompt for my permission), retrying and finding out it's ip for me to open in the browser.
This isn't a demo anymore. That's actually very useful help for interns/juniors already.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
Deleted Comment
Tencent's hunyuan-turbos, another hybrid, is currently ranked at 22. https://arxiv.org/abs/2505.15431