Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these models.
The post-training methodology (Sec 3) is what really stands out to me. The idea of creating specialized 'expert models' for reasoning, agents, and chat, and then distilling their capabilities into a final unified model is a fascinating approach. It feels like a more structured way to solve the "jack of all trades, master of none" problem that can plague generalist models. Instead of just mixing all the data, they're essentially having a generalist learn from a committee of specialists.
A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that a single-stage RL process at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson. I've seen teams assume the opposite would be true. Also, the pragmatic choice to use an XML-like template for function calls to avoid JSON escaping hell (Fig 4) may be a small but brilliant engineering decision that makes a huge difference in practice. Wrangling escaped code inside JSON turns out to be a mess.
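The escaping point is easy to demonstrate. Here's a toy sketch; the `write_file` / `<tool_call>` names are invented for illustration and are not the paper's actual template (Fig 4 differs in its details):

```python
import json

# A code snippet an agent might pass as a tool-call argument.
code = 'print("hello world")\n'

# JSON-style call: the code must be escaped, so quotes and newlines
# become backslash sequences that models frequently get wrong.
json_call = json.dumps({"name": "write_file", "arguments": {"content": code}})

# XML-like call: the code is embedded verbatim between tags,
# so no escaping is needed at all.
xml_call = f"<tool_call>write_file\n<content>\n{code}</content>\n</tool_call>"

print(json_call)
print(xml_call)
```

The escaped form gets worse the more nested quoting the payload has (code that itself builds JSON strings is the degenerate case), while the XML-like form stays flat.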
The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I’d love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?
> Are all these post/mid-training tweaks important with abundant, verified, synthetic domain data?
No. Many are aimed at cleaning/aligning noisy, mixed-domain data. With abundant, high-quality domain data, you can skip most of the complexity and focus on direct SFT/RL on your corpus.
> Can a small team stick to scaling 2024-era best practices?
2024 was the year of SFT. I believe fitting reasoning traces to your final responses via RL is the technique du jour of 2025. Jumping from SFT to RL training might be the biggest gain here, if RL can be applied to your problem (e.g. math, coding, etc.).

The comment you're replying to is 100% AI-generated. How does obviously LLM-generated content continually make it to the front of HN, and why in God's name are you being downvoted for calling this out??
"...a fascinating approach..." (LLMs think everything is fascinating)
"...they're essentially having a generalist learn from a committee of specialists..." (analogies, analogies)
"...where APIs are undocumented, partial failures are common, and user input is full of ambiguity..." (typical AI rule of three template with semantically similar parameters that contribute nothing to the overall meaning)
I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been:
1. Claude is somewhat better at whole-codebase tasks, where you need to reason over a bunch of context and consider system interactions.
2. GLM-4.5 is somewhat better at being "honest" — i.e. I rarely see it doing the things Claude does like making broken tests pass by changing the test instead of fixing the bug.
Both are quite good though, and GLM-4.5 has found bugs that both Claude 4 Sonnet and 4.1 Opus have failed to catch. In general I think Claude wins a little more frequently on debugging tasks than GLM-4.5, but it's close.
Compared to GPT-5, both Claude and GLM feel like they're more consistent, although GPT-5 sometimes has long brilliant runs where it nails everything with subjectively higher code quality than either of the latter. However, once GPT-5 goes off the rails, it's hard to get it back on track, so it can be a bit frustrating to work with in comparison.
I just read your comment and decided to give GLM-4.5 a try in Kilocode. I'd been using Gemini CLI all day to try to resolve a tricky bug in some compiler code (a compiler for a subset of C that generates microcode for... a weird architecture, I'll leave it at that). GLM-4.5 zoomed in on the problem right away, a problem that had eluded Gemini CLI all day. Gemini had been leading me on a wild goose chase, implicating a function that turned out not to be the problem, and making all kinds of lame changes to it that it claimed would fix the bug (they never did, because the problem wasn't in that function).
Sometimes getting a second pair of eyes on a problem helps, and that's usually not a judgment on the smartness of the first pair of eyes. Seems like it also applies to coding agents.
I've had similarly good experiences with GLM-4.5 for smaller projects/requests. Unfortunately that did degrade with larger contexts, so I'm still treating it as a good fallback for Sonnet 4, rather than a full-blown replacement.
What’s your monthly bill (OpenRouter?) if I may ask? I have Claude Max and always on the lookout for alternatives, at least for the easier to solve problems.
I run a privacy-focused inference company, Synthetic [1], and I use our API of course :P I actually like GLM-4.5 enough that it's currently our default recommended model for new users. But yes, otherwise I'd use the official zai API most likely, or Fireworks. GLM-4.5-Air is quite good for a local model but GLM-4.5 is better; up to you if the tradeoff is worth it — there's definitely value in the data not ever leaving your machine, but it's not going to be as strong of a model.
Not OP. Chutes.ai charges $0.20 per 1M tokens. I don’t think it uses caching though because I ended up burning $30 in an hour or two. I had to move back to Claude Code.
So GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:
> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.
This matches with the calculation I did for GLM-4.5 (355B A32B):
In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
Out[14]: 355180214528
In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
Out[15]: 15441590528
In [16]: 339738624000 * 8 / 160 # parameters from activated experts
Out[16]: 16986931200.0
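Spelling the same arithmetic out end to end (same numbers as above; the 160-expert / 8-active routing split is from the calculation in In[16]), the reported 355B / A32B figures fall out directly:

```python
total_params  = 356_732_107_008  # full checkpoint size
embed_params  = 775_946_240      # token embeddings; output layer is the same size
expert_params = 339_738_624_000  # all routed-expert weights
n_experts, n_active = 160, 8

# "355B" total: everything except the embeddings and the output layer
reported_total = total_params - 2 * embed_params

# Always-on parameters: attention, shared/dense layers, MTP, etc.
dense_active = total_params - expert_params - 2 * embed_params

# Only 8 of the 160 routed experts fire per token
routed_active = expert_params * n_active // n_experts

# "A32B" active parameter count
reported_active = dense_active + routed_active
print(reported_total, reported_active)  # 355180214528 32428521728
```

So the active count lands at ~32.4B, matching the "A32B" label once embeddings and the output layer are excluded.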
Meanwhile, GPT OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:
> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.
And Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.
Why is there no standard for counting? And which approach is more accurate?
I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.
For activated parameters: All unembedding parameters are used in every inference step during token generation, but only one column of the embeddings is used (if done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
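A toy illustration of that asymmetry (sizes made up, and a tied embedding/unembedding matrix assumed purely for brevity): generating one token reads a single embedding row, but producing logits reads every row.

```python
import numpy as np

vocab, d_model = 50_000, 32  # toy sizes, not any real model's config
rng = np.random.default_rng(0)
embed = rng.standard_normal((vocab, d_model)).astype(np.float32)

# Embedding: one generated token touches exactly one row -> d_model floats read.
token_id = 123
x = embed[token_id]
embed_floats_read = d_model

# Unembedding (reusing the tied matrix here): the logits need a matmul
# against ALL vocab rows -> vocab * d_model floats read per token.
logits = embed @ x
unembed_floats_read = vocab * d_model

print(unembed_floats_read // embed_floats_read)  # 50000: a vocab-sized gap
```

Which is exactly why counting the unembedding as "active" but not the embedding (the GPT OSS convention) tracks per-token memory traffic most closely.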
Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level.
Current cloud-based models are fun and useful, but for a tool that is (or will be) so core to the developer experience, I want to be able to run it locally.
This will be essential for open source; otherwise, open-source development will become unsustainable. I'm actually a little more optimistic: I think we'll get something beyond Sonnet 4 level within two years, runnable on a $2,000 machine.
This feels like the first open model that doesn't require significant caveats when compared to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results, and to see how it does on Aider's LLM Leaderboard.
The sheer number of things "they observed" in this paper that could be whole papers in themselves is astounding! Lots of great stuff in here around training processes and data collection+synthesis.
Does anyone have any background information on the authors? Have they published similarly impressive works in the past?
Can a small team working on ASI or a domain-specific model stick with scaling a 2024-era best-practices training stack? Or will they miss out on massive improvements?
edit: looks like I'm not the first person to notice this about this poster, either. https://news.ycombinator.com/item?id=44279662
I think we have a duty to call this out, before the web becomes ridden with slop.
(Re: the other post you linked to: it is entirely my own thoughts.)
> ...are pure gold for anyone working in this space...
Specifically OpenAI
It felt interesting and informative to me, but I didn’t verify any of it.
Good eye btw.
1: https://github.com/synthetic-lab/octofriend
DeepSeek R1 (does the high-level planning) combined with Qwen3 480B (does the low-level coding), or whatever is available from the Qwen Code APIs.
It's working great. It solves 99.99% of problems on its own.
The separation isn't very good in aider, so I plan to build my own tool later for a better workflow.
DeepSeek R1 + Qwen3, along with Gemini 2.5 Pro, is close enough that I don't see any point in Claude anymore.
1: https://synthetic.new
correct activated params: GPT OSS (counts the unembedding, not the embeddings)
undercount activated params: GLM-4.5 (omits the output layer)
overcount activated params: Qwen3 (counts the full embedding matrix)