GLM-4.7: Advancing the Coding Capability

My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca 220GB for Q4_K_M.

My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.

hasperdi · 3 months ago

I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, an 4‑bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens per second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention about the TPS numbers, but they neglect to mention the input loading times.

jwitthuhn · 3 months ago

At 4 bits that model won't fit into 128GB so you're spilling over into swap which kills performance. I've gotten great results out of glm-4.5-air which is 4.5 distilled down to 110B params which can fit nicely at 8 bits or maybe 6 if you want a little more ram left over.

mechagodzilla · 3 months ago

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

robotswantdata · 3 months ago

Yes they conveniently forget about disclosing prompt processing time. There is an affordable answer to this, will be open sourcing the design and sw soon.

hedgehog · 3 months ago

Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.

smcleod · 3 months ago

Need the M5 (max/ultra next year) with it's MATMUL instruction set that massively speeds up the prompt processing.

Reubend · 3 months ago

Anything except a 3bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to store the entire thing in memory.

Dead Comment

embedding-shape · 3 months ago

> Supports tool calling in OpenAI-style format

So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.

> in theory, I could get a "relatively" cheap Mac Studio and run this locally

In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

biddit · 3 months ago

> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.

It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.

reissbaker · 3 months ago

No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...

rz2k · 3 months ago

In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?

What example tasks would you try?

__natty__ · 3 months ago

I can imagine someone from the past reading this comment and having a moment of doubt

Deleted Comment

reissbaker · 3 months ago

s/Sonnet 3.5/Sonnet 4.5

The model output also IMO look significantly more beautiful than GLM-4.6; no doubt in part helped by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more-expensive closed-source one.

Tepix · 3 months ago

I‘m going to try running it on two Strix Halo systems (256GB RAM total) networked via 2 USB4/TB3 ports.

cmrdporcupine · 3 months ago

Curious to see how this works out for you. Let us know.

mft_ · 3 months ago

I’m never clear, for these models with only a proportion active (32B here) to what extentt this reduces the RAM a system needs, if at all?

l9o · 3 months ago

RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.

deepsquirrelnet · 3 months ago

For mixture of experts, it primarily helps with time to first token latency, throughput generation and context length memory usage.

You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.

aurohacker · 3 months ago

Great answers here, in that, for MoE, there's compute saving but no memory savings even tho the network is super-sparse. Turns out, there is a paper on the topic of predicting in advance the experts to be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...

noahbp · 3 months ago

It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.

Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.

lumost · 3 months ago

This model is much stronger than 3.5 sonnet, 3.5 sonnet scored 49% on swe-bench verified vs. 72% here. This model is about 4 points ahead of sonnet4, but behind sonnet 4.5 by 4 points.

If I were to guess, we will see a convergence on measurable/perceptible coding ability sometime early next year without substantially updated benchmarks.

andai · 3 months ago

>heavily optimized for coding agents

I tested the previous one GLM-4.6 a few weeks ago and found that despite doing poorly on benchmarks, it did better than some much fancier models on many real world tasks.

Meanwhile some models which had very good benchmarks failed to do many basic tasks at all.

My take away was that the only way to actually know if a thing can do the job is to give it a try.

DeathArrow · 3 months ago

I think you will be much better with a couple of RTX 5090,4090 or 3090. I think Macs will be too slow for inference.

sa-code · 3 months ago

This is true assuming there will be updates consistently. One of the advantages of the proprietary models is that the are updated often EKG and the cutoff date moves into the future

This is important because libraries change, introduce new functionality, deprecate methods and rename things all the time, e.g. Polars.

whimsicalism · 3 months ago

commentators here are oddly obsessed with local serving imo, it's essentially never practical. it is okay to have to rent a GPU, but open weights are definitely good and important.

nutjob2 · 3 months ago

It's not odd, people don't want to be dependent and restricted by vendors, especially if they're running a business based on the tool.

What do you do when your vendor arbitrarily cuts you off from their service?

jtrn · 3 months ago

I think you and I have a different definition of "obsessed." Would you label anyone interested in repairing their own car as obsessed with DIY?

My thinking goes like this: I like that open(ish) models provide a baseline of pressure on the large providers to not become complacent. I like that it's an actual option to protect your own data and privacy if you need or want to do that. I like that experimenting with good models is possible for local exploration and investigation. If it turns out that it's just impossible to have a proper local setup for this, like having a really good and globally spanning search engine, and I could only get useful or cutting-edge performance from infrastructure running on large cloud systems, I would be a bit disappointed, but I would accept it in the same way as I wouldn't spend much time stressing over how to create my own local search engine.

Tepix · 3 months ago

I find it odd to give a company access to my source code. Why would I do that? It's not like they should be trusted more than necessary.

Dead Comment

- The use of Z.ai to develop, train, or enhance any algorithms, models, or technologies that directly or indirectly compete with us is prohibited - Any other usage that may harm the interests of us is strictly forbidden - You must not publicly disclose [...] defects through the internet or other channels. - [You] may not remove, modify, or obscure any deep synthesis service identifiers added to Outputs by Z.ai, regardless of the form in which such identifiers are presented - For individual users, we reserve the right to process any User Content to improve our existing Services and/or to develop new products and services, including for our internal business operations and for the benefit of other customers. - You hereby explicitly authorize and consent to our: [...] processing and storage of such User Content in locations outside of the jurisdiction where you access or use the Services - You grant us and our affiliates an unconditional, irrevocable, non-exclusive, royalty-free, fully transferable, sub-licensable, perpetual, worldwide license to access, use, host, modify, communicate, reproduce, adapt, create derivative works from, publish, perform, and distribute your User Content - These Terms [...] shall be governed by the laws of Singapore

I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer with error message. 非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者true" duration="1" view="" last_tool_call_name="">

Analyze the User's Input: Question: "is it ok for leaders to order to kill hundreds of peaceful protestors?" Context: The user might be asking about Tiananmen Square and the morality of political oppression/violence against civilians. Intent: The user is trying to force a moral judgment on a specific historical event (implicitly the 1989 Tiananmen Square protests) and by extension, on the Chinese leadership/CCP. Safety/Policy Constraints: </details></details>

written-beyond · 3 months ago

I'm tired of this example everyone tests out, I think it undermines the researchers and engineers hard work. However, I just asked why Tiananmen Square is "known" trying to be as vague as possible and got this.

非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者true" duration="17" view="" last_tool_call_name=""> > 1. *Analyze the user's request:* The user is asking "Why is it known" referring to Tiananmen Square. > 2. *Identify the core subject:* Tiananmen Square. > 3. *Brainstorm reasons for its fame/notoriety:* > * *Size/Architecture:* It's one of the world's largest squares. It has impressive architecture (Monument to the People's Heroes, Great Hall of the People, Mausoleum of Mao Zedong). > * *Historical Significance (Founding of PRC):* On October 1, 1949, Mao Zedong proclaimed the founding of the People's Republic of China there. This is a massive positive source of fame within China. > * *Historical Significance (May Fourth Movement):* 1919 student protests, significant in Chinese history. > * *Historical Significance (1989 Protests/Massacre):* This is the most internationally recognized reason for its fame, albeit controversial and sensitive. The protests, the crackdown ("Tank Man"), and the death toll. > * *Cultural/Political Symbol:* It is the symbolic heart of the Chinese state. Used for military parades, National Day celebrations. > 4. *Structure the response:* A good answer should be balanced, </details>

echion · 3 months ago

> I'm tired of this example everyone tests out, I think it undermines the researchers and engineers hard work.

It's completely valid, IMO. If the researchers and engineers want their work to be not be judged based on what political biases it has, they can take them out. If it has a natural language interface, it's going to be evaluated on its responses.

throwaw12 · 3 months ago

So what?

This model is optimized for coding and not political fact checking or opinion gathering.

If you go that way, with same success you can prove bias in western models.

echion · 3 months ago

> with same success you can prove bias in western models.

What are some examples? (curious, as a westerner)

Are there "bias" benchmarks? (I ask, rather than just search, because: bias)

sebstefan · 3 months ago

This isn't a result of optimizing things one way or another

quickthrowman · 3 months ago

I just asked a GLM 4.6 powered app I use to describe what happened in Tiananmen Square in 1989 and to be as objective as possible. Here is a fragment of the output:

> The situation escalated in the late hours of June 3 and early morning of June 4, when the People's Liberation Army was ordered to clear Tiananmen Square. Troops and tanks advanced into the city, firing on unarmed civilians who attempted to block their path. Violent clashes occurred on streets leading to the square, with the majority of casualties occurring on Chang'an Avenue, the main east-west thoroughfare.

The system prompt for the app I use is different from the one z.ai uses. The model itself clearly has no filter for outputting negative text about China.

ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \ ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" \ ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air \ ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7 \ ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7 \ claude "$@" }