GLM-5: From Vibe Coding to Agentic Engineering
TL;DR
Zhipu AI (Z.ai) released GLM-5, a 744B-parameter MoE model (40B active) targeting complex systems engineering and long-horizon agentic tasks. It is open-sourced under the MIT license and is best-in-class among open-source models on reasoning, coding, and agentic benchmarks.
Key Facts
- Parameters: 744B total, 40B active (MoE)
- Pre-training data: 28.5T tokens (up from 23T in GLM-4.5)
- Context window: 200K tokens (uses DeepSeek Sparse Attention)
- License: MIT
- Hardware: Trained entirely on Huawei Ascend chips (MindSpore framework) — no NVIDIA dependency
- Weights: HuggingFace / ModelScope
- Pricing (OpenRouter): ~$0.80-1.00/M input, ~$2.56-3.20/M output
Scaling from GLM-4.5
| | GLM-4.5 | GLM-5 |
|---|---|---|
| Total params | 355B | 744B |
| Active params | 32B | 40B |
| Pre-training tokens | 23T | 28.5T |
Key Technical Contributions
Slime — Asynchronous RL Infrastructure
- Novel async RL infrastructure for post-training
- Improves training throughput and efficiency
- Enables more fine-grained post-training iterations
- Addresses the challenge of deploying RL at scale for LLMs
DeepSeek Sparse Attention (DSA)
GLM-5 integrates DSA, introduced in the DeepSeek-V3.2 paper, to make 200K context affordable.
The problem: Standard attention is O(L²) — every token attends to every other token. But in practice >90% of attention weights are near-zero, and which tokens matter varies per input and per head.
Two-stage pipeline:
Lightning Indexer — a cheap FP8 scoring module that quickly estimates token importance:
- Uses multiple small projection heads (dimension d^I << d)
- Scoring formula:
  I_{t,s} = sum_{j=1}^{H_I} w_{t,j}^I * ReLU(q_{t,j}^I · k_s^I)
- Multiple indexer heads with learned weights, combined via ReLU activation
- Runs in FP8 for speed — acts as a fast "pre-filter"
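The scoring formula above can be sketched with NumPy. The shapes, names, and random inputs here are illustrative assumptions, not GLM-5's actual implementation:

```python
import numpy as np

def indexer_scores(q_idx, k_idx, head_w):
    """Lightning Indexer sketch:
    I[t, s] = sum_j head_w[t, j] * ReLU(q_idx[t, j] . k_idx[s])

    q_idx:  (L, H_I, d_I) small per-head indexer queries
    k_idx:  (L, d_I)      one shared indexer key per token
    head_w: (L, H_I)      learned per-token head weights
    Returns an (L, L) importance matrix (FP32 here; the real
    indexer runs in FP8 for speed).
    """
    dots = np.einsum("thd,sd->ths", q_idx, k_idx)                  # q . k for all (t, j, s)
    return np.einsum("th,ths->ts", head_w, np.maximum(dots, 0.0))  # ReLU, then sum over heads

L, H_I, d_I = 6, 2, 4
rng = np.random.default_rng(0)
q_idx = rng.normal(size=(L, H_I, d_I))
k_idx = rng.normal(size=(L, d_I))
head_w = rng.normal(size=(L, H_I))
scores = indexer_scores(q_idx, k_idx, head_w)
print(scores.shape)  # (6, 6)
```

Because d_I is much smaller than the model dimension and the math runs in FP8, this pre-filter is cheap even though it still scores all L×L pairs.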
Top-k Selection + Sparse Attention — full-precision attention on only the selected tokens:
- For each query token, picks only the top-k highest-scoring candidates
- In practice: 2,048 tokens selected per query across a 128K context window
  u_t = Attn(h_t, {c_s | I_{t,s} in Top-k(I_{t,:})})
- Reduces complexity from O(L²) to O(L*k)
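A toy causal version of the selection step, in NumPy. The function name, shapes, and single-head dot-product attention are simplifying assumptions for illustration:

```python
import numpy as np

def sparse_attention(h, scores, k):
    """Attend each query token only to its top-k indexer-scored
    predecessors (causal). h: (L, d) hidden states, scores: (L, L)
    indexer scores. Cost is L*k dot products instead of L*L.
    """
    L, d = h.shape
    out = np.zeros_like(h)
    for t in range(L):
        cand = np.argsort(scores[t, : t + 1])[-k:]  # Top-k(I_{t,:}) among tokens <= t
        logits = h[cand] @ h[t] / np.sqrt(d)        # full-precision attention, selected tokens only
        w = np.exp(logits - logits.max())           # stable softmax
        w /= w.sum()
        out[t] = w @ h[cand]
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=(8, 4))
scores = rng.normal(size=(8, 8))
out = sparse_attention(h, scores, k=3)
print(out.shape)  # (8, 4)
```

At the reported settings (k = 2,048 over a 128K window), L/k works out to 64, so the attention step itself does roughly 1/64th of the dense work, ignoring the indexer's own cost.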
Training procedure (two stages after continued pre-training):
- Dense Warm-up (1,000 steps): Model frozen, indexer trained via KL divergence against aggregated attention scores (L1-normalized attention weights as target)
- Sparse Training (15,000 steps): All parameters optimized jointly, indexer refined. LR=7.3e-6, 943.7B tokens processed
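The warm-up objective can be sketched as a per-query KL divergence between the frozen model's aggregated, L1-normalized attention weights and the indexer's softmax distribution. Names and shapes are illustrative assumptions:

```python
import numpy as np

def indexer_kl_loss(scores, attn_target, eps=1e-12):
    """Dense warm-up sketch: mean_t KL(attn_target[t] || softmax(scores[t])).

    scores:      (L, L) Lightning Indexer scores
    attn_target: (L, L) frozen model's attention weights, aggregated
                 over heads and L1-normalized per query (rows sum to 1)
    """
    z = scores - scores.max(axis=-1, keepdims=True)  # stable softmax over candidates
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    kl = (attn_target * (np.log(attn_target + eps) - np.log(p + eps))).sum(axis=-1)
    return float(kl.mean())

L = 5
uniform = np.full((L, L), 1.0 / L)
# Uniform scores give a uniform softmax, so the loss is ~0
print(indexer_kl_loss(np.zeros((L, L)), uniform))
```

Only the indexer receives gradients during this stage; the model itself stays frozen, so the indexer learns to mimic where the dense attention already looks.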
Why better than older sparse methods: Local windows miss long-range dependencies; fixed patterns (strided, block-sparse) can't adapt to content. DSA is dynamic and content-adaptive per head and per sample.
Practical impact: Up to 2x cost reduction for long-context inference with negligible quality loss. This is how GLM-5 offers 200K context affordably.
See the DeepSeek-V3.2 paper for full details.
Benchmark Highlights
Reasoning
- HLE (text-only): 30.5 (vs Claude Opus 4.5: 28.4, GPT-5.2: 35.4)
- HLE w/ Tools: 50.4 (vs Claude Opus 4.5: 43.4, GPT-5.2: 45.5)
- AIME 2026 I: 92.7
- GPQA-Diamond: 86.0
Coding
- SWE-bench Verified: 77.8 (vs Claude Opus 4.5: 80.9, GPT-5.2: 80.0)
- SWE-bench Multilingual: 73.3 (vs Claude Opus 4.5: 77.5)
- Terminal-Bench 2.0 (Claude Code): 56.2 / 61.1 verified (vs Claude Opus 4.5: 57.9)
- CyberGym: 43.2 (vs Claude Opus 4.5: 50.6)
Agentic
- BrowseComp w/ Context Manage: 75.9 (vs Claude Opus 4.5: 67.8)
- Vending Bench 2: $4,432 (vs Claude Opus 4.5: $4,967, Gemini 3.0 Pro: $5,478)
- tau2-Bench: 89.7 (vs Claude Opus 4.5: 91.6)
Notable Observations
- Best open-source model on most benchmarks, closing the gap with frontier closed models
- BrowseComp (with context management) is a standout — beats all closed models listed
- HLE with tools also beats closed models — strong tool-use capability
- Coding still slightly behind Claude Opus 4.5 and GPT-5.2 but competitive
- Vending Bench 2 (long-horizon planning): competitive with frontier, #1 open-source
- Can generate .docx, .pdf, .xlsx files directly — "Office" capability
- Compatible with Claude Code, OpenClaw, and other coding agents
- Supports non-NVIDIA chips: Huawei Ascend, Moore Threads, Cambricon, etc.