GLM-5: From Vibe Coding to Agentic Engineering
TL;DR
Zhipu AI (Z.ai) released GLM-5, a 744B-parameter MoE model (40B active) targeting complex systems engineering and long-horizon agentic tasks. It is open-sourced under the MIT license and is best-in-class among open-source models on reasoning, coding, and agentic benchmarks.
Key Facts
- Parameters: 744B total, 40B active (MoE)
- Pre-training data: 28.5T tokens (up from 23T in GLM-4.5)
- Context window: 200K tokens (uses DeepSeek Sparse Attention)
- License: MIT
- Hardware: Trained entirely on Huawei Ascend chips (MindSpore framework) — no NVIDIA dependency
- Weights: HuggingFace / ModelScope
- Pricing (OpenRouter): ~$0.80-1.00/M input, ~$2.56-3.20/M output
Scaling from GLM-4.5
| | GLM-4.5 | GLM-5 |
|---|---|---|
| Total params | 355B | 744B |
| Active params | 32B | 40B |
| Pre-training tokens | 23T | 28.5T |
Key Technical Contributions
Slime — Asynchronous RL Infrastructure
- Novel async RL infrastructure for post-training
- Improves training throughput and efficiency
- Enables more fine-grained post-training iterations
- Addresses the challenge of deploying RL at scale for LLMs
DeepSeek Sparse Attention (DSA)
GLM-5 integrates DSA, introduced in the DeepSeek-V3.2 paper, to make 200K context affordable.
The problem: Standard attention is O(L²) — every token attends to every other token. But in practice >90% of attention weights are near-zero, and which tokens matter varies per input and per head.
Two-stage pipeline:
Lightning Indexer — a cheap FP8 scoring module that quickly estimates token importance:
- Uses multiple small projection heads (dimension d^I << d)
- Scoring formula:
  I_{t,s} = sum_{j=1}^{H_I} w_{t,j}^I * ReLU(q_{t,j}^I · k_s^I)
- Multiple indexer heads with learned weights, combined via ReLU activation
- Runs in FP8 for speed — acts as a fast "pre-filter"
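The scoring formula above can be sketched with NumPy. The shapes, names, and random inputs here are illustrative assumptions, not GLM-5's actual implementation:

```python
import numpy as np

def indexer_scores(q_idx, k_idx, head_w):
    """Lightning Indexer sketch:
    I[t, s] = sum_j head_w[t, j] * ReLU(q_idx[t, j] . k_idx[s])

    q_idx:  (L, H_I, d_I) small per-head indexer queries
    k_idx:  (L, d_I)      one shared indexer key per token
    head_w: (L, H_I)      learned per-token head weights
    Returns an (L, L) importance matrix (FP32 here; the real
    indexer runs in FP8 for speed).
    """
    dots = np.einsum("thd,sd->ths", q_idx, k_idx)                  # q . k for all (t, j, s)
    return np.einsum("th,ths->ts", head_w, np.maximum(dots, 0.0))  # ReLU, then sum over heads

L, H_I, d_I = 6, 2, 4
rng = np.random.default_rng(0)
q_idx = rng.normal(size=(L, H_I, d_I))
k_idx = rng.normal(size=(L, d_I))
head_w = rng.normal(size=(L, H_I))
scores = indexer_scores(q_idx, k_idx, head_w)
print(scores.shape)  # (6, 6)
```

Because d_I is much smaller than the model dimension and the math runs in FP8, this pre-filter is cheap even though it still scores all L×L pairs.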
Top-k Selection + Sparse Attention — full-precision attention on only the selected tokens:
- For each query token, picks only the top-k highest-scoring candidates
- In practice: 2,048 tokens selected per query across a 128K context window
  u_t = Attn(h_t, {c_s | I_{t,s} in Top-k(I_{t,:})})
- Reduces complexity from O(L²) to O(L*k)
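A toy causal version of the selection step, in NumPy. The function name, shapes, and single-head dot-product attention are simplifying assumptions for illustration:

```python
import numpy as np

def sparse_attention(h, scores, k):
    """Attend each query token only to its top-k indexer-scored
    predecessors (causal). h: (L, d) hidden states, scores: (L, L)
    indexer scores. Cost is L*k dot products instead of L*L.
    """
    L, d = h.shape
    out = np.zeros_like(h)
    for t in range(L):
        cand = np.argsort(scores[t, : t + 1])[-k:]  # Top-k(I_{t,:}) among tokens <= t
        logits = h[cand] @ h[t] / np.sqrt(d)        # full-precision attention, selected tokens only
        w = np.exp(logits - logits.max())           # stable softmax
        w /= w.sum()
        out[t] = w @ h[cand]
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=(8, 4))
scores = rng.normal(size=(8, 8))
out = sparse_attention(h, scores, k=3)
print(out.shape)  # (8, 4)
```

At the reported settings (k = 2,048 over a 128K window), L/k works out to 64, so the attention step itself does roughly 1/64th of the dense work, ignoring the indexer's own cost.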
Training procedure (two stages after continued pre-training):
- Dense Warm-up (1,000 steps): Model frozen, indexer trained via KL divergence against aggregated attention scores (L1-normalized attention weights as target)
- Sparse Training (15,000 steps): All parameters optimized jointly, indexer refined. LR=7.3e-6, 943.7B tokens processed
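The warm-up objective can be sketched as a per-query KL divergence between the frozen model's aggregated, L1-normalized attention weights and the indexer's softmax distribution. Names and shapes are illustrative assumptions:

```python
import numpy as np

def indexer_kl_loss(scores, attn_target, eps=1e-12):
    """Dense warm-up sketch: mean_t KL(attn_target[t] || softmax(scores[t])).

    scores:      (L, L) Lightning Indexer scores
    attn_target: (L, L) frozen model's attention weights, aggregated
                 over heads and L1-normalized per query (rows sum to 1)
    """
    z = scores - scores.max(axis=-1, keepdims=True)  # stable softmax over candidates
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    kl = (attn_target * (np.log(attn_target + eps) - np.log(p + eps))).sum(axis=-1)
    return float(kl.mean())

L = 5
uniform = np.full((L, L), 1.0 / L)
# Uniform scores give a uniform softmax, so the loss is ~0
print(indexer_kl_loss(np.zeros((L, L)), uniform))
```

Only the indexer receives gradients during this stage; the model itself stays frozen, so the indexer learns to mimic where the dense attention already looks.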
Why better than older sparse methods: Local windows miss long-range dependencies; fixed patterns (strided, block-sparse) can't adapt to content. DSA is dynamic and content-adaptive per head and per sample.
Practical impact: Up to 2x cost reduction for long-context inference with negligible quality loss. This is how GLM-5 offers 200K context affordably.
See the DeepSeek-V3.2 paper for full details.
Benchmark Highlights
Reasoning
- HLE (text-only): 30.5 (vs Claude Opus 4.5: 28.4, GPT-5.2: 35.4)
- HLE w/ Tools: 50.4 (vs Claude Opus 4.5: 43.4, GPT-5.2: 45.5)
- AIME 2026 I: 92.7
- GPQA-Diamond: 86.0
Coding
- SWE-bench Verified: 77.8 (vs Claude Opus 4.5: 80.9, GPT-5.2: 80.0)
- SWE-bench Multilingual: 73.3 (vs Claude Opus 4.5: 77.5)
- Terminal-Bench 2.0 (Claude Code): 56.2 / 61.1 verified (vs Claude Opus 4.5: 57.9)
- CyberGym: 43.2 (vs Claude Opus 4.5: 50.6)
Agentic
- BrowseComp w/ Context Manage: 75.9 (vs Claude Opus 4.5: 67.8)
- Vending Bench 2: $4,432 (vs Claude Opus 4.5: $4,967, Gemini 3.0 Pro: $5,478)
- tau2-Bench: 89.7 (vs Claude Opus 4.5: 91.6)
Notable Observations
- Best open-source model on most benchmarks, closing the gap with frontier closed models
- BrowseComp (with context management) is a standout — beats all closed models listed
- HLE with tools also beats closed models — strong tool-use capability
- Coding still slightly behind Claude Opus 4.5 and GPT-5.2 but competitive
- Vending Bench 2 (long-horizon planning): competitive with frontier, #1 open-source
- Can generate .docx, .pdf, .xlsx files directly — "Office" capability
- Compatible with Claude Code, OpenClaw, and other coding agents
- Supports non-NVIDIA chips: Huawei Ascend, Moore Threads, Cambricon, etc.