GLM-5: From Vibe Coding to Agentic Engineering

TL;DR

Zhipu AI (Z.ai) released GLM-5, a 744B parameter MoE model (40B active) targeting complex systems engineering and long-horizon agentic tasks. Open-sourced under MIT license. Best-in-class among open-source models on reasoning, coding, and agentic benchmarks.

Key Facts

  • Parameters: 744B total, 40B active (MoE)
  • Pre-training data: 28.5T tokens (up from 23T in GLM-4.5)
  • Context window: 200K tokens (uses DeepSeek Sparse Attention)
  • License: MIT
  • Hardware: Trained entirely on Huawei Ascend chips (MindSpore framework) — no NVIDIA dependency
  • Weights: HuggingFace / ModelScope
  • Pricing (OpenRouter): ~$0.80-1.00/M input, ~$2.56-3.20/M output
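As a quick sanity check on the quoted rates, a minimal cost estimator (the function name and the choice of the upper end of the range are my own, not from any official pricing page):

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float = 1.00, out_rate: float = 3.20) -> float:
    """USD cost of one call; rates are per million tokens,
    taken from the high end of the quoted OpenRouter range."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 128K-token prompt with a 4K-token answer:
print(round(call_cost(128_000, 4_000), 4))  # 0.1408
```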

Scaling from GLM-4.5

|                     | GLM-4.5 | GLM-5 |
|---------------------|---------|-------|
| Total params        | 355B    | 744B  |
| Active params       | 32B     | 40B   |
| Pre-training tokens | 23T     | 28.5T |

Key Technical Contributions

Slime — Asynchronous RL Infrastructure

  • Novel async RL infrastructure for post-training
  • Improves training throughput and efficiency
  • Enables more fine-grained post-training iterations
  • Addresses the challenge of deploying RL at scale for LLMs

DeepSeek Sparse Attention (DSA)

GLM-5 integrates DSA from the DeepSeek-V3.2 Paper to enable affordable 200K context.

The problem: Standard attention is O(L²) — every token attends to every other token. But in practice >90% of attention weights are near-zero, and which tokens matter varies per input and per head.

Two-stage pipeline:

  1. Lightning Indexer — a cheap FP8 scoring module that quickly estimates token importance:

    • Uses multiple small projection heads (dimension d^I << d)
    • Scoring formula: I_{t,s} = sum_{j=1}^{H_I} w_{t,j}^I * ReLU(q_{t,j}^I * k_s^I)
    • Each head's ReLU-activated score is combined via the learned weights w_{t,j}^I
    • Runs in FP8 for speed — acts as a fast "pre-filter"
  2. Top-k Selection + Sparse Attention — full-precision attention on only the selected tokens:

    • For each query token, picks only the top-k highest-scoring candidates
    • In practice: 2,048 tokens selected per query across a 128K context window
    • u_t = Attn(h_t, {c_s | I_{t,s} in Top-k(I_{t,:})})
    • Reduces complexity from O(L²) to O(L*k)
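The two stages above can be sketched end to end with toy tensors. This is an illustrative reconstruction from the formulas in this note, not the DeepSeek implementation; dimensions and variable names (`d_idx`, `n_idx_heads`) are made up, and the real indexer runs in FP8:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, d_idx, n_idx_heads, k = 64, 32, 8, 4, 16  # k << L: tokens kept per query

h = rng.standard_normal((L, d))                       # hidden states

# Stage 1: Lightning Indexer — cheap importance scores
# I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s])
q_idx = rng.standard_normal((L, n_idx_heads, d_idx))  # small indexer queries
k_idx = rng.standard_normal((L, d_idx))               # small indexer keys
w_idx = rng.standard_normal((L, n_idx_heads))         # learned per-head weights
dots = np.einsum('tjd,sd->tjs', q_idx, k_idx)         # q . k per indexer head
I = np.einsum('tj,tjs->ts', w_idx, np.maximum(dots, 0.0))

# Causal mask: query t may only select keys s <= t
I = np.where(np.tril(np.ones((L, L), bool)), I, -np.inf)

# Stage 2: top-k selection, then full-precision attention over survivors only
topk = np.argpartition(-I, kth=k - 1, axis=1)[:, :k]
out = np.empty_like(h)
for t in range(L):
    sel = topk[t][I[t, topk[t]] > -np.inf]            # drop masked-out picks
    logits = h[t] @ h[sel].T / np.sqrt(d)             # attention over k keys, not L
    p = np.exp(logits - logits.max())
    p /= p.sum()
    out[t] = p @ h[sel]

print(out.shape)  # (64, 32)
```

Per query, the expensive softmax attention touches only `k` keys, which is where the O(L²) → O(L·k) reduction comes from; the indexer itself is still quadratic but so cheap (tiny heads, FP8) that it doesn't dominate.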

Training procedure (two stages after continued pre-training):

  1. Dense Warm-up (1,000 steps): Model frozen, indexer trained via KL divergence against aggregated attention scores (L1-normalized attention weights as target)
  2. Sparse Training (15,000 steps): All parameters optimized jointly, indexer refined. LR=7.3e-6, 943.7B tokens processed
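The warm-up objective in step 1 can be sketched as follows: the frozen model's attention weights, aggregated over heads and L1-normalized per query, become the target distribution, and the indexer's softmaxed scores are pulled toward it with a KL loss. Toy arrays stand in for real model tensors; the exact aggregation details are my assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
L, n_heads = 8, 4

# Frozen model's full attention weights -> target distribution p per query
attn = rng.random((n_heads, L, L))       # (head, query, key)
p = attn.sum(axis=0)                     # aggregate over heads
p /= p.sum(axis=1, keepdims=True)        # L1-normalize each query row

# Indexer's current scores -> predicted distribution q per query
scores = rng.standard_normal((L, L))
q = np.exp(scores - scores.max(axis=1, keepdims=True))
q /= q.sum(axis=1, keepdims=True)

# KL(p || q), averaged over queries — the only loss during warm-up,
# since the main model is frozen and only the indexer gets gradients
eps = 1e-9
kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1).mean()
print(kl >= 0)
```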

Why better than older sparse methods: Local windows miss long-range dependencies; fixed patterns (strided, block-sparse) can't adapt to content. DSA is dynamic and content-adaptive per head and per sample.

Practical impact: Up to 2x cost reduction for long-context inference with negligible quality loss. This is how GLM-5 offers 200K context affordably.

See the DeepSeek-V3.2 paper for full details.

Benchmark Highlights

Reasoning

  • HLE (text-only): 30.5 (vs Claude Opus 4.5: 28.4, GPT-5.2: 35.4)
  • HLE w/ Tools: 50.4 (vs Claude Opus 4.5: 43.4, GPT-5.2: 45.5)
  • AIME 2026 I: 92.7
  • GPQA-Diamond: 86.0

Coding

  • SWE-bench Verified: 77.8 (vs Claude Opus 4.5: 80.9, GPT-5.2: 80.0)
  • SWE-bench Multilingual: 73.3 (vs Claude Opus 4.5: 77.5)
  • Terminal-Bench 2.0 (Claude Code): 56.2 / 61.1 verified (vs Claude Opus 4.5: 57.9)
  • CyberGym: 43.2 (vs Claude Opus 4.5: 50.6)

Agentic

  • BrowseComp w/ Context Manage: 75.9 (vs Claude Opus 4.5: 67.8)
  • Vending Bench 2: $4,432 (vs Claude Opus 4.5: $4,967, Gemini 3.0 Pro: $5,478)
  • tau2-Bench: 89.7 (vs Claude Opus 4.5: 91.6)

Notable Observations

  • Best open-source model on most benchmarks, closing gap with frontier closed models
  • BrowseComp (with context management) is a standout — beats all closed models listed
  • HLE with tools also beats closed models — strong tool-use capability
  • Coding still slightly behind Claude Opus 4.5 and GPT-5.2 but competitive
  • Vending Bench 2 (long-horizon planning): competitive with frontier, #1 open-source
  • Can generate .docx, .pdf, .xlsx files directly — "Office" capability
  • Compatible with Claude Code, OpenClaw, and other coding agents
  • Supports non-NVIDIA chips: Huawei Ascend, Moore Threads, Cambricon, etc.

My Notes