DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

TL;DR

DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), a dynamic sparse attention mechanism that reduces long-context inference cost by up to 2x while maintaining quality. Combined with scaled RL post-training and a novel agentic data synthesis pipeline, V3.2 achieves frontier-level performance comparable to GPT-5 and Gemini 3.0 Pro.

Key Facts

  • Architecture: identical to DeepSeek-V3.2-Exp; the only modification relative to V3.1-Terminus is the addition of DSA
  • Context window: 128K tokens
  • arXiv: 2512.02556
  • High-compute variant: DeepSeek-V3.2-Speciale (reportedly surpasses GPT-5)

Three Key Technical Contributions

1. DeepSeek Sparse Attention (DSA)

An efficient attention mechanism that reduces per-query attention complexity from O(L²) to O(L*k), where k << L.
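As a back-of-the-envelope check (using the 128K window and k = 2,048 figures below), the per-query attention work shrinks by a factor of L/k. The end-to-end gain is smaller (the ~2x cost reduction reported later in this note) because the lightning indexer still scans all L tokens, just much more cheaply:

```python
# Rough per-query attention-op comparison for DSA (illustrative arithmetic only).
L = 128 * 1024   # 128K-token context window
k = 2048         # tokens attended per query under DSA

dense_ops = L    # dense attention scores every prior token
sparse_ops = k   # DSA attends only to the top-k selected tokens
print(dense_ops // sparse_ops)   # 64
```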

The core insight: Per-head, per-sample attention is highly sparse (>90% near-zero entries), but the importance pattern varies with input and head — so static patterns don't work.

Stage 1: Lightning Indexer

A lightweight scoring module running in FP8 (low precision, very fast):

  • Multiple indexer heads (H_I) with learned weights and low-dimensional query/key projections (d^I << d)
  • Scoring formula: I_{t,s} = sum_{j=1}^{H_I} w_{t,j}^I * ReLU(q_{t,j}^I * k_s^I)
  • ReLU activation enables rapid token importance assessment without full-precision computation
  • Think of it as a cheap "pre-filter" that estimates which tokens matter
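The scoring formula above can be sketched in a few lines of NumPy. This is a toy illustration, not the real module: head count, dimensions, and variable names are made up here, and the actual indexer runs in FP8 with custom kernels.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def indexer_scores(q_idx, k_idx, w):
    """Score every context token s for the current query token t.

    q_idx: (H_I, d_I)  low-dim indexer queries for token t, one per indexer head
    k_idx: (L, d_I)    low-dim indexer keys for all L context tokens
    w:     (H_I,)      learned per-head weights w_{t,j}
    Returns I_{t,:} with shape (L,).
    """
    # ReLU(q_{t,j} . k_s) for each indexer head j and position s -> (H_I, L)
    per_head = relu(q_idx @ k_idx.T)
    # Weighted sum over indexer heads -> (L,)
    return w @ per_head

# Toy sizes; the paper uses d_I << d (the model's hidden size).
rng = np.random.default_rng(0)
H_I, d_I, L = 4, 32, 1024
scores = indexer_scores(rng.standard_normal((H_I, d_I)),
                        rng.standard_normal((L, d_I)),
                        rng.standard_normal(H_I))
print(scores.shape)  # (1024,)
```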

Stage 2: Top-k Selection + Sparse Attention

  • For each query token, retrieve KV pairs for only the top-k highest-scoring tokens
  • u_t = Attn(h_t, {c_s | I_{t,s} in Top-k(I_{t,:})})
  • k = 2,048 tokens selected per query across the full 128K context window
  • Full-precision softmax attention computed only over this sparse subset
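The selection step above amounts to "rank by indexer score, keep k, run ordinary softmax attention on the survivors." A minimal single-query NumPy sketch, with all shapes and names assumed for illustration:

```python
import numpy as np

def sparse_attention(h_t, keys, values, scores, k=256):
    """Softmax attention over only the top-k tokens ranked by indexer scores.

    h_t:    (d,)   query vector for the current token
    keys:   (L, d) full-precision keys
    values: (L, d) full-precision values
    scores: (L,)   lightning-indexer scores I_{t,:}
    """
    topk = np.argpartition(scores, -k)[-k:]       # indices of the k best-scored tokens
    logits = keys[topk] @ h_t / np.sqrt(h_t.size)  # scaled dot-product, sparse subset only
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over just k entries
    return probs @ values[topk]                    # u_t

rng = np.random.default_rng(1)
L, d = 1024, 64
u_t = sparse_attention(rng.standard_normal(d),
                       rng.standard_normal((L, d)),
                       rng.standard_normal((L, d)),
                       rng.standard_normal(L))
print(u_t.shape)  # (64,)
```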

DSA Training Procedure

Two stages after continued pre-training:

  1. Dense Warm-up (1,000 steps): Model parameters frozen; indexer trained with a KL-divergence loss, using the L1-normalized aggregated dense attention scores as the target distribution
  2. Sparse Training (15,000 steps): All parameters optimized jointly; indexer refined to align with selected tokens only. LR = 7.3e-6, total 943.7B tokens processed
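The warm-up objective from step 1 can be sketched as follows: push the indexer's softmax distribution toward the L1-normalized dense attention scores via KL divergence. Shapes, names, and the exact aggregation are my assumptions, not the paper's implementation.

```python
import numpy as np

def indexer_warmup_loss(indexer_scores, dense_attn_scores, eps=1e-9):
    """KL(target || indexer) for one query position.

    indexer_scores:    (L,) raw lightning-indexer scores I_{t,:}
    dense_attn_scores: (L,) nonnegative attention scores aggregated over dense heads
    """
    # Target: L1-normalize the aggregated dense attention scores.
    target = dense_attn_scores / (dense_attn_scores.sum() + eps)
    # Prediction: softmax over the indexer's raw scores.
    z = np.exp(indexer_scores - indexer_scores.max())
    pred = z / z.sum()
    # KL divergence, smoothed with eps for numerical safety.
    return float(np.sum(target * np.log((target + eps) / (pred + eps))))

rng = np.random.default_rng(2)
loss = indexer_warmup_loss(rng.standard_normal(128),
                           rng.random(128))  # dense scores are nonnegative
print(round(loss, 3))
```

Only the indexer receives gradients during this stage; the frozen dense attention supplies the supervision signal.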

Integration with MLA

DSA operates within Multi-head Latent Attention's MQA (Multi-Query Attention) mode, where latent vectors are shared across query heads. Custom CUDA kernels and strategic data reuse ensure computational viability.

DSA vs Other Sparse Attention Methods

| Method | Approach | Limitation / Property |
| --- | --- | --- |
| Local/sliding window | Fixed local context | Misses long-range dependencies |
| Strided/block-sparse | Fixed global pattern | Static; cannot adapt to content |
| Random sparse | Random token selection | No guarantee of selecting important tokens |
| DSA | Learned, dynamic per-head, per-sample selection | Content-adaptive; negligible quality loss |

Practical result: Up to 2x cost reduction for long-context inference with quality parity on MMLU-Pro, GPQA Diamond, and long-context reasoning benchmarks.

2. Scalable Reinforcement Learning

  • Robust RL protocols with scaled post-training compute
  • Base V3.2 performs comparably to GPT-5
  • High-compute variant (Speciale) surpasses GPT-5, on par with Gemini 3.0 Pro

3. Agentic Task Data Synthesis

  • Novel synthesis pipeline that systematically generates training data at scale
  • Integrates reasoning into tool-use scenarios
  • Produces training data for complex agentic tasks

Benchmark Highlights

  • Gold medal on 2025 IMO (International Mathematical Olympiad)
  • Gold medal on IOI (International Olympiad in Informatics)
  • Performance comparable to or exceeding GPT-5 and Gemini 3.0 Pro

Connections

My Notes