Recursive Language Models

TL;DR

RLMs let an LLM handle inputs up to two orders of magnitude beyond its context window by storing the prompt as a variable in an external environment and letting the model write code to recursively decompose, process, and aggregate over it. An 8B model post-trained this way approaches GPT-5 on long-context tasks.

Key Facts

  • Authors: Alex L. Zhang, Tim Kraska, Omar Khattab
  • Model: RLM-Qwen3-8B (post-trained from Qwen3-8B)
  • Training cost: 48 H100 hours, ~1,000 SFT samples
  • Key result: 8B model approaches GPT-5 quality on 3 of 4 long-context benchmarks
  • Code: Publicly available on GitHub

The Problem

Even with 128K-200K token context windows, many real tasks involve millions of tokens (e.g., searching 1K documents = 6-11M tokens). Existing solutions:

  • RAG/retrieval: Can't handle tasks requiring full-document reasoning, only good for needle-in-haystack
  • Map-reduce/summarization: Lossy — collapses on tasks requiring pairwise or aggregation reasoning
  • Longer context windows: O(L²) attention cost makes this expensive; DSA-style approaches help but don't solve 10M+ token tasks

Core Mechanism: REPL-Based Recursive Processing

Instead of feeding the entire prompt into the context, RLMs operate through a Read-Eval-Print Loop (REPL):

Algorithm

state ← InitREPL(prompt=P)              # prompt stored as variable, NOT in context
state ← AddFunction(state, sub_RLM_M)   # model can call itself via llm_query()
hist  ← [Metadata(state)]               # only metadata in context (length, prefix, access methods)

while True:
    code  ← LLM_M(hist)                 # model generates code to process prompt
    (state, stdout) ← REPL(state, code) # execute code, update REPL state
    hist  ← hist || code || Metadata(stdout)  # append only constant-size metadata
    if state[Final] is set:
        return state[Final]
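The loop above can be sketched in plain Python. Everything here is an illustrative stand-in, not the paper's implementation: `fake_model` plays the root LLM, `llm_query` is a stub sub-call, and `exec` over a shared dict plays the REPL.

```python
# Minimal sketch of the RLM root loop. The real system drives an actual
# LLM and sandboxes the REPL; `fake_model` and `llm_query` are stubs.

def llm_query(sub_prompt):
    """Stub sub-call; the real system invokes the base LM here."""
    return f"answer to: {sub_prompt}"

def metadata(value, prefix_len=64):
    """Constant-size view of a value: type, length, short prefix."""
    s = str(value)
    return {"type": type(value).__name__, "len": len(s), "prefix": s[:prefix_len]}

def run_rlm(prompt, model, max_iters=8):
    state = {"prompt": prompt, "llm_query": llm_query, "FINAL": None}
    history = [metadata(prompt)]      # only metadata enters the context
    for _ in range(max_iters):
        code = model(history)         # model writes code against the REPL
        exec(code, state)             # execute; results live in REPL variables
        history.append(code)          # plus constant-size metadata of stdout
        if state["FINAL"] is not None:
            return state["FINAL"]
    return None

# A toy "model" that chunks the prompt, queries each chunk, and aggregates:
def fake_model(history):
    return (
        "chunks = prompt.split('\\n')\n"
        "results = [llm_query(c) for c in chunks]\n"
        "FINAL = '; '.join(results)\n"
    )

print(run_rlm("doc line 1\ndoc line 2", fake_model))
# answer to: doc line 1; answer to: doc line 2
```

Note that the prompt never enters `history`, only its metadata does; that is the whole trick.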

Three Critical Design Choices

  1. Symbolic handle to prompt — the prompt P is a variable in the REPL, not pasted into the LLM's context window. The model accesses it programmatically (slicing, regex, indexing). This decouples processing capacity from context length.
  2. Variable storage — intermediate results stored in REPL variables, not generated autoregressively. This prevents context blowup when aggregating large results.
  3. Symbolic recursion — an llm_query() function lets the model call itself on constructed sub-problems inside arbitrary loops. The model can decompose any task into sub-tasks programmatically.

Completion Signaling

  • FINAL(answer) — returns a direct answer
  • FINAL_VAR(variable_name) — returns a REPL variable (useful for large outputs)
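A toy illustration of why the two signals exist; these dict-returning stand-ins are assumptions for the sketch, the real `FINAL`/`FINAL_VAR` are REPL built-ins:

```python
# Hypothetical stand-ins for the two completion signals.

def FINAL(answer):
    """Return a small answer directly in the model's output."""
    return {"done": True, "value": answer}

def FINAL_VAR(name, repl_vars):
    """Return a large result by naming the REPL variable that holds it,
    so it never has to be generated token-by-token into the context."""
    return {"done": True, "value": repl_vars[name]}

repl_vars = {"report": "row\n" * 100_000}            # large aggregated output
print(FINAL("42")["value"])                          # small answer: inline
print(len(FINAL_VAR("report", repl_vars)["value"]))  # large answer: by name
```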

Current Limitation: Recursion Depth

Maximum recursion depth is currently 1 level — sub-calls are plain LMs, not RLMs. Deeper recursion is left to future work.

Emergent Strategies

The model learns (without explicit instruction) to:

  • Regex-based filtering: Uses its prior knowledge to write regex patterns that narrow search space before detailed reading
  • Uniform chunking by newlines: Splits structured text into manageable pieces
  • Variables as output buffers: Accumulates unbounded-length results in REPL variables across iterations
  • Recursive decomposition: Calls llm_query() on slices of the prompt, aggregates results
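The strategies above combine naturally in the kind of REPL snippet the model emits. A sketch under assumed conditions: `prompt` and `llm_query` would be provided by the REPL; here both are stubbed.

```python
import re

# Illustrative REPL code in the style described: regex-narrow the
# haystack, then recurse on the surviving chunks via llm_query().

def llm_query(sub_prompt):          # stub standing in for the sub-model
    return "yes" if "needle" in sub_prompt else "no"

prompt = "\n".join(f"line {i}" for i in range(1000)) + "\nthe needle is here"

# 1. Regex filtering: keep only lines that could contain the target.
candidates = [l for l in prompt.split("\n") if re.search(r"needle", l)]

# 2. Recursive decomposition: query the sub-model on each candidate.
hits = [c for c in candidates if llm_query(c) == "yes"]

# 3. Output buffer: accumulate the result in a REPL variable.
answer = hits[0] if hits else "not found"
print(answer)   # the needle is here
```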

Training: RLM-Qwen3-8B

Data Generation

  1. Ran RLM(Qwen3-Coder-480B) on 750 English LongBenchPro tasks to generate expert trajectories
  2. Filtered out zero-score and single-turn trajectories (1,072 remaining)
  3. Split each root RLM turn into separate SFT samples (only root model REPL interaction, not sub-calls)
  4. Removed turns exceeding Qwen3-8B's ~100K character limit
  5. Applied programmatic corrections: 16% had FINAL answer errors, 13% had incorrect FINAL_VAR usage
  6. Final dataset: ~1,000 filtered samples
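Steps 2-4 amount to a filter over trajectories; a hedged sketch, where the field names (`score`, `turns`) are assumptions rather than the paper's schema:

```python
# Illustrative trajectory filter mirroring the curation steps above.

MAX_CHARS = 100_000   # approximate per-turn limit mentioned in the text

def keep(traj):
    return (
        traj["score"] > 0                              # drop zero-score
        and len(traj["turns"]) > 1                     # drop single-turn
        and all(len(t) <= MAX_CHARS for t in traj["turns"])  # drop oversized
    )

trajs = [
    {"score": 1.0, "turns": ["t1", "t2"]},
    {"score": 0.0, "turns": ["t1", "t2"]},   # zero score: dropped
    {"score": 1.0, "turns": ["only"]},       # single turn: dropped
]

# Each root-model turn of a kept trajectory becomes its own SFT sample.
sft_samples = [t for traj in trajs if keep(traj) for t in traj["turns"]]
print(sft_samples)   # ['t1', 't2']
```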

Training Details

  • Method: Supervised fine-tuning on corrected trajectories
  • Batch size: 64
  • Steps: 300
  • Infrastructure: 48 H100 hours
  • Library: prime-rl

Key Insight

Training focuses only on the root model's REPL interaction ability — learning to write code that decomposes problems and calls llm_query(). The sub-calls are handled by the base LM. "The major hurdle is learning to operate as the root model."

Benchmark Results

Tasks and Complexity

Task                       What it tests                         Complexity  Input size
S-NIAH                     Needle in haystack across scales      O(1)        2^13 to 2^18 tokens
BrowseComp-Plus (1K docs)  Finding info across 1K web documents  O(|P|)      6-11M tokens
OOLONG                     Aggregation reasoning                 O(|P|)      131K tokens
OOLONG-Pairs               Pairwise comparison across entries    O(|P|²)     32K tokens

Results

Task             GPT-5 RLM  Qwen3-480B RLM  Notes
BrowseComp-Plus  91.3%      44.7%           Beats RAG+BM25 by ~29%
OOLONG           56.5%      48.0%           Summary agents get only 0.1% on pairs variant
OOLONG-Pairs     58.0%      23.1%           Base models collapse; quadratic complexity

Key Takeaway

RLM-Qwen3-8B (8B parameters, 48 H100 hours of training) approaches vanilla GPT-5 on 3 of 4 long-context tasks. Small model + recursive decomposition ≈ frontier model.

Comparison to Alternatives

Approach                Strength                                       Weakness vs RLM
RAG / BM25              Fast retrieval                                 Can't handle full-document reasoning; RLM beats by ~29%
Summarization agents    Good on single-fact lookup (70.5% BrowseComp)  Lossy — 0.1% on OOLONG-Pairs vs RLM's 58%
CodeAct + sub-calls     Can execute code                               Lacks symbolic handle to prompt; can't scale beyond context
Longer context windows  Simple                                         O(L²) cost; still caps out; doesn't help at 6-11M tokens

Cost: Median RLM run is cheaper than median base model run. However, tail-end cost can spike significantly (outlier trajectories with many recursive calls).

Limitations

  1. Sequential execution is slow — llm_query() calls are blocking; no async/parallel sub-calls yet
  2. FINAL() tagging is brittle — model sometimes outputs plans or thoughts as final answers
  3. Small models struggle — models without strong coding ability can't write effective REPL code
  4. Thinking models hit token limits — Qwen3-235B hit max output tokens on some tasks
  5. Variable quality across architectures — GPT-5 and Qwen3-Coder behave very differently as RLMs
  6. Max recursion depth = 1 — sub-calls are plain LMs, not RLMs; deeper recursion is future work
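On limitation 1: independent sub-calls could in principle be fanned out concurrently. A purely illustrative sketch with a stubbed `llm_query` and a thread pool; nothing here is claimed to be in the paper's system.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: the current system issues llm_query() calls one at
# a time. Independent chunks could be dispatched in parallel like this.

def llm_query(sub_prompt):      # stub standing in for a real sub-call
    return sub_prompt.upper()

chunks = ["alpha", "beta", "gamma"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order, so aggregation stays deterministic.
    results = list(pool.map(llm_query, chunks))

print(results)   # ['ALPHA', 'BETA', 'GAMMA']
```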

Connections

  • Addresses the same long-context problem that DSA tackles from the architecture side — RLMs solve it from the inference/algorithm side
  • Builds on Transformer O(L²) attention limitations as motivation
  • Conceptually related to agentic approaches in GLM-5

My Notes