Recursive Language Models

TL;DR

RLMs let an LLM handle inputs up to two orders of magnitude beyond its context window by storing the prompt as a variable in an external environment and letting the model write code to recursively decompose, process, and aggregate over it. An 8B model post-trained this way approaches GPT-5 on long-context tasks.

Key Facts

  • Authors: Alex L. Zhang, Tim Kraska, Omar Khattab
  • Model: RLM-Qwen3-8B (post-trained from Qwen3-8B)
  • Training cost: 48 H100 hours, ~1,000 SFT samples
  • Key result: 8B model approaches GPT-5 quality on 3 of 4 long-context benchmarks
  • Code: Publicly available on GitHub

The Problem

Even with 128K-200K token context windows, many real tasks involve millions of tokens (e.g., searching 1K documents = 6-11M tokens). Existing solutions:

  • RAG/retrieval: Can't handle tasks requiring full-document reasoning, only good for needle-in-haystack
  • Map-reduce/summarization: Lossy — collapses on tasks requiring pairwise or aggregation reasoning
  • Longer context windows: O(L²) attention cost makes this expensive; DSA-style approaches help but don't solve 10M+ token tasks

Core Mechanism: REPL-Based Recursive Processing

Instead of feeding the entire prompt into the context, RLMs operate through a Read-Eval-Print Loop (REPL):

Algorithm

state ← InitREPL(prompt=P)              # prompt stored as variable, NOT in context
state ← AddFunction(state, sub_RLM_M)   # model can call itself via llm_query()
hist  ← [Metadata(state)]               # only metadata in context (length, prefix, access methods)

while True:
    code  ← LLM_M(hist)                 # model generates code to process prompt
    (state, stdout) ← REPL(state, code) # execute code, update REPL state
    hist  ← hist || code || Metadata(stdout)  # append only constant-size metadata
    if state[Final] is set:
        return state[Final]
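The loop above can be sketched in plain Python. Everything here is an illustrative stand-in, not the paper's implementation: `fake_model` plays the root LLM, `llm_query` is a stub sub-call, and `exec` over a shared dict plays the REPL.

```python
# Minimal sketch of the RLM root loop. The real system drives an actual
# LLM and sandboxes the REPL; `fake_model` and `llm_query` are stubs.

def llm_query(sub_prompt):
    """Stub sub-call; the real system invokes the base LM here."""
    return f"answer to: {sub_prompt}"

def metadata(value, prefix_len=64):
    """Constant-size view of a value: type, length, short prefix."""
    s = str(value)
    return {"type": type(value).__name__, "len": len(s), "prefix": s[:prefix_len]}

def run_rlm(prompt, model, max_iters=8):
    state = {"prompt": prompt, "llm_query": llm_query, "FINAL": None}
    history = [metadata(prompt)]      # only metadata enters the context
    for _ in range(max_iters):
        code = model(history)         # model writes code against the REPL
        exec(code, state)             # execute; results live in REPL variables
        history.append(code)          # plus constant-size metadata of stdout
        if state["FINAL"] is not None:
            return state["FINAL"]
    return None

# A toy "model" that chunks the prompt, queries each chunk, and aggregates:
def fake_model(history):
    return (
        "chunks = prompt.split('\\n')\n"
        "results = [llm_query(c) for c in chunks]\n"
        "FINAL = '; '.join(results)\n"
    )

print(run_rlm("doc line 1\ndoc line 2", fake_model))
# answer to: doc line 1; answer to: doc line 2
```

Note that the prompt never enters `history`, only its metadata does; that is the whole trick.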

Three Critical Design Choices

  1. Symbolic handle to prompt — the prompt P is a variable in the REPL, not pasted into the LLM's context window. The model accesses it programmatically (slicing, regex, indexing). This decouples processing capacity from context length.
  2. Variable storage — intermediate results stored in REPL variables, not generated autoregressively. This prevents context blowup when aggregating large results.
  3. Symbolic recursion — an llm_query() function lets the model call itself on constructed sub-problems inside arbitrary loops. The model can decompose any task into sub-tasks programmatically.

Completion Signaling

  • FINAL(answer) — returns a direct answer
  • FINAL_VAR(variable_name) — returns a REPL variable (useful for large outputs)
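A toy illustration of why the two signals exist; these dict-returning stand-ins are assumptions for the sketch, the real `FINAL`/`FINAL_VAR` are REPL built-ins:

```python
# Hypothetical stand-ins for the two completion signals.

def FINAL(answer):
    """Return a small answer directly in the model's output."""
    return {"done": True, "value": answer}

def FINAL_VAR(name, repl_vars):
    """Return a large result by naming the REPL variable that holds it,
    so it never has to be generated token-by-token into the context."""
    return {"done": True, "value": repl_vars[name]}

repl_vars = {"report": "row\n" * 100_000}            # large aggregated output
print(FINAL("42")["value"])                          # small answer: inline
print(len(FINAL_VAR("report", repl_vars)["value"]))  # large answer: by name
```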

Current Limitation: Recursion Depth

Maximum recursion depth is currently 1 level — sub-calls are plain LMs, not RLMs. Deeper recursion is left to future work.

Emergent Strategies

The model learns (without explicit instruction) to:

  • Regex-based filtering: Uses its prior knowledge to write regex patterns that narrow search space before detailed reading
  • Uniform chunking by newlines: Splits structured text into manageable pieces
  • Variables as output buffers: Accumulates unbounded-length results in REPL variables across iterations
  • Recursive decomposition: Calls llm_query() on slices of the prompt, aggregates results
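The strategies above combine naturally in the kind of REPL snippet the model emits. A sketch under assumed conditions: `prompt` and `llm_query` would be provided by the REPL; here both are stubbed.

```python
import re

# Illustrative REPL code in the style described: regex-narrow the
# haystack, then recurse on the surviving chunks via llm_query().

def llm_query(sub_prompt):          # stub standing in for the sub-model
    return "yes" if "needle" in sub_prompt else "no"

prompt = "\n".join(f"line {i}" for i in range(1000)) + "\nthe needle is here"

# 1. Regex filtering: keep only lines that could contain the target.
candidates = [l for l in prompt.split("\n") if re.search(r"needle", l)]

# 2. Recursive decomposition: query the sub-model on each candidate.
hits = [c for c in candidates if llm_query(c) == "yes"]

# 3. Output buffer: accumulate the result in a REPL variable.
answer = hits[0] if hits else "not found"
print(answer)   # the needle is here
```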

Training: RLM-Qwen3-8B

Data Generation

  1. Ran RLM(Qwen3-Coder-480B) on 750 English LongBenchPro tasks to generate expert trajectories
  2. Filtered out zero-score and single-turn trajectories (1,072 remaining)
  3. Split each root RLM turn into separate SFT samples (only root model REPL interaction, not sub-calls)
  4. Removed turns exceeding Qwen3-8B's ~100K character limit
  5. Applied programmatic corrections: 16% had FINAL answer errors, 13% had incorrect FINAL_VAR usage
  6. Final dataset: ~1,000 filtered samples
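Steps 2-4 amount to a filter over trajectories; a hedged sketch, where the field names (`score`, `turns`) are assumptions rather than the paper's schema:

```python
# Illustrative trajectory filter mirroring the curation steps above.

MAX_CHARS = 100_000   # approximate per-turn limit mentioned in the text

def keep(traj):
    return (
        traj["score"] > 0                              # drop zero-score
        and len(traj["turns"]) > 1                     # drop single-turn
        and all(len(t) <= MAX_CHARS for t in traj["turns"])  # drop oversized
    )

trajs = [
    {"score": 1.0, "turns": ["t1", "t2"]},
    {"score": 0.0, "turns": ["t1", "t2"]},   # zero score: dropped
    {"score": 1.0, "turns": ["only"]},       # single turn: dropped
]

# Each root-model turn of a kept trajectory becomes its own SFT sample.
sft_samples = [t for traj in trajs if keep(traj) for t in traj["turns"]]
print(sft_samples)   # ['t1', 't2']
```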

Training Details

  • Method: Supervised fine-tuning on corrected trajectories
  • Batch size: 64
  • Steps: 300
  • Infrastructure: 48 H100 hours
  • Library: prime-rl

Key Insight

Training focuses only on the root model's REPL interaction ability — learning to write code that decomposes problems and calls llm_query(). The sub-calls are handled by the base LM. "The major hurdle is learning to operate as the root model."

Benchmark Results

Tasks and Complexity

Task                       What it tests                         Complexity  Input size
S-NIAH                     Needle in haystack across scales      O(1)        2^13 to 2^18 tokens
BrowseComp-Plus (1K docs)  Finding info across 1K web documents  O(|P|)      6-11M tokens
OOLONG                     Aggregation reasoning                 O(|P|)      131K tokens
OOLONG-Pairs               Pairwise comparison across entries    O(|P|²)     32K tokens

Results

Task             GPT-5 RLM  Qwen3-480B RLM  Notes
BrowseComp-Plus  91.3%      44.7%           Beats RAG+BM25 by ~29%
OOLONG           56.5%      48.0%           Summary agents get only 0.1% on pairs variant
OOLONG-Pairs     58.0%      23.1%           Base models collapse; quadratic complexity

Key Takeaway

RLM-Qwen3-8B (8B parameters, 48 H100 hours of training) approaches vanilla GPT-5 on 3 of 4 long-context tasks. Small model + recursive decomposition ≈ frontier model.

Comparison to Alternatives

Approach                Strength                                       Weakness vs RLM
RAG / BM25              Fast retrieval                                 Can't handle full-document reasoning; RLM beats by ~29%
Summarization agents    Good on single-fact lookup (70.5% BrowseComp)  Lossy — 0.1% on OOLONG-Pairs vs RLM's 58%
CodeAct + sub-calls     Can execute code                               Lacks symbolic handle to prompt; can't scale beyond context
Longer context windows  Simple                                         O(L²) cost; still caps out; doesn't help at 6-11M tokens

Cost: Median RLM run is cheaper than median base model run. However, tail-end cost can spike significantly (outlier trajectories with many recursive calls).

Limitations

  1. Sequential execution is slow — llm_query() calls are blocking; no async/parallel sub-calls yet
  2. FINAL() tagging is brittle — model sometimes outputs plans or thoughts as final answers
  3. Small models struggle — models without strong coding ability can't write effective REPL code
  4. Thinking models hit token limits — Qwen3-235B hit max output tokens on some tasks
  5. Variable quality across architectures — GPT-5 and Qwen3-Coder behave very differently as RLMs
  6. Max recursion depth = 1 — sub-calls are plain LMs, not RLMs; deeper recursion is future work
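On limitation 1: independent sub-calls could in principle be fanned out concurrently. A purely illustrative sketch with a stubbed `llm_query` and a thread pool; nothing here is claimed to be in the paper's system.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: the current system issues llm_query() calls one at
# a time. Independent chunks could be dispatched in parallel like this.

def llm_query(sub_prompt):      # stub standing in for a real sub-call
    return sub_prompt.upper()

chunks = ["alpha", "beta", "gamma"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order, so aggregation stays deterministic.
    results = list(pool.map(llm_query, chunks))

print(results)   # ['ALPHA', 'BETA', 'GAMMA']
```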

Connections

  • Addresses the same long-context problem that DSA tackles from the architecture side — RLMs solve it from the inference/algorithm side
  • Builds on Transformer O(L²) attention limitations as motivation
  • Conceptually related to agentic approaches in GLM-5

My Notes