Recursive Language Models
TL;DR
RLMs let an LLM handle inputs up to two orders of magnitude beyond its context window by treating the prompt as an external environment variable and letting the model write code to recursively decompose, process, and aggregate over it. An 8B model post-trained this way approaches GPT-5 on long-context tasks.
Key Facts
- Authors: Alex L. Zhang, Tim Kraska, Omar Khattab
- Model: RLM-Qwen3-8B (post-trained from Qwen3-8B)
- Training cost: 48 H100 hours, ~1,000 SFT samples
- Key result: 8B model approaches GPT-5 quality on 3 of 4 long-context benchmarks
- Code: Publicly available on GitHub
The Problem
Even with 128K-200K token context windows, many real tasks involve millions of tokens (e.g., searching 1K documents = 6-11M tokens). Existing solutions:
- RAG/retrieval: Can't handle tasks requiring full-document reasoning, only good for needle-in-haystack
- Map-reduce/summarization: Lossy — collapses on tasks requiring pairwise or aggregation reasoning
- Longer context windows: O(L²) attention cost makes this expensive; DSA-style approaches help but don't solve 10M+ token tasks
Core Mechanism: REPL-Based Recursive Processing
Instead of feeding the entire prompt into the context, RLMs operate through a Read-Eval-Print Loop (REPL):
Algorithm
```
state ← InitREPL(prompt=P)               # prompt stored as a variable, NOT in context
state ← AddFunction(state, sub_RLM_M)    # model can call itself via llm_query()
hist  ← [Metadata(state)]                # only metadata in context (length, prefix, access methods)
while True:
    code ← LLM_M(hist)                   # model generates code to process the prompt
    (state, stdout) ← REPL(state, code)  # execute code, update REPL state
    hist ← hist || code || Metadata(stdout)  # append only constant-size metadata
    if state[Final] is set:
        return state[Final]
```
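The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `llm_generate_code` stands in for the root LM (here it emits one hard-coded step), and the REPL is simulated with `exec` over a variable environment.

```python
def llm_generate_code(history):
    # Stand-in for the root LM: a real system would ask the model for the
    # next code snippet given `history`. One hard-coded step suffices here.
    return "final = prompt.count('needle')"

def run_rlm(prompt, max_turns=8):
    env = {"prompt": prompt}                  # prompt is a REPL variable, not context
    history = [f"len(prompt)={len(prompt)}"]  # only metadata reaches the LM
    for _ in range(max_turns):
        code = llm_generate_code(history)
        exec(code, {}, env)                   # REPL step: run model-written code
        history.append(code)                  # history grows by code, not by data
        if "final" in env:                    # completion signal set in the REPL
            return env["final"]
    return None
```

Note that the multi-million-token prompt only ever lives inside `env`; the model's context (`history`) holds code and constant-size metadata.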
Three Critical Design Choices
- Symbolic handle to prompt — the prompt P is a variable in the REPL, not pasted into the LLM's context window. The model accesses it programmatically (slicing, regex, indexing). This decouples processing capacity from context length.
- Variable storage — intermediate results stored in REPL variables, not generated autoregressively. This prevents context blowup when aggregating large results.
- Symbolic recursion — an `llm_query()` function lets the model call itself on constructed sub-problems inside arbitrary loops. The model can decompose any task into sub-tasks programmatically.
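The three design choices map directly onto the kind of code the root model writes inside the REPL. A hypothetical session, with `llm_query` stubbed out (in the real system it dispatches to a sub-LM):

```python
import re

# `prompt` is already bound as a REPL variable; this toy version stands in
# for a multi-million-token input.
prompt = "\n".join(f"doc {i}: ERROR disk full" if i == 7 else f"doc {i}: ok"
                   for i in range(10))

def llm_query(question, context):
    # Stub sub-call: the real RLM sends this to a plain LM.
    return f"answer based on {len(context)} chars"

# 1) Symbolic handle: regex-narrow the haystack without reading it all.
hits = [line for line in prompt.split("\n") if re.search(r"ERROR", line)]

# 2) Variable storage: the (possibly large) result stays in a REPL variable.
evidence = "\n".join(hits)

# 3) Symbolic recursion: call the sub-LM only on the relevant slice.
final = llm_query("Which doc reports an error?", evidence)
```

Only `final` and small metadata ever need to enter the root model's context; `prompt`, `hits`, and `evidence` remain symbolic.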
Completion Signaling
- `FINAL(answer)` — returns a direct answer
- `FINAL_VAR(variable_name)` — returns a REPL variable (useful for large outputs)
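The point of the second signal is that a large result never has to be generated token-by-token. A sketch of how the two signals could be resolved; the tag parsing here is a plausible guess, not the paper's exact format:

```python
def finish(env, signal):
    # Hypothetical resolver for the two completion signals.
    if signal.startswith("FINAL_VAR("):
        name = signal[len("FINAL_VAR("):-1]
        return env[name]                  # large output returned by reference
    if signal.startswith("FINAL("):
        return signal[len("FINAL("):-1]   # small answer returned inline
    raise ValueError("no completion signal")

env = {"report": "x" * 10_000}            # too big to emit autoregressively
short = finish(env, "FINAL(42)")
large = finish(env, "FINAL_VAR(report)")
```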
Current Limitation: Recursion Depth
Maximum recursion depth is currently 1 level — sub-calls are plain LMs, not RLMs. Deeper recursion is left to future work.
Emergent Strategies
The model learns (without explicit instruction) to:
- Regex-based filtering: Uses its prior knowledge to write regex patterns that narrow search space before detailed reading
- Uniform chunking by newlines: Splits structured text into manageable pieces
- Variables as output buffers: Accumulates unbounded-length results in REPL variables across iterations
- Recursive decomposition: Calls `llm_query()` on slices of the prompt, then aggregates the results
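Two of these strategies compose naturally: chunk the input by newlines, process each chunk with a sub-call, and accumulate results in a REPL variable that acts as an output buffer. A sketch with a stubbed `llm_query` (names are illustrative):

```python
def llm_query(task, chunk):
    # Stub: a real sub-call would summarize or answer over the chunk.
    return f"[{len(chunk.splitlines())} lines processed]"

prompt = "\n".join(f"entry {i}" for i in range(100))  # stands in for a huge input

lines = prompt.split("\n")
chunk_size = 25
buffer = []                                 # REPL variable as an output buffer
for start in range(0, len(lines), chunk_size):
    chunk = "\n".join(lines[start:start + chunk_size])
    buffer.append(llm_query("summarize", chunk))

final_var = "\n".join(buffer)               # aggregated without context blowup
```

The buffer grows with the input, but the root model's context does not: each loop iteration is ordinary Python, not generation.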
Training: RLM-Qwen3-8B
Data Generation
- Ran RLM(Qwen3-Coder-480B) on 750 English LongBenchPro tasks to generate expert trajectories
- Filtered out zero-score and single-turn trajectories (1,072 remaining)
- Split each root RLM turn into separate SFT samples (only root model REPL interaction, not sub-calls)
- Removed turns exceeding Qwen3-8B's ~100K character limit
- Applied programmatic corrections: 16% had FINAL answer errors, 13% had incorrect FINAL_VAR usage
- Final dataset: ~1,000 filtered samples
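The filtering steps above amount to a simple predicate over trajectories. The shape below is my reconstruction for illustration; the field names and exact thresholds are assumptions, not the released pipeline:

```python
MAX_CHARS = 100_000  # the ~100K character turn limit mentioned above

def keep(traj):
    # Mirrors the stated filters: drop zero-score trajectories, drop
    # single-turn trajectories, drop any trajectory with an oversized turn.
    return (traj["score"] > 0
            and traj["num_turns"] > 1
            and all(len(t) <= MAX_CHARS for t in traj["turns"]))

trajectories = [
    {"score": 1.0, "num_turns": 3, "turns": ["a", "b", "c"]},  # kept
    {"score": 0.0, "num_turns": 3, "turns": ["a", "b", "c"]},  # zero score
    {"score": 1.0, "num_turns": 1, "turns": ["a"]},            # single turn
]
sft_pool = [t for t in trajectories if keep(t)]
```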
Training Details
- Method: Supervised fine-tuning on corrected trajectories
- Batch size: 64
- Steps: 300
- Infrastructure: 48 H100 hours
- Library: prime-rl
Key Insight
Training focuses only on the root model's REPL interaction ability — learning to write code that decomposes problems and calls llm_query(). The sub-calls are handled by the base LM. "The major hurdle is learning to operate as the root model."
Benchmark Results
Tasks and Complexity
| Task | What it tests | Complexity | Input size |
|---|---|---|---|
| S-NIAH | Needle in haystack across scales | O(1) | 2^13 to 2^18 tokens |
| BrowseComp-Plus (1K docs) | Finding info across 1K web documents | O(|P|) | 6-11M tokens |
| OOLONG | Aggregation reasoning | O(|P|) | 131K tokens |
| OOLONG-Pairs | Pairwise comparison across entries | O(|P|²) | 32K tokens |
Results
| Task | GPT-5 RLM | Qwen3-480B RLM | Notes |
|---|---|---|---|
| BrowseComp-Plus | 91.3% | 44.7% | Beats RAG+BM25 by ~29% |
| OOLONG | 56.5% | 48.0% | Summary agents get only 0.1% on pairs variant |
| OOLONG-Pairs | 58.0% | 23.1% | Base models collapse; quadratic complexity |
Key Takeaway
RLM-Qwen3-8B (8B parameters, 48 H100 hours of training) approaches vanilla GPT-5 on 3 of 4 long-context tasks. Small model + recursive decomposition ≈ frontier model.
Comparison to Alternatives
| Approach | Strength | Weakness vs RLM |
|---|---|---|
| RAG / BM25 | Fast retrieval | Can't handle full-document reasoning; RLM beats by ~29% |
| Summarization agents | Good on single-fact lookup (70.5% BrowseComp) | Lossy — 0.1% on OOLONG-Pairs vs RLM's 58% |
| CodeAct + sub-calls | Can execute code | Lacks symbolic handle to prompt; can't scale beyond context |
| Longer context windows | Simple | O(L²) cost; still caps out; doesn't help at 6-11M tokens |
Cost: Median RLM run is cheaper than median base model run. However, tail-end cost can spike significantly (outlier trajectories with many recursive calls).
Limitations
- Sequential execution is slow — `llm_query()` calls are blocking; no async/parallel sub-calls yet
- FINAL() tagging is brittle — model sometimes outputs plans or thoughts as final answers
- Small models struggle — models without strong coding ability can't write effective REPL code
- Thinking models hit token limits — Qwen3-235B hit max output tokens on some tasks
- Variable quality across architectures — GPT-5 and Qwen3-Coder behave very differently as RLMs
- Max recursion depth = 1 — sub-calls are plain LMs, not RLMs; deeper recursion is future work
Connections
- Addresses the same long-context problem that DSA tackles from the architecture side — RLMs solve it from the inference/algorithm side
- Builds on Transformer O(L²) attention limitations as motivation
- Conceptually related to agentic approaches in GLM-5