Recursive Language Models Win Long-Context Benchmarks

Recursive Language Models (RLMs) are outperforming conventional agentic architectures on long-context benchmarks, according to a detailed technical breakdown published May 16, 2026 by Avishek Biswas in Towards Data Science. Separately, two arXiv preprints published the same week propose distinct methods for making LLM reasoning more robust and more principled — one through ensemble-based action verification, the other through a data-preprocessing layer that restructures relational information before training.

What Recursive Language Models Actually Do

Most agentic LLM frameworks (ReAct, CodeAct, vanilla subagents) share a structural weakness: they replicate context at every step rather than passing it by reference. According to Biswas, by-reference context passing is the single missing piece, and its absence is what causes these architectures to degrade on long, multi-step tasks. RLMs supply it by treating context as a pointer rather than a payload, allowing the model to recurse through subtasks without ballooning the effective token window at each call.
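
The difference between the two patterns can be sketched in a few lines of Python. This is an illustrative toy, not Biswas's implementation: ContextStore, prompt_by_value, and prompt_by_reference are hypothetical names, and character counts stand in for tokens.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContextStore:
    """Hypothetical store: large text lives here, callers pass short keys."""
    _items: Dict[str, str] = field(default_factory=dict)

    def put(self, text: str) -> str:
        key = f"ctx{len(self._items)}"
        self._items[key] = text
        return key

    def read(self, key: str, start: int = 0, end: Optional[int] = None) -> str:
        # A sub-call reads only the slice it needs.
        return self._items[key][start:end]


def prompt_by_value(task: str, context: str) -> str:
    # Replicating pattern: the full context is copied into every prompt.
    return f"{task}\n\n{context}"


def prompt_by_reference(task: str, key: str) -> str:
    # By-reference pattern: the prompt carries only a handle to the context.
    return f"{task}\n\nContext handle: {key} (read slices on demand)"


store = ContextStore()
document = "lorem ipsum " * 10_000  # stand-in for a long input
key = store.put(document)

steps = 5
by_value_chars = sum(len(prompt_by_value(f"step {i}", document)) for i in range(steps))
by_ref_chars = sum(len(prompt_by_reference(f"step {i}", key)) for i in range(steps))
print(by_value_chars, by_ref_chars)
```

In the by-value case the total prompt size grows with the number of steps times the document length. In the by-reference case it stays roughly constant per step, and the cost of reading the document is paid only for the slices a sub-call actually requests.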

Biswas illustrates the failure mode with a deliberately simple experiment: ask a standard agentic setup to generate 50 fruit names and count the letter “R” in each, returning the result as a dictionary. Then scale that to a nested version — fruits, countries, and animals, 50 entries each. Standard frameworks begin making counting errors and structural mistakes as the task grows. The RLM, by contrast, maintains accuracy because it doesn’t need to re-encode the full prior context at every recursive step.
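
The shape of the nested task can be sketched as a recursive decomposition in which each leaf call sees exactly one item. This is a simplified stand-in under stated assumptions: the lists are three hard-coded entries rather than 50 model-generated ones, and leaf_call is a deterministic string count where a real RLM would issue a sub-model call. The names recurse and leaf_call are hypothetical.

```python
def leaf_call(word: str) -> int:
    # Stand-in for a sub-model call that sees a single word and nothing else.
    return word.lower().count("r")


def recurse(task):
    """Split a nested task until each call handles exactly one item."""
    if isinstance(task, dict):
        return {name: recurse(sub) for name, sub in task.items()}
    if isinstance(task, list):
        return {item: recurse(item) for item in task}
    return leaf_call(task)


# Hard-coded sample entries (assumption; the original experiment uses 50 each).
nested_task = {
    "fruits": ["strawberry", "pear", "kiwi"],
    "countries": ["Peru", "Norway", "Chile"],
    "animals": ["tiger", "zebra", "terrier"],
}

result = recurse(nested_task)
print(result["fruits"])  # {'strawberry': 3, 'pear': 1, 'kiwi': 0}
```

Because every leaf is processed in isolation, adding more categories or entries increases the number of calls but not the amount of prior context any single call has to carry.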

Biswas spent roughly a month implementing RLMs, running benchmarks, and producing a 50-minute tutorial video on the architecture. He fielded more than 100 questions on YouTube and X during that period, and the Towards Data Science article is a synthesis of the nuances that emerged from those exchanges.

VeGAS: Verification Before Action

A separate line of research, published as arXiv:2605.12620, targets a different failure mode: embodied agents that commit too quickly to a single decoded action. The paper introduces Verifier-Guided Action Selection (VeGAS), a test-time framework that samples an ensemble of candidate actions and uses a generative verifier model to select the most reliable one — without modifying the underlying policy.
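
The sample-then-verify loop described above can be sketched as follows. This is a minimal illustration, not the paper's code: select_action, toy_policy, and toy_verifier are hypothetical names, the policy is a random choice over a fixed action list, and the verifier is a hand-written scoring rule standing in for a trained generative verifier.

```python
import random
from typing import Callable, List


def select_action(
    sample_action: Callable[[random.Random], str],
    verifier_score: Callable[[str], float],
    num_candidates: int = 8,
    seed: int = 0,
) -> str:
    """Sample candidates from a frozen policy and return the top-scored one."""
    rng = random.Random(seed)
    candidates: List[str] = [sample_action(rng) for _ in range(num_candidates)]
    # The policy is never updated; only the choice among its samples changes.
    return max(candidates, key=verifier_score)


# Toy stand-ins (assumptions) for the policy and the verifier.
ACTIONS = ["open fridge", "pick up mug", "go to sink", "turn left"]


def toy_policy(rng: random.Random) -> str:
    return rng.choice(ACTIONS)


def toy_verifier(action: str) -> float:
    # A trained verifier would judge the action against observation and goal.
    return 1.0 if action == "pick up mug" else 0.1


chosen = select_action(toy_policy, toy_verifier, num_candidates=8, seed=0)
print(chosen)
```

The design choice worth noting is that the verifier only ranks what the policy already proposes, so it can help only when a good action appears among the sampled candidates.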

The key finding is that using an off-the-shelf MLLM as a verifier produces no improvement on its own. The gains come from a purpose-built training curriculum: an LLM-driven data synthesis pipeline that automatically generates diverse failure cases, exposing the verifier to a wide distribution of potential errors before deployment.

Across benchmarks in the Habitat and ALFRED embodied reasoning environments, VeGAS achieved up to a 36% relative performance gain over strong chain-of-thought baselines on the most demanding multi-object, long-horizon tasks. The authors frame this as evidence that verification quality, not verification presence, is what determines downstream robustness.
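
A relative gain is easy to misread as an absolute one, so a short worked example may help. The success rates below are hypothetical numbers chosen only to illustrate the arithmetic; they are not figures from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical success rates (assumption), not results reported by the authors.
baseline_success = 0.25
improved_success = 0.34

gain = relative_gain(baseline_success, improved_success)
absolute = improved_success - baseline_success
print(f"relative gain: {gain:.0%}, absolute gain: {absolute:.0%}")
```

With these made-up inputs, a 36% relative gain corresponds to only 9 percentage points of absolute improvement, which is why the baseline level matters when interpreting the headline figure.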

Principled Reasoning via Relational Recoding

A third paper, arXiv:2605.14036, takes a more foundational approach. Its authors argue that current LLMs produce fluent prose reliably but lack any principled basis for trusting the content of that prose — and that the conventional assumption that principled reasoning is computationally unaffordable is incorrect.

Their proposed fix is a two-stage pipeline:

Stage 1 (preprocessing): Input data is recoded into a “Unary Relational Integracode” — a representation that makes relationships between objects in the text explicit, rather than leaving them distributed across multiple references.
Stage 2 (training): A standard (or streamlined) machine learning process then learns to predict these explicit relationships, in addition to standard next-token objectives.
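
The two stages above can be illustrated with a toy example. To be clear, this is not the paper's Unary Relational Integracode, whose actual format is not described here. It is a hypothetical sketch of the general idea: relationships that are spread across pronoun references are rewritten as explicit triples, which then become prediction targets. The names recode and training_targets, the hand-written coreference map, and the triple format are all assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def recode(statements: List[Triple], coref: Dict[str, str]) -> List[Triple]:
    """Stage 1 (toy): rewrite statements so both arguments name an entity."""
    def resolve(mention: str) -> str:
        return coref.get(mention, mention)

    return [(resolve(s), rel, resolve(o)) for s, rel, o in statements]


def training_targets(triples: List[Triple]) -> List[Tuple[Tuple[str, str], str]]:
    """Stage 2 (toy): turn explicit triples into (entity pair, relation) examples."""
    return [((s, o), rel) for s, rel, o in triples]


# Source text: "Alice owns a dog. It chases a cat. She feeds it."
statements = [
    ("Alice", "owns", "dog"),
    ("it_1", "chases", "cat"),
    ("she_1", "feeds", "it_2"),
]
# Hand-written reference resolution (assumption; a real pipeline would infer it).
coref = {"it_1": "dog", "she_1": "Alice", "it_2": "dog"}

explicit = recode(statements, coref)
targets = training_targets(explicit)
print(explicit)
```

After recoding, the fact that Alice feeds the dog is stated in a single record instead of being recoverable only by tracing "she" and "it" back through earlier sentences.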

The authors describe this as realizing a world model that extends beyond natural language to vision and action domains. They connect the approach to Robust Logic, a framework for principled chaining on learned, uncertain information. A notable theoretical claim in the paper is that the recoding makes a core subset of relational rules polynomial-time learnable, with the polynomial depending on rule complexity. If that result holds under scrutiny, it would provide formal support for sound reasoning both within and across model calls.

How These Approaches Relate

All three research directions are responding to the same underlying problem: LLMs that generate plausible-sounding outputs without reliable internal verification of correctness. They differ in where they intervene.

RLMs intervene at the architectural level — changing how context flows between reasoning steps. VeGAS intervenes at inference time — adding a verification pass over candidate actions without touching the base policy. The relational recoding approach intervenes at training time — restructuring the data itself so the model learns to represent and predict relationships explicitly.

These approaches are not mutually exclusive. An RLM-style architecture could, in principle, incorporate a VeGAS-style verifier at each recursive step, and both could benefit from a base model trained on relationally recoded data. The field has not yet produced a unified framework that combines all three, but the appearance of these research threads in the same week suggests the problem of reasoning reliability is receiving sustained, parallel attention.

Chain-of-thought reasoning — the technique of prompting models to show intermediate steps — underpins all three approaches to varying degrees. VeGAS explicitly extends CoT by adding a verification layer on top of it. RLMs restructure how CoT-style intermediate outputs are passed between calls. The relational recoding paper proposes replacing some of what CoT does implicitly (tracking object relationships) with an explicit preprocessing step.

What This Means

The practical implication of these three works, taken together, is that reasoning improvements are moving away from prompting tricks and toward architectural, inference-time, and training-level interventions. The relative gain of up to 36% that VeGAS reports on hard embodied tasks is not a marginal improvement: it suggests that the verification step is doing substantial work that the base policy cannot do alone.

For developers building agentic systems today, the RLM findings are the most immediately actionable. The core insight — pass context by reference, not by value — is an architectural decision that can be made without retraining a base model. The gains on long-context benchmarks suggest this matters most for tasks that require many sequential reasoning steps over large information spaces.

The relational recoding paper is further from deployment but potentially more significant in the long run. If the polynomial-time learnability result is validated, it would provide a theoretical foundation for trusting model outputs on relational reasoning tasks — something the field currently lacks.

FAQ

What are Recursive Language Models?

Recursive Language Models are an agentic LLM architecture that passes context between reasoning steps by reference rather than replicating it at each call. According to Avishek Biswas’s Towards Data Science analysis, this design allows RLMs to maintain accuracy on long, multi-step tasks where frameworks like ReAct and CodeAct degrade.

How does VeGAS improve chain-of-thought reasoning?

VeGAS adds an explicit verification step at inference time: instead of committing to a single decoded action, the framework samples multiple candidate actions and uses a trained verifier model to select the most reliable one. The arXiv:2605.12620 paper reports up to a 36% relative performance gain over CoT baselines on hard embodied reasoning tasks in Habitat and ALFRED environments.

What is the difference between improving reasoning at inference time and at training time?

Inference-time reasoning improvements — like VeGAS’s verification step or chain-of-thought prompting — modify how a model behaves when generating outputs, without changing the model’s weights. Training-time approaches, like the relational recoding method proposed in arXiv:2605.14036, restructure the data or objectives the model learns from, aiming to build more reliable reasoning into the model’s parameters directly.

Sources

Recursive Language Models: An All-in-One Deep Dive – Towards Data Science
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents – arXiv AI
Enhanced and Efficient Reasoning in Large Learning Models – arXiv AI