Three research teams published distinct efficiency techniques this month that collectively attack the cost of building and running large AI systems from different angles: cutting pre-training time by up to 2.5x, speeding up multi-agent inference by 2.4x, and proposing a memory architecture that could eventually replace the KV cache entirely. The work spans Nous Research, University of Illinois Urbana-Champaign, Stanford, and independent researchers exploring post-transformer designs.
Token Superposition Training Slashes Pre-Training Time by 2.5x
Nous Research released Token Superposition Training (TST), a pre-training method that reduces wall-clock time at fixed compute without modifying model architecture, optimizer, tokenizer, parallelism strategy, or training data. The technique works by superimposing multiple token representations during training, allowing the model to process more information per gradient step.
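The article does not spell out how the superposition is computed. As a rough illustration only, one hypothetical reading is that the embeddings of several consecutive tokens are averaged into a single input position, so each training step covers more text with a shorter sequence. The function name `superpose`, the group size, and the toy embedding table below are all assumptions for this sketch, not details from Nous Research.

```python
def superpose(embeddings, group_size):
    """Average each run of `group_size` consecutive embedding vectors into one.

    Hypothetical illustration of "superimposing" token representations;
    not taken from the TST paper.
    """
    if len(embeddings) % group_size != 0:
        raise ValueError("sequence length must be divisible by group_size")
    merged = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        # Element-wise mean across the vectors in this group.
        merged.append([sum(dim) / group_size for dim in zip(*group)])
    return merged


# Toy embedding table (assumed): token id -> 2-dimensional vector.
table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [3.0, 1.0]}
tokens = [0, 1, 2, 3]

merged = superpose([table[t] for t in tokens], group_size=2)
print(len(tokens), "tokens ->", len(merged), "superposed positions")
print(merged)
```

Under this reading, four tokens collapse into two input positions, which is where a per-step saving could come from. How the training target is built for a superposed position is left open here.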
According to the TST paper on arXiv, at the 10B-parameter mixture-of-experts scale, TST consumed 4,768 B200-GPU-hours to reach a lower final training loss than a matched-FLOPs baseline that required 12,311 GPU-hours, a reduction of roughly 2.5x in GPU-hours. The gains held across model sizes from 270M to 10B parameters, suggesting the method scales predictably rather than delivering one-off wins at a specific size.
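The reported GPU-hour figures can be checked with a few lines of arithmetic. The snippet below uses only the two numbers quoted above; the variable names are illustrative.

```python
# GPU-hour figures as quoted from the TST paper.
tst_gpu_hours = 4_768
baseline_gpu_hours = 12_311

# How many times fewer GPU-hours TST used than the baseline.
reduction_factor = baseline_gpu_hours / tst_gpu_hours

# TST's GPU-hours as a share of the baseline's.
fraction_of_baseline = tst_gpu_hours / baseline_gpu_hours

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"share of baseline GPU-hours: {fraction_of_baseline:.0%}")
```

The ratio works out to about 2.58x, meaning the TST run used roughly 39% of the baseline's GPU-hours.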
The practical implication is significant: at a 2.5x reduction, pre-training runs that currently take weeks on large GPU clusters could complete in roughly 40% of the calendar time on the same hardware. Because TST requires no changes to the underlying architecture or data pipeline, teams could slot it into existing training infrastructure without redesigning their stack. Nous Research has not yet disclosed whether the method generalizes to dense transformers beyond mixture-of-experts configurations, which remains an open question for follow-up work.
RecursiveMAS Moves Multi-Agent Communication Into Embedding Space
Researchers at the University of Illinois Urbana-Champaign and Stanford University published RecursiveMAS, a framework that replaces text-based inter-agent communication with direct embedding-space transmission. Standard multi-agent systems generate natural language tokens to pass information between agents — a process that adds latency, inflates token costs, and makes end-to-end training difficult because the text bottleneck is non-differentiable.
RecursiveMAS removes that bottleneck. According to VentureBeat’s coverage, experiments show the framework delivers a 2.4x increase in inference speed and a 75% reduction in token usage compared to text-communicating baselines, while improving accuracy on code generation, medical reasoning, and search tasks.
The training cost advantage is also notable. RecursiveMAS is cheaper to train than standard full fine-tuning or LoRA methods, the researchers reported, because gradients can flow through the embedding connections rather than being blocked by discrete token sampling. That makes it feasible to train a coordinated multi-agent system as a single unit rather than optimizing each agent independently and hoping the interactions generalize.
Why Text-Based Agent Communication Is a Bottleneck
Current multi-agent pipelines face two distinct adaptation problems. Prompt-based adaptation, which iteratively refines the shared context given to agents, improves coordination but leaves the underlying model weights unchanged, which limits how far the system can improve. Weight-based training updates the models themselves but is expensive and technically difficult when agents must communicate through discrete text, since backpropagating through token generation requires workarounds like reinforcement learning or straight-through estimators.
RecursiveMAS sidesteps both constraints by keeping communication entirely in continuous vector space, where standard gradient descent applies directly. The researchers describe it as a scalable blueprint for custom multi-agent systems, though independent replication of the benchmark results has not yet been reported.
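A toy numerical example can make the gradient argument concrete. The sketch below is not RecursiveMAS itself: the two "agents" are one-line scalar functions, the three-entry vocabulary is made up, and gradients are estimated by finite differences rather than backpropagation. It only shows that snapping a value to a discrete token flattens the gradient to zero, while passing the continuous value through keeps it.

```python
def agent_a(w, x):
    """Agent A's hidden state, reduced to a single scalar for illustration."""
    return w * x


def text_channel(h, vocab=(-1.0, 0.0, 1.0)):
    """Discretise h to the nearest vocabulary entry, like emitting a token."""
    return min(vocab, key=lambda v: abs(v - h))


def agent_b(message):
    """Agent B's output given whatever message it receives."""
    return 2.0 * message + 0.5


def finite_diff(f, w, eps=1e-4):
    """Central-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


x = 0.8


def via_text(w):
    # A -> discrete token -> B
    return agent_b(text_channel(agent_a(w, x)))


def via_embedding(w):
    # A -> continuous value -> B
    return agent_b(agent_a(w, x))


grad_text = finite_diff(via_text, w=1.0)
grad_embed = finite_diff(via_embedding, w=1.0)

print("gradient through text channel:     ", grad_text)
print("gradient through embedding channel:", grad_embed)
```

In the text path, a small change to `w` does not change which token is chosen, so agent B's output does not move and agent A's weight receives no learning signal. In the embedding path, the same change propagates straight through.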
The BDH Architecture: Memory in Weights, Not KV Cache
A separate line of research, surfaced in a Reddit Singularity thread, examines a post-transformer architecture proposed by researcher Jan Chorowski that relocates memory from the KV cache into the network weights themselves.
The core critique of the transformer’s memory model is what the thread author calls “anterograde amnesia”: transformers store pre-training knowledge in static weights and handle session context through a KV cache that grows linearly with sequence length. At long contexts, the KV cache becomes a memory and compute bottleneck — a well-documented scaling problem for production inference.
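The linear growth is easy to quantify. The sketch below uses the standard size estimate for a KV cache (one key and one value vector per token, per layer, per key-value head); the model shape is a hypothetical configuration chosen for round numbers, not any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor of 2: one key vector and one value vector per token,
    # per layer, per key-value head.
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)


# Hypothetical model shape, for illustration only (16-bit cache values).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

sizes_gib = {n: kv_cache_bytes(n, **config) / 2**30
             for n in (8_192, 32_768, 131_072)}

for n, gib in sizes_gib.items():
    print(f"{n:>7} tokens -> {gib:.0f} GiB per sequence")
```

With this assumed shape, the cache costs 128 KiB per token, so a 131,072-token context needs 16 GiB for a single sequence before counting the model weights.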
Chorowski’s proposed architecture, referred to as BDH, reframes attention by setting keys and queries equal to full neuron activations in high-dimensional space. Memory retrieval then becomes graph propagation through an accumulated connectivity matrix (sigma), rather than dot-product comparison of compressed key-query vectors. The thread author notes Chorowski’s explicit argument that linearizing attention — as standard state space models do — is insufficient on its own: “You cannot swap basically a non-linear attention layer for a linear attention layer and change nothing else in the model.”
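To give a feel for what "retrieval as propagation through an accumulated matrix" could mean, here is a deliberately simplified sketch. It assumes an outer-product update to sigma and a single matrix-vector product for recall; neither detail is confirmed by the thread, and the helper names `accumulate` and `propagate` are invented for this example. It should be read as a generic fast-weights illustration, not as the BDH update rule.

```python
def accumulate(sigma, value, key):
    """Add the outer product of value and key into sigma, in place.

    Assumed Hebbian-style update, used here for illustration only.
    """
    for i, v in enumerate(value):
        for j, k in enumerate(key):
            sigma[i][j] += v * k


def propagate(sigma, query):
    """One propagation step through the connectivity matrix: sigma @ query."""
    return [sum(s * q for s, q in zip(row, query)) for row in sigma]


n = 4  # number of "neurons" in this toy example
sigma = [[0.0] * n for _ in range(n)]

# Store two associations using orthogonal one-hot activation patterns.
accumulate(sigma, value=[0.0, 1.0, 0.0, 0.0], key=[1.0, 0.0, 0.0, 0.0])
accumulate(sigma, value=[0.0, 0.0, 0.0, 1.0], key=[0.0, 0.0, 1.0, 0.0])

# Querying with the first key recovers the first stored value.
recalled = propagate(sigma, [1.0, 0.0, 0.0, 0.0])
print(recalled)
```

The point of the sketch is the storage cost: sigma stays at n by n entries no matter how many associations are written into it, whereas a KV cache adds an entry for every token.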
BDH is exploratory and has not been independently benchmarked at scale. The framing is conceptually distinct from SSMs like Mamba, which trade attention’s quadratic complexity for linear recurrence but do not fundamentally rethink where long-term memory resides.
OpenAI’s Parameter Golf: Efficiency Under Extreme Constraints
OpenAI ran a public machine learning challenge called Parameter Golf that produced a different kind of efficiency signal. Participants had to minimize held-out loss on a fixed FineWeb dataset while staying within a 16 MB artifact limit — covering both model weights and training code — with a 10-minute training budget on 8×H100s.
According to OpenAI’s blog post, the contest drew more than 1,000 participants and over 2,000 submissions across eight weeks. Winning approaches included careful optimizer tuning, aggressive quantization, and test-time training — techniques that squeeze performance from severely constrained budgets rather than scaling up.
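A quick back-of-the-envelope calculation shows why quantization matters under a cap like this. The sketch below assumes "16 MB" means 16,000,000 bytes and ignores compression and per-tensor metadata such as quantization scales; the `code_bytes` parameter is a hypothetical stand-in for however much of the budget the training code consumes.

```python
# Assumption: "16 MB" read as decimal megabytes.
ARTIFACT_LIMIT_BYTES = 16 * 10**6


def max_parameters(bits_per_weight, code_bytes=0,
                   limit_bytes=ARTIFACT_LIMIT_BYTES):
    """Upper bound on parameter count at a given weight precision.

    Ignores compression and metadata such as quantization scales.
    """
    weight_bytes = limit_bytes - code_bytes
    return (weight_bytes * 8) // bits_per_weight


budget = {bits: max_parameters(bits) for bits in (32, 16, 8, 4)}

for bits, params in budget.items():
    print(f"{bits:>2}-bit weights -> at most {params / 1e6:.0f}M parameters")
```

Under these assumptions, dropping from 16-bit to 4-bit weights raises the ceiling from 8 million to 32 million parameters within the same artifact size.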
OpenAI noted that AI coding agents played a meaningful role in the competition, lowering the barrier to experimentation and accelerating iteration cycles. The contest also functioned as a talent discovery surface, with OpenAI explicitly stating that identifying exceptional machine learning researchers was one of its goals. The 16 MB constraint forced participants to think about parameter efficiency in ways that large-scale training rarely demands, producing insights applicable to edge deployment and on-device inference.
What This Means
Taken together, these four research threads point toward a field actively attacking AI’s cost structure from multiple directions simultaneously — and making measurable progress on all of them.
TST and RecursiveMAS are the most immediately deployable: both report concrete speedup numbers on real hardware, and neither requires architectural changes that would demand retraining existing models from scratch. TST targets the pre-training phase, where GPU costs are largest in absolute terms. RecursiveMAS targets inference, where token costs accumulate continuously in production.
The BDH architecture and Parameter Golf represent longer time horizons. BDH, if it scales, would address a structural limitation of the transformer — the KV cache’s linear memory growth — rather than optimizing around it. Parameter Golf’s 16 MB constraint is artificial, but the techniques it surfaced (aggressive quantization, test-time training, optimizer tuning) are directly relevant to edge AI deployment, a market that NVIDIA, Qualcomm, and Apple are all competing to own.
The common thread is that efficiency is no longer a secondary concern addressed after capability. Researchers are now designing training methods, communication protocols, and memory architectures with cost as a first-order objective — a shift that reflects both the maturation of the field and the practical limits of scaling compute indefinitely.
Cerebras’ IPO this week at a $100 billion market cap — built on a chip optimized specifically for fast inference — underscores that the market has already priced in the assumption that inference efficiency will be a durable competitive differentiator, not a solved problem.
FAQ
What is Token Superposition Training?
Token Superposition Training (TST) is a pre-training method from Nous Research that reduces training wall-clock time by up to 2.5x at fixed compute. It works by superimposing multiple token representations during training without changing the model architecture, optimizer, or data pipeline.
How does RecursiveMAS differ from standard multi-agent AI systems?
Standard multi-agent systems pass information between agents as generated text, which adds latency and blocks end-to-end gradient flow. RecursiveMAS routes inter-agent communication through embedding space instead, enabling joint training and delivering a reported 2.4x inference speedup and 75% token reduction.
What is the KV cache and why is it a problem for LLMs?
The KV cache stores key-value pairs from previous tokens so the model does not recompute attention from scratch at each step. It grows linearly with sequence length, consuming increasing GPU memory and slowing inference at long contexts — a scaling bottleneck that architectures like BDH aim to address by moving memory into network weights instead.
Sources
- How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75% – VentureBeat
- What Parameter Golf taught us about AI-assisted research – OpenAI Blog
- Cerebras stock nearly doubles on day one as AI chipmaker hits $100 billion — what it means for AI infrastructure – VentureBeat
- The interesting BDH question: What if LLM memory lived in the network weights instead of the ever-growing KV cache? – Reddit Singularity
- Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models – Reddit Singularity






