Three separate research efforts published in May 2026 have each cut AI training or inference costs by 2x or more — without requiring new hardware. The techniques span multi-agent communication, pre-training efficiency, and video understanding, and together they signal a broad push to extract more performance from existing compute budgets.
RecursiveMAS Cuts Token Usage by 75% in Multi-Agent Systems
Researchers at the University of Illinois Urbana-Champaign and Stanford University published RecursiveMAS, a framework that replaces text-based communication between AI agents with direct embedding-space transmission. According to VentureBeat, the change produces a 2.4x increase in inference speed and a 75% reduction in token usage compared to standard multi-agent pipelines.
The core problem RecursiveMAS addresses is structural. In conventional multi-agent systems, each agent generates text and passes it to the next agent, which reads and re-encodes it. That loop accumulates latency and token costs at every step. By transmitting raw embeddings instead, agents skip the decode-then-re-encode cycle entirely.
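A toy sketch can make the decode-then-re-encode overhead concrete. It is not the RecursiveMAS implementation: the vocabulary, the embedding table `EMBED`, and the functions `text_handoff` and `latent_handoff` are all hypothetical, and real systems pass model hidden states rather than four-number lists.

```python
import random

random.seed(0)
DIM = 4
VOCAB = ["fever", "cough", "query", "result", "loop", "return"]
# Hypothetical embedding table: one small random vector per vocabulary word.
EMBED = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in VOCAB}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_handoff(hidden_states):
    """Conventional handoff: decode each state to a token, then re-encode it."""
    tokens = [max(VOCAB, key=lambda w: dot(EMBED[w], h)) for h in hidden_states]
    received = [EMBED[t] for t in tokens]  # the receiving agent re-embeds the text
    return received, len(tokens)  # one generated token per state


def latent_handoff(hidden_states):
    """Embedding-space handoff: pass the vectors through unchanged."""
    return [list(h) for h in hidden_states], 0  # no tokens are generated


states = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
text_vecs, text_tokens = text_handoff(states)
latent_vecs, latent_tokens = latent_handoff(states)
print(text_tokens, latent_tokens)  # prints "5 0"
```

In the toy, the text path also snaps each vector to the nearest vocabulary entry, so it loses information as well as spending tokens, while the latent path delivers the vectors exactly as produced.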
Performance gains held across three distinct domains: code generation, medical reasoning, and search tasks. The framework also proved cheaper to train than either full fine-tuning or LoRA, which matters for teams building custom multi-agent deployments where repeated retraining is expected.
The approach also addresses a longer-standing limitation of prompt-based multi-agent adaptation. Updating shared context through prompts can guide agent behavior, but leaves underlying model weights unchanged. RecursiveMAS enables end-to-end weight updates across the full agent network, making the system trainable as a single unit rather than a collection of independent models.
Nous Research’s Token Superposition Cuts Pre-Training Time by 2.5x
Nous Research released Token Superposition Training (TST), a pre-training method that reduces wall-clock training time at fixed compute without modifying model architecture, optimizer, tokenizer, parallelism strategy, or training data. The paper is available on arXiv.
The efficiency gains are most pronounced at scale. At the 10B-parameter mixture-of-experts level with a 1B active parameter configuration, TST consumed 4,768 B200-GPU-hours to reach a given training loss, versus 12,311 GPU-hours for a matched-FLOPs baseline. That is a reduction of roughly 2.5x in GPU-hours (12,311 / 4,768 ≈ 2.58). TST also reached a lower final training loss than the baseline, meaning the speed improvement did not come at the cost of model quality.
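The arithmetic behind that comparison is short enough to check directly. The sketch below uses only the two GPU-hour figures reported above; the `dollars_saved` helper is a hypothetical convenience that takes whatever hourly rate the reader supplies, since no price is given in the source.

```python
TST_GPU_HOURS = 4_768        # reported for TST
BASELINE_GPU_HOURS = 12_311  # reported for the matched-FLOPs baseline

speedup = BASELINE_GPU_HOURS / TST_GPU_HOURS
fraction_saved = 1 - TST_GPU_HOURS / BASELINE_GPU_HOURS
hours_saved = BASELINE_GPU_HOURS - TST_GPU_HOURS


def dollars_saved(usd_per_gpu_hour):
    """Savings at a caller-supplied hourly rate (no real price is assumed)."""
    return hours_saved * usd_per_gpu_hour


print(f"{speedup:.2f}x fewer GPU-hours, {fraction_saved:.1%} saved, {hours_saved} hours")
# prints "2.58x fewer GPU-hours, 61.3% saved, 7543 hours"
```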
The method has been validated across models ranging from 270M to 10B parameters, suggesting the gains are not specific to a single scale. Because TST requires no architectural changes, it can in principle be dropped into existing training pipelines without redesigning the model or rewriting infrastructure.
For organizations running large pre-training runs, even a 2x reduction in GPU-hours translates directly into cost savings at current cloud compute prices. The Nous Research release positions TST as a practical tool rather than a research curiosity, with public documentation available on the company’s site.
OpenAI’s Parameter Golf Surfaces Novel Training Techniques
OpenAI ran an eight-week open challenge called Parameter Golf, asking participants to minimize held-out loss on a fixed FineWeb dataset while keeping the entire artifact — model weights plus training code — under 16 MB. Training was capped at 10 minutes on 8×H100s.
According to OpenAI’s blog post, the challenge drew more than 1,000 participants and over 2,000 submissions. Techniques ranged from careful optimizer tuning and quantization to novel modeling ideas and test-time training approaches.
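To see why quantization featured so heavily, it helps to work out how many parameters fit under the 16 MB cap at different precisions. The sketch below is a rough upper bound under stated assumptions: it treats 16 MB as 16,000,000 bytes (the challenge may define it differently), and the 50,000-byte code size is a made-up figure for illustration.

```python
LIMIT_BYTES = 16 * 1000 * 1000  # assumption: 16 MB means 16,000,000 bytes


def max_params(bits_per_weight, code_bytes=0, limit_bytes=LIMIT_BYTES):
    """Upper bound on the parameter count that fits in the artifact.

    Ignores compression and per-tensor metadata such as quantization scales.
    """
    usable_bits = (limit_bytes - code_bytes) * 8
    return usable_bits // bits_per_weight


# Hypothetical 50,000 bytes of training code, at three weight precisions.
for bits in (16, 8, 4):
    print(bits, max_params(bits, code_bytes=50_000))
# prints:
# 16 7975000
# 8 15950000
# 4 31900000
```

Under these assumptions, halving the bits per weight doubles the parameter budget, which is one plausible reason low-precision formats were attractive to entrants.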
One notable finding from the competition was the widespread use of AI coding agents by participants. OpenAI noted that agents lowered the cost of running experiments and broadened participation, but also created new complications around submission review, attribution, and scoring.
What the Submissions Revealed
The constraint of 16 MB forced participants to be precise about every architectural and training decision — a pressure that surfaced techniques that might not emerge in unconstrained settings. OpenAI described the challenge as a talent discovery surface, noting that tightly scoped technical problems can reveal “machine learning taste and persistence” in ways that standard hiring processes may not.
The competition did not produce a single dominant technique, but the breadth of approaches — quantization, optimizer tuning, test-time training, and new modeling ideas — reflects how many levers researchers are currently pulling to improve training efficiency.
Perceptron Mk1 Prices Video AI at 80–90% Below Rivals
Startup Perceptron Inc. released Mk1, a proprietary video-analysis reasoning model priced at $0.15 per million input tokens and $1.50 per million output tokens via API. According to VentureBeat, that pricing sits 80–90% below comparable models such as Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro.
Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, told VentureBeat the company spent 16 months building a “multi-modal recipe” from scratch to handle the physical-world complexity of live video — including cause-and-effect reasoning, object dynamics, and physics-based inference.
The model targets enterprise use cases: security monitoring, marketing video analysis, body language assessment in controlled studies, and inconsistency detection in recorded content. A public demo is available.
The pricing strategy is notable because video analysis has historically been one of the more expensive multimodal inference tasks, partly due to the token volume involved in processing frame sequences. Mk1’s cost structure, if it holds at scale, removes a significant barrier for enterprise deployments that require continuous or high-volume video processing.
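A back-of-the-envelope estimate shows how the quoted per-token prices translate into a video workload. Only the two prices come from the reporting above; the sampling rate, the 256 tokens per frame, and the 2,000 output tokens are hypothetical values, and Mk1's actual tokenization of video is not described in the source.

```python
INPUT_USD_PER_M = 0.15   # quoted price per million input tokens
OUTPUT_USD_PER_M = 1.50  # quoted price per million output tokens


def api_cost_usd(input_tokens, output_tokens):
    """Cost of one request at the quoted Mk1 prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def video_input_tokens(seconds, frames_per_second, tokens_per_frame):
    """Hypothetical token count: each sampled frame costs a fixed number of tokens."""
    return round(seconds * frames_per_second * tokens_per_frame)


def implied_price_multiple(discount):
    """How many times more a rival charges if Mk1 is `discount` cheaper."""
    return 1 / (1 - discount)


# One hour of video sampled at 1 frame per second, 256 tokens per frame (assumed).
hour_tokens = video_input_tokens(3600, frames_per_second=1, tokens_per_frame=256)
print(hour_tokens, round(api_cost_usd(hour_tokens, 2_000), 4))
print(round(implied_price_multiple(0.8), 2), round(implied_price_multiple(0.9), 2))
```

An 80–90% discount implies rivals charge roughly 5 to 10 times as much for the same token volume, which is where continuous monitoring workloads would feel the difference most.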
What This Means
These four developments — RecursiveMAS, Token Superposition Training, Parameter Golf’s findings, and Perceptron Mk1 — share a common thread: efficiency gains achieved through training methodology and system design rather than raw hardware scaling.
The 2.4x and 2.5x speedups from RecursiveMAS and TST are particularly significant because they apply at different points in the model lifecycle. TST compresses pre-training time; RecursiveMAS compresses inference cost for deployed multi-agent systems. The two gains are complementary: a model trained faster with TST and deployed in a RecursiveMAS-style system could be substantially cheaper end-to-end than current approaches.
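A rough model clarifies how the two gains combine. Because they apply to separate cost buckets, the sketch below divides each bucket by its own speedup rather than multiplying the factors together. It assumes cost scales inversely with speed, which the source does not state, and the budget figures are arbitrary units chosen for illustration.

```python
def end_to_end_cost(train_cost, inference_cost, train_speedup, inference_speedup):
    """Total cost if each bucket shrinks in proportion to its own speedup."""
    return train_cost / train_speedup + inference_cost / inference_speedup


def overall_factor(train_cost, inference_cost, train_speedup=2.5, inference_speedup=2.4):
    """Ratio of the original total cost to the reduced total cost."""
    before = train_cost + inference_cost
    after = end_to_end_cost(train_cost, inference_cost, train_speedup, inference_speedup)
    return before / after


# Hypothetical budgets in arbitrary units: an equal split, then inference-heavy.
print(round(overall_factor(100, 100), 3))
print(round(overall_factor(10, 190), 3))
```

Under this model the combined saving always lands between 2.4x and 2.5x, weighted toward whichever bucket dominates the budget, so applying both techniques cuts each part of the bill rather than stacking into a larger multiplier.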
Parameter Golf’s results add a third dimension. The competition showed that tight resource constraints force researchers toward creative solutions that might otherwise go unexplored. OpenAI’s decision to run the challenge as an open contest rather than internal research suggests the lab sees external community experimentation as a meaningful source of architectural insight.
Perceptron’s pricing move reflects a separate but related pressure: as foundation model inference costs fall, the competitive moat for proprietary video AI narrows. An 80–90% price reduction, if matched by adequate accuracy, could move enterprise video AI from a niche capability to a standard infrastructure component.
Collectively, these advances suggest the current phase of AI development is as much about squeezing efficiency from existing architectures as it is about building larger ones.
FAQ
What is Token Superposition Training?
Token Superposition Training (TST) is a pre-training method from Nous Research that reduces the wall-clock time required to train large language models by up to 2.5x at fixed compute. It requires no changes to model architecture, optimizer, or training data, and has been validated on models from 270M to 10B parameters.
How does RecursiveMAS reduce token usage?
RecursiveMAS replaces text-based message passing between AI agents with direct embedding-space communication, eliminating the repeated decode-and-re-encode cycles that generate token overhead in standard multi-agent systems. In experiments by researchers at UIUC and Stanford, this cut token usage by 75% and increased inference speed by 2.4x.
What was OpenAI’s Parameter Golf challenge?
Parameter Golf was an eight-week open machine learning competition where participants minimized held-out loss on the FineWeb dataset while keeping their entire submission — model weights and training code — under 16 MB, with a 10-minute training cap on 8×H100s. OpenAI received over 2,000 submissions from more than 1,000 participants and used the results to identify novel training techniques and engineering talent.
Sources
- How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75% – VentureBeat
- What Parameter Golf taught us about AI-assisted research – OpenAI Blog
- Cerebras stock nearly doubles on day one as AI chipmaker hits $100 billion — what it means for AI infrastructure – VentureBeat
- Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models – Reddit Singularity
- Perceptron Mk1 shocks with highly performant video analysis AI model 80-90% cheaper than Anthropic, OpenAI & Google – VentureBeat