Two startups this week unveiled AI models claiming dramatic efficiency improvements over current transformer architectures: Zyphra’s ZAYA1-8B delivers competitive performance with only 760 million active parameters, while Subquadratic’s SubQ model promises 1,000x efficiency gains through subquadratic scaling.
Zyphra’s ZAYA1-8B Delivers Competitive Performance on AMD Hardware
Palo Alto-based Zyphra released ZAYA1-8B, an 8-billion-parameter mixture-of-experts (MoE) model with only 760 million active parameters. According to VentureBeat, the model performs competitively against GPT-5-High and DeepSeek-V3.2 on third-party benchmarks despite activating far fewer parameters than the trillions estimated for leading models.
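To make the "active versus total parameters" distinction concrete, here is a toy top-2 mixture-of-experts layer in PyTorch. It is purely illustrative: the expert count, top-k value, and layer sizes are arbitrary assumptions and say nothing about ZAYA1-8B's actual routing configuration.

```python
# Toy top-2 mixture-of-experts layer illustrating "active" vs "total" parameters.
# All sizes are arbitrary; this does not reflect ZAYA1-8B's real architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 32, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router picks k experts per token; only those
        # experts' weights participate in the computation for that token.
        scores = self.router(x)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                hit = idx[:, slot] == e
                if hit.any():
                    out[hit] += weights[hit, slot, None] * self.experts[e](x[hit])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 32)).shape)                      # forward pass works as usual
total = sum(p.numel() for p in moe.experts.parameters())
active = total * moe.k // len(moe.experts)                # rough per-token expert budget
print(f"total expert params: {total:,}  ~active per token: {active:,}")
```

The total parameter count sets memory requirements, but per-token compute scales with the much smaller active subset, which is the efficiency argument behind sparse MoE designs.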
The model was trained entirely on AMD Instinct MI300 GPUs, demonstrating that AMD’s hardware can serve as a viable alternative to NVIDIA’s dominant position in AI model development. ZAYA1-8B is available on Hugging Face under an Apache 2.0 license and can be tested through Zyphra Cloud.
Zyphra describes the model’s efficiency as “intelligence density” achieved through “full-stack innovation” spanning architecture, training techniques, and hardware optimization. The company’s approach follows a trend toward smaller, more efficient models as an alternative to the race for ever-larger parameter counts among major AI labs.
Subquadratic Claims 1,000x Efficiency with Controversial SubQ Architecture
Miami-based Subquadratic emerged from stealth claiming its SubQ 1M-Preview model is the first large language model built on a fully subquadratic architecture. According to VentureBeat, the company says its architecture reduces attention compute by almost 1,000 times compared to frontier models at 12 million tokens of context.
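Back-of-envelope arithmetic shows why attention cost becomes the bottleneck at that context length. The sketch below is purely illustrative: it assumes a hypothetical linear-cost scheme with a fixed per-token budget and is not based on any disclosed SubQ internals.

```python
# Illustrative attention-cost arithmetic at 12M tokens of context.
# Standard self-attention scores every token pair, so per-layer, per-head cost
# grows with n**2. The "effective window" below is a hypothetical knob chosen
# for illustration, not anything Subquadratic has published.

n = 12_000_000                     # context length in tokens
quadratic_pairs = n * n            # pairwise scores for full attention

effective_window = 12_000          # hypothetical per-token budget for a linear-cost scheme
subquadratic_pairs = n * effective_window

print(f"full attention:       {quadratic_pairs:.2e} pairwise scores")
print(f"windowed/linear cost: {subquadratic_pairs:.2e} pairwise scores")
print(f"reduction factor:     {quadratic_pairs / subquadratic_pairs:,.0f}x")
# A 12,000-token budget per position happens to yield exactly a 1,000x reduction
# at this context length, which is why such claims scale with context size.
```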
The startup raised $29 million in seed funding at a $500 million valuation from investors including Tinder co-founder Justin Mateen and former SoftBank Vision Fund partner Javier Villamizar. Subquadratic launched three products into private beta: an API, a coding agent called SubQ Code, and a search tool called SubQ Search.
However, the AI research community has responded with skepticism to Subquadratic’s claims. Researchers have demanded independent verification of the efficiency gains, with some comparing the extraordinary claims to previous technology frauds. The company has not yet provided peer-reviewed research or independent benchmarks to validate its architectural innovations.
Google Advances Gemma 4 with Multi-Token Prediction
Google released Multi-Token Prediction (MTP) drafters for its Gemma 4 family, achieving up to 3x speedup through speculative decoding. The technique pairs heavy target models with lightweight drafters to predict multiple future tokens simultaneously, addressing the memory-bandwidth bottlenecks that limit standard LLM inference.
According to Google’s testing on LiteRT-LM, MLX, Hugging Face Transformers, and vLLM, the MTP drafters deliver significant tokens-per-second improvements without degrading output quality or reasoning. The approach uses otherwise idle compute to draft several tokens per target-model step rather than generating them strictly one at a time.
Speculative decoding represents a practical solution to the technical reality that LLM inference is memory-bandwidth bound, with processors spending most time moving parameters from VRAM to compute units for single token generation. Google’s implementation demonstrates how architectural innovations can improve efficiency without requiring new hardware.
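For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the draft-then-verify control flow behind speculative decoding. It uses toy stand-in models rather than Gemma 4 or Google's actual MTP drafters, and drafts sequentially for clarity.

```python
# Minimal greedy speculative-decoding loop with toy stand-in models.
# Real MTP drafters predict several future tokens in one forward pass; the
# callables here exist only to show the draft-then-verify control flow.
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       drafter: Callable[[List[int]], int],
                       prompt: List[int],
                       k: int = 4,
                       max_new: int = 16) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Drafter cheaply proposes k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        # 2) Target checks each drafted position; in practice one parallel
        #    target pass scores all k positions, which is where the speedup comes from.
        accepted = []
        for i, tok in enumerate(draft):
            if target(seq + draft[:i]) == tok:
                accepted.append(tok)
            else:
                break
        # 3) Keep the matching prefix, then append the target's own next token
        #    so every step makes progress even when the drafter misses.
        seq += accepted
        seq.append(target(seq))
    return seq[:len(prompt) + max_new]

# Toy models: the target repeats a fixed pattern; the drafter guesses it
# correctly most of the time, so several tokens are accepted per step.
pattern = [1, 2, 3, 4]
target_model = lambda s: pattern[len(s) % len(pattern)]
draft_model = lambda s: pattern[len(s) % len(pattern)] if len(s) % 7 else 0

print(speculative_decode(target_model, draft_model, prompt=[1, 2], k=4, max_new=8))
```

Because the target model verifies every drafted token, the accepted output matches what it would have produced on its own; the gain comes from amortizing the expensive target pass over several positions at once.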
Timer-XL Brings Foundation Model Approach to Time Series
Researchers from Tsinghua University’s THUML lab released Timer-XL, a decoder-only transformer foundation model for time-series forecasting that handles variable input and output lengths in a unified architecture. The model introduces TimeAttention, an attention mechanism designed specifically for long-context forecasting scenarios.
Timer-XL supports non-stationary univariate series, complex multivariate dynamics, and covariate-informed contexts with exogenous variables. Unlike models such as Tiny-Time-Mixers that require different versions for different input or output lengths, Timer-XL uses a single model for all cases without assumptions about context or prediction length.
The research builds on the team’s previous work including iTransformer, TimesNet, and the original Timer model. Timer-XL can be trained from scratch or pretrained on large datasets, with optional fine-tuning for improved performance on specific forecasting tasks.
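To illustrate why a decoder-only, next-patch objective removes the need for fixed context or horizon lengths, here is a small PyTorch sketch. It is not the Timer-XL implementation and does not include TimeAttention; it only demonstrates the general patch-and-roll-out idea under assumed, arbitrary hyperparameters.

```python
# Illustrative decoder-only forecaster over fixed-size patches (PyTorch).
# NOT the Timer-XL model or its TimeAttention mechanism; this only sketches why
# an autoregressive next-patch objective handles arbitrary input/output lengths.
import torch
import torch.nn as nn

PATCH = 24  # patch length; any context that is a multiple of PATCH works

class PatchDecoder(nn.Module):
    def __init__(self, d_model: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(PATCH, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, PATCH)  # predicts the next patch

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, PATCH); causal mask gives decoder-only behavior
        n = patches.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(patches), mask=mask)
        return self.head(h)  # next-patch prediction at every position

    @torch.no_grad()
    def forecast(self, series: torch.Tensor, horizon_patches: int) -> torch.Tensor:
        # Roll out autoregressively: any horizon, no resizing or retraining needed.
        ctx = series.reshape(1, -1, PATCH)
        outputs = []
        for _ in range(horizon_patches):
            next_patch = self(ctx)[:, -1:, :]
            outputs.append(next_patch)
            ctx = torch.cat([ctx, next_patch], dim=1)
        return torch.cat(outputs, dim=1).reshape(-1)

model = PatchDecoder()
history = torch.sin(torch.linspace(0, 12, 10 * PATCH))   # 240-step context
print(model.forecast(history, horizon_patches=3).shape)  # -> torch.Size([72])
```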
What This Means
These developments highlight three distinct approaches to improving AI efficiency: architectural innovation through mixture-of-experts (Zyphra), fundamental mathematical scaling improvements (Subquadratic), and inference optimization through speculative decoding (Google). While Zyphra and Google have provided concrete benchmarks and open-source implementations, Subquadratic’s extraordinary claims await independent verification.
The success of ZAYA1-8B on AMD hardware demonstrates that competitive AI models can be developed outside NVIDIA’s ecosystem, potentially reducing hardware concentration risks as AI deployment scales. However, the mixed reception to Subquadratic’s claims underscores the importance of peer review and independent validation in AI research, particularly for breakthrough efficiency claims.
The emergence of specialized architectures for time-series forecasting with Timer-XL suggests that foundation model approaches are expanding beyond language tasks into domain-specific applications, potentially offering better performance than general-purpose models for specialized use cases.
FAQ
What makes ZAYA1-8B more efficient than larger models?
ZAYA1-8B uses a mixture-of-experts architecture that activates only 760 million of its 8 billion total parameters for any given token, allowing it to achieve competitive performance while using significantly less compute during inference than dense models that activate all of their parameters, let alone frontier models estimated at trillions of parameters.
Why are researchers skeptical of Subquadratic’s efficiency claims?
Subquadratic claims 1,000x efficiency improvements and fully subquadratic scaling, which would represent a fundamental breakthrough in AI architecture. However, the company has not provided peer-reviewed research or independent benchmarks to validate these extraordinary claims.
How does Google’s Multi-Token Prediction improve inference speed?
MTP uses speculative decoding with lightweight “drafter” models to predict multiple tokens simultaneously while a larger “target” model verifies the predictions, utilizing idle compute to overcome memory-bandwidth bottlenecks that limit traditional sequential token generation.