Ethics & Society

AI Reasoning Models Show Length-Driven Position Bias Despite Extended Deliberation

New research reveals that chain-of-thought reasoning models exhibit increased position bias as their thinking processes grow longer, challenging assumptions that more deliberate reasoning reduces cognitive shortcuts. A study published this week on arXiv found that 12 of 13 reasoning-capable AI configurations showed a positive correlation between trajectory length and position bias, with correlations ranging from 0.11 to 0.41.

According to the research, models like DeepSeek-R1 and other reasoning-tuned systems demonstrate “length-driven position bias” where longer reasoning chains paradoxically lead to more biased decision-making in multiple-choice questions. The findings contradict the common belief that extended deliberation reduces shallow heuristic biases.

Chain-of-Thought Reasoning Under Scrutiny

The study examined reasoning models across the MMLU, ARC-Challenge, and GPQA benchmarks, testing configurations from 7B-8B-parameter models up to the 671B-parameter DeepSeek-R1. Truncation experiments provided causal evidence: when reasoning was resumed from later points in trajectories, models became 16% to 32% more likely to shift toward position-preferred options.
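The paper's exact truncation protocol is not reproduced here, but the counting logic behind such a probe is straightforward. A minimal sketch, assuming each record pairs the answer chosen from the full trajectory with the answer chosen after resuming from a truncation point (the `switch_rate` helper and the data are hypothetical):

```python
def switch_rate(records, preferred_position):
    """Fraction of eligible items where resuming reasoning from a
    later truncation point shifted the answer toward the
    position-preferred option. Each record is a pair of option
    indices: (answer from full trajectory, answer after resuming).
    Items that already chose the preferred position are excluded."""
    switched = sum(1 for full, resumed in records
                   if full != preferred_position
                   and resumed == preferred_position)
    eligible = sum(1 for full, _ in records
                   if full != preferred_position)
    return switched / eligible if eligible else 0.0

# Hypothetical data: 10 items, position-preferred option is index 0.
records = [(1, 0), (2, 2), (3, 0), (1, 1), (0, 0),
           (2, 0), (3, 3), (1, 1), (2, 2), (3, 0)]
print(round(switch_rate(records, preferred_position=0), 2))  # → 0.44
```

A rate well above zero on real trajectories would indicate, as the study reports, that later resumption points causally pull answers toward preferred positions.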

Key findings include:

  • All 12 open-weight reasoning configurations showed monotonically increasing position bias across length quartiles
  • Direct-answer position bias operates as a distinct phenomenon with different patterns
  • Chain-of-thought reasoning replaces baseline bias with accumulated bias over longer sequences

The research suggests that reasoning-capable models “should not be treated as order-robust by default in MCQ evaluation pipelines,” according to the study authors. At 671B parameters, aggregate position bias collapsed to 0.019, but length effects persisted in the longest quartile at 0.071.
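The quartile trend the authors describe can be audited with a simple grouping. A sketch under the assumption that each item carries a trajectory length and a flag marking whether its answer matched the position-preferred option (the function name and data are illustrative, not from the paper):

```python
import statistics

def quartile_bias_trend(lengths, biased_flags):
    """Split items into trajectory-length quartiles and report the
    rate of position-preferred answers in each bin; a monotonically
    increasing sequence indicates length-driven position bias."""
    cuts = statistics.quantiles(lengths, n=4)  # three quartile cut points
    bins = [[], [], [], []]
    for length, flag in zip(lengths, biased_flags):
        idx = sum(length > cut for cut in cuts)  # which quartile bin
        bins[idx].append(flag)
    return [round(sum(b) / len(b), 3) if b else None for b in bins]

# Synthetic data where only longer traces drift toward the
# preferred position:
lengths = list(range(1, 101))              # trajectory lengths 1..100
flags = [length > 60 for length in lengths]
print(quartile_bias_trend(lengths, flags))  # → [0.0, 0.0, 0.6, 1.0]
```

On real model outputs, a pattern like the synthetic one above, low bias in short quartiles and rising bias in long ones, mirrors the monotonic increase the study reports.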

Creative Reasoning Challenges Persist

Separate research introduced CreativityBench, a new benchmark evaluating affordance-based creativity in large language models. The benchmark tests whether AI systems can repurpose available objects by reasoning about their physical properties rather than relying on conventional usage patterns.

The CreativityBench study built a knowledge base with 4,000 entities and over 150,000 affordance annotations, generating 14,000 grounded tasks requiring non-obvious yet physically plausible solutions. Results across 10 state-of-the-art models showed significant performance gaps in creative tool use.

Models frequently selected plausible objects but failed to identify correct parts, their affordances, and underlying physical mechanisms needed to solve tasks. The research found that model scaling improvements quickly saturated and that strong general reasoning abilities don’t reliably translate to creative affordance discovery.

Efficient Reasoning Models Emerge

While major labs pursue ever-larger models, Palo Alto startup Zyphra released ZAYA1-8B, a mixture-of-experts reasoning model with just 8 billion parameters and only 760 million active parameters. According to VentureBeat, the model maintains competitive performance against much larger systems while being trained entirely on AMD Instinct MI300 GPUs.

ZAYA1-8B demonstrates “intelligence density” through what Zyphra calls a “full-stack innovation” approach. The model is available under Apache 2.0 licensing and can be downloaded from Hugging Face immediately. Individual users can test the model through Zyphra Cloud’s inference platform.

The development indicates that AMD’s MI300 platform can produce viable alternatives to NVIDIA-trained models, potentially disrupting the GPU market’s current dynamics and marking a significant validation of AMD’s position in AI infrastructure competition.

Position Bias Diagnostic Tools

Researchers developed a diagnostic toolkit for auditing position bias in reasoning models, including Position Bias Score (PBS), commitment change point analysis, effective switching metrics, and truncation probes. These tools enable systematic evaluation of how reasoning length affects model reliability.

The study found that accuracy can gate the expression of length-driven bias rather than eliminating underlying mechanisms. Even highly capable models like DeepSeek-R1 at 671B parameters showed residual bias effects in their longest reasoning trajectories, suggesting fundamental limitations in current reasoning approaches.

Implications for model evaluation:

  • Multiple-choice question evaluation pipelines need bias auditing
  • Longer reasoning doesn’t guarantee better decision-making
  • Position bias scales predictably with trajectory length
  • Truncation experiments can reveal causal bias mechanisms

What This Means

These findings challenge core assumptions about AI reasoning capabilities and evaluation methods. The discovery that longer chain-of-thought processes increase rather than decrease certain biases suggests current reasoning paradigms may have fundamental limitations.

For AI developers, the research highlights the need for more sophisticated bias detection and mitigation strategies. Position bias auditing should become standard practice in model evaluation, particularly for reasoning-capable systems used in high-stakes applications.

The emergence of efficient models like ZAYA1-8B demonstrates that reasoning capabilities don’t require massive parameter counts, potentially democratizing access to advanced AI reasoning. However, creative problem-solving remains a significant challenge across all model sizes, indicating important gaps in current approaches to artificial general intelligence.

The combination of bias persistence and creative reasoning limitations suggests that while AI systems excel at pattern recognition and structured reasoning, they still struggle with the flexible, context-aware thinking that characterizes human intelligence.

FAQ

What is position bias in AI reasoning models?
Position bias occurs when AI models systematically favor certain answer positions in multiple-choice questions regardless of content. The research found this bias increases as reasoning chains get longer, contradicting assumptions that more deliberate thinking reduces such shortcuts.

How does chain-of-thought reasoning affect model bias?
Contrary to expectations, longer chain-of-thought reasoning trajectories correlate with increased position bias. Models become 16-32% more likely to shift toward position-preferred options when reasoning is resumed from later points in their thinking process.

What makes ZAYA1-8B different from larger reasoning models?
ZAYA1-8B achieves competitive reasoning performance with only 8 billion parameters and 760 million active parameters, a fraction of the parameter counts of much larger systems. It was trained entirely on AMD GPUs and is available under permissive open-source licensing, making advanced reasoning more accessible.

Sources

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.