Reasoning Models Show Promise But Face Bias, Cost Challenges

AI reasoning models that use chain-of-thought processing are advancing rapidly, but new research reveals significant limitations in creativity, position bias, and computational costs that could impact their real-world deployment.

Creative Problem-Solving Remains Limited

Despite strong performance on traditional benchmarks, large language models struggle with creative tool use and non-obvious problem-solving. According to research published on arXiv, scientists created CreativityBench to evaluate “affordance-based creativity” — the ability to repurpose objects based on their physical properties rather than conventional uses.

The benchmark includes 4,000 entities and over 150,000 affordance annotations, generating 14,000 tasks that require identifying unconventional but physically plausible solutions. Evaluations across 10 state-of-the-art LLMs showed that models could often select appropriate objects but frequently failed to identify the correct parts, their affordances, and the underlying physical mechanisms.

Key findings from the creativity research:

  • Performance improvements from model scaling quickly saturate
  • Strong general reasoning doesn’t translate to creative affordance discovery
  • Chain-of-thought strategies yield limited gains on creative tasks
  • Current models lack the “missing dimension of intelligence” needed for creative tool use

Longer Reasoning Increases Position Bias

A separate study on reasoning bias found that extended chain-of-thought processing actually amplifies position bias in multiple-choice questions. Researchers tested 13 reasoning-mode configurations across models including DeepSeek-R1 and found that 12 showed positive correlation between reasoning trajectory length and position bias.

The research examined models on MMLU, ARC-Challenge, and GPQA benchmarks. Position Bias Scores ranged from 0.11 to 0.41 across different model configurations, with all open-weight reasoning models showing monotonically increasing bias across length quartiles.

Truncation experiments provided causal evidence:

  • Continuations from later trajectory points shifted toward position-preferred options
  • Bias increased from 16% to 32% for R1-Qwen-7B across position buckets
  • Even DeepSeek-R1 at 671B parameters showed length-driven bias in longest quartiles
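The quartile analysis described above can be sketched in a few lines: bucket answers by reasoning-trace length, then measure how far the choice distribution in each bucket drifts from uniform. The records, scoring function, and bucketing below are illustrative stand-ins for the paper's methodology, not its actual code or data.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical evaluation records: (reasoning_token_count, chosen_position)
# on a 4-option multiple-choice benchmark. Purely illustrative numbers.
records = [
    (120, 0), (150, 1), (300, 0), (450, 0), (500, 2),
    (700, 0), (900, 0), (950, 3), (1200, 0), (1400, 0),
]

def position_bias(choices, n_options=4):
    """Normalized deviation of the choice distribution from uniform:
    0.0 = no positional preference, 1.0 = one position always wins."""
    counts = Counter(choices)
    total = len(choices)
    uniform = 1 / n_options
    tv = 0.5 * sum(abs(counts.get(i, 0) / total - uniform)
                   for i in range(n_options))
    return tv / (1 - uniform)  # rescale so the maximum possible score is 1.0

# Split records into length quartiles and score each bucket separately.
lengths = sorted(length for length, _ in records)
cuts = quantiles(lengths, n=4)  # three cut points between four quartiles

def bucket(length):
    return sum(length > c for c in cuts)  # quartile index 0..3

by_quartile = {}
for length, choice in records:
    by_quartile.setdefault(bucket(length), []).append(choice)

for q in sorted(by_quartile):
    print(f"quartile {q}: bias = {position_bias(by_quartile[q]):.2f}")
```

A monotonically increasing score across quartiles, as reported for the open-weight reasoning models, would indicate that longer traces drift toward a preferred answer position.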

Inference Scaling Drives Up Compute Costs

Reasoning models’ extended processing comes with substantial cost implications for production deployments. Analysis from Towards Data Science shows that inference scaling — where models spend extra compute on each response — transforms model selection into “high stakes operations tradeoffs.”

Reasoning models generate hidden reasoning tokens that never appear in final responses but represent massive increases in billable compute. Teams must balance the Cost-Quality-Latency triangle, where finance teams monitor shrinking margins, infrastructure engineers manage latency, and product managers weigh answer quality against processing delays.

Production considerations include:

  • Hidden reasoning tokens increase monthly compute invoices
  • P95 latency can reach 30+ seconds for complex reasoning
  • Task taxonomy needed to route simple queries to efficient models
  • Risk assessment for reasoning bypassing safety guardrails
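To see why hidden tokens dominate the invoice, consider a back-of-envelope cost model. Every number here (prices, token counts, request volume) and the router's task categories are assumed illustrations, not vendor quotes or a real taxonomy:

```python
# Back-of-envelope cost model for hidden reasoning tokens.
# All prices and volumes are assumptions for illustration only.

def monthly_cost(requests_per_day, visible_out_tokens, reasoning_tokens,
                 price_per_1k_out=0.06, days=30):
    """Reasoning tokens are billed as output tokens even though
    they never appear in the user-facing response."""
    billable = visible_out_tokens + reasoning_tokens
    return requests_per_day * days * billable / 1000 * price_per_1k_out

baseline = monthly_cost(10_000, 400, 0)       # non-reasoning model
reasoning = monthly_cost(10_000, 400, 3_000)  # same answer + hidden chain
print(f"baseline:  ${baseline:,.0f}/month")
print(f"reasoning: ${reasoning:,.0f}/month ({reasoning / baseline:.1f}x)")

def route(task_type):
    """Naive task-taxonomy router (hypothetical categories): send only
    tasks that plausibly need multi-step reasoning to the expensive model."""
    if task_type in {"math", "code", "analysis"}:
        return "reasoning-model"
    return "fast-model"
```

With these assumed numbers, the reasoning configuration bills 8.5x the baseline for identical visible output, which is why even a naive router that keeps simple queries on a cheap model can recover most of the margin.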

AMD Hardware Proves Viable for Training

Zyphra’s release of ZAYA1-8B demonstrates that AMD Instinct MI300 GPUs can successfully train competitive reasoning models. According to VentureBeat, the 8-billion-parameter mixture-of-experts model, with only 760 million active parameters, matches the performance of larger models on third-party benchmarks.

ZAYA1-8B was trained entirely on AMD’s MI300 GPU stack, positioning the platform as a viable alternative to NVIDIA’s dominant hardware. The model is available under the Apache 2.0 license on Hugging Face and can be tested through Zyphra Cloud’s inference platform.

Technical specifications:

  • 8 billion total parameters, 760 million active
  • Mixture-of-experts architecture
  • Competitive performance vs GPT-5-High and DeepSeek-V3.2
  • Full-stack training on AMD Instinct MI300 GPUs
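The headline 8B-total / 760M-active split is a standard property of top-k expert routing: each token passes through the shared layers plus only k of the routed experts. The sketch below reproduces numbers of that shape; the expert count, expert size, shared-parameter count, and top-k value are all assumptions chosen to mirror the reported figures, not ZAYA1's actual architecture.

```python
# Why an 8B-parameter MoE can run like a sub-1B dense model per token.
# All architecture numbers below are illustrative assumptions.

def active_params(shared, n_experts, expert_size, top_k):
    """Total parameters vs. parameters actually touched per token:
    shared layers plus the top_k routed experts."""
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active

total, active = active_params(
    shared=400e6,         # embeddings, attention, norms (assumed)
    n_experts=64,         # routed experts (assumed)
    expert_size=118.75e6, # parameters per expert FFN (assumed)
    top_k=3,              # experts consulted per token (assumed)
)
print(f"total: {total/1e9:.1f}B  active: {active/1e6:.0f}M per token")
```

Under these assumed values the model holds 8.0B parameters but activates only about 756M per token, which is the shape of the trade-off that lets ZAYA1-8B compete with larger dense models at a fraction of the per-token compute.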

Engineering Complexity Increases

The shift toward reasoning models adds layers of complexity for LLM engineers. Technical analysis shows practitioners must now understand not just transformer architectures but also inference optimization, evaluation pitfalls, and system-level considerations for reasoning-capable models.

Key engineering domains now include tokenization strategies, attention mechanisms, fine-tuning approaches, and specialized evaluation methods for reasoning tasks. Practitioners transitioning from adjacent fields such as computer vision must additionally grasp training trade-offs, inference bottlenecks, alignment challenges, and practical concerns like prompt engineering.

Critical engineering areas:

  • Text-to-vector conversion and tokenization
  • Model architecture design and training strategies
  • Inference optimization for extended reasoning
  • Evaluation frameworks for reasoning capabilities
  • System-level deployment considerations

What This Means

The research reveals a complex picture for reasoning models in 2025. While these systems show impressive capabilities on standard benchmarks, they face fundamental limitations in creative problem-solving and introduce new forms of bias that scale with reasoning length.

For enterprises, the cost implications are significant. Reasoning models can dramatically increase compute bills through hidden token generation, requiring careful task routing and cost management strategies. The success of AMD-trained models like ZAYA1-8B suggests more hardware options for organizations looking to develop reasoning capabilities.

The findings suggest that current reasoning models may not be as robust or cost-effective as initially hoped, particularly for creative or open-ended tasks where their limitations are most apparent.

FAQ

How much do reasoning models increase compute costs?
Reasoning models generate hidden reasoning tokens that can multiply compute costs several times over standard models, with exact increases depending on reasoning complexity and trajectory length.

Are reasoning models actually better at creative problem-solving?
No — research shows reasoning models struggle with creative tool use and affordance-based problem-solving, often failing to identify non-obvious but physically plausible solutions despite strong performance on traditional benchmarks.

Do longer reasoning chains improve accuracy?
While longer reasoning can improve accuracy on some tasks, it also increases position bias in multiple-choice questions, creating a trade-off between thoroughness and systematic bias that affects model reliability.

Sources

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.