Chain-of-Thought Reasoning: Advances and Hidden Flaws

Synthesized from 5 sources

Chain-of-thought reasoning has become one of the most studied techniques in AI development, powering models like DeepSeek-R1 and OpenAI’s o1 series — but a cluster of new research papers published in May 2026 reveals both meaningful progress and structural problems that engineers and evaluators need to understand. From position bias that grows with reasoning length, to new frameworks for knowing when to stop iterating, the field is maturing past simple benchmarks into harder questions about reliability and design.

What Chain-of-Thought Reasoning Actually Does

Chain-of-thought (CoT) reasoning is a prompting and training technique that instructs a model to produce intermediate reasoning steps before delivering a final answer. Rather than mapping a question directly to an output, the model generates a “trajectory” — a sequence of logical steps — that resembles how a human might work through a problem on paper.

According to TechCrunch’s AI glossary, reasoning-tuned models are part of a broader push toward systems that can handle multi-step tasks autonomously. The assumption underlying most CoT deployments is that more thinking produces better, less biased answers. That assumption is now under scrutiny.

Towards Data Science describes CoT as one of several key concepts LLM engineers must internalize alongside tokenization, attention mechanisms, and fine-tuning strategies. In practice, CoT is implemented at both the prompting layer — by instructing models to “think step by step” — and at the training layer, where models are fine-tuned on reasoning traces using reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).
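
To make the prompting-layer version concrete, here is a minimal sketch of a step-by-step prompt wrapper and answer extractor. The call_model stub is a hypothetical stand-in for whatever inference client is actually in use, not a real API.

```python
def call_model(prompt: str) -> str:
    """Hypothetical inference client; swap in your provider's API."""
    raise NotImplementedError

def build_cot_prompt(question: str) -> str:
    """Wrap a question in a step-by-step instruction so the model emits
    intermediate reasoning before committing to a final answer."""
    return (
        f"Question: {question}\n"
        "Think step by step, then give your final answer on a new line "
        "prefixed with 'Answer:'."
    )

def extract_answer(completion: str) -> str:
    """Pull the final answer out of the reasoning trajectory."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()  # fall back if the model skipped the prefix
```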

The technique has produced measurable gains on math and logic benchmarks, which is why it became the backbone of OpenAI’s o1 and o3 models and the open-weight DeepSeek-R1 family.

Longer Reasoning Trajectories Amplify Position Bias

A paper published on arXiv (arXiv:2605.06672) directly challenges the assumption that more reasoning equals more reliable reasoning. The study tested thirteen reasoning-mode configurations — including two R1-distilled 7–8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B parameters — across the MMLU, ARC-Challenge, and GPQA benchmarks.

The core finding: within any reasoning-capable model, position bias in multiple-choice QA scales with the length of the reasoning trajectory. Twelve of the thirteen configurations showed a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, with correlations ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations showed monotonically increasing PBS across length quartiles.
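
The paper's exact PBS formula isn't reproduced here, so the sketch below assumes one plausible definition: the total-variation distance between the distribution of positions a model chooses and the uniform distribution. Treat it as an illustration of the kind of metric involved, not as the paper's own.

```python
from collections import Counter

def position_bias_score(chosen_positions: list[int], n_options: int = 4) -> float:
    """Total-variation distance between the model's chosen-position
    frequencies and a uniform distribution; 0.0 means no positional
    preference. An assumed stand-in for the paper's PBS, which may differ."""
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    uniform = 1.0 / n_options
    return 0.5 * sum(
        abs(counts.get(pos, 0) / total - uniform) for pos in range(n_options)
    )

# A model that picks position 0 on 70 of 100 items scores 0.45:
print(position_bias_score([0] * 70 + [1] * 10 + [2] * 10 + [3] * 10))
```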

The researchers also ran a truncation intervention, resuming model completions from later points in the reasoning trajectory. The later the resumption point, the more likely the continuation was to land on a position-preferred answer option, with the shift rate rising from 16% to 32% for R1-Qwen-7B across absolute-position buckets. This provides causal evidence that longer reasoning doesn't just correlate with bias; it accumulates it.
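
A rough version of that intervention can be set up as below; the exact segmentation and re-prompting procedure is an assumption here, not taken from the paper.

```python
def resume_prompt(question_prompt: str, trajectory: str, fraction: float) -> str:
    """Build a re-prompt that cuts an existing reasoning trajectory at
    `fraction` of its length, so the model continues from that prefix."""
    prefix = trajectory[: int(len(trajectory) * fraction)]
    return question_prompt + "\n" + prefix

# Feed resume_prompt(...) to any inference client, then compare the resumed
# completions' chosen answer positions across truncation points.
```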

At 671B parameters, aggregate PBS collapsed to 0.019, but the length effect still appeared in the longest quartile (PBS = 0.071), suggesting that model scale gates the expression of bias rather than eliminating the underlying mechanism.

The paper also found that direct-answer position bias — observed strongly in Llama-Instruct-direct but weakly in Qwen-Instruct-direct — is a distinct phenomenon uncorrelated with trajectory length. CoT reasoning, the authors argue, replaces baseline positional bias with a different form: length-accumulated bias.

A New Framework for Knowing When to Stop

A separate arXiv paper (arXiv:2605.06690) addresses a design problem that most recursive reasoning systems leave implicit: when should a model stop iterating?

The researchers propose representing the reasoning state as an epistemic state graph — a structured encoding of extracted claims, evidential relations, open questions, and confidence weights. This formalization makes the reasoning state inspectable rather than opaque.
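
A toy encoding along those lines might look like the following; the field names are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class EpistemicState:
    """Toy epistemic state graph: extracted claims with confidence weights,
    evidential relations between claims, and unresolved questions. Field
    names are illustrative assumptions, not the paper's schema."""
    claims: dict[str, float] = field(default_factory=dict)              # claim -> confidence
    relations: list[tuple[str, str, str]] = field(default_factory=list) # (claim, kind, claim)
    open_questions: set[str] = field(default_factory=set)
```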

To determine when further iteration is unlikely to help, the paper introduces the order-gap: the distance between the states reached by expand-then-consolidate versus consolidate-then-expand orderings. When the order-gap is small, the two orderings agree, signaling that additional reasoning passes are unlikely to change the conclusion.

The paper’s main theoretical result gives a necessary and sufficient condition for the linearized order-gap to be non-degenerate near the fixed point — establishing when the stopping criterion is informative rather than algebraically vacuous. The authors are careful to note this is a local condition, not a global convergence guarantee.
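
In code, the stopping check reduces to comparing the two operator orderings. The sketch below uses a toy numeric state and placeholder operators, since the paper's actual expand and consolidate maps act on epistemic state graphs rather than vectors.

```python
import numpy as np

def order_gap(state: np.ndarray, expand, consolidate) -> float:
    """Distance between the states reached by expand-then-consolidate and
    consolidate-then-expand; a small gap signals that further passes are
    unlikely to change the conclusion."""
    return float(np.linalg.norm(consolidate(expand(state)) - expand(consolidate(state))))

def should_stop(state: np.ndarray, expand, consolidate, tol: float = 1e-3) -> bool:
    return order_gap(state, expand, consolidate) < tol

# Toy linear operators standing in for the paper's graph-valued maps:
expand = lambda s: 0.9 * s + 0.1
consolidate = lambda s: np.clip(s, 0.0, 1.0)
print(should_stop(np.array([0.5, 0.5]), expand, consolidate))  # True
```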

The framework is sketched for application across several reasoning paradigms:

  • Agent loops — iterative tool-use and planning cycles
  • Tree-of-thought reasoning — branching search over reasoning paths
  • Theorem proving — formal verification with incremental proof steps
  • Continual learning — updating beliefs as new evidence arrives

This kind of principled termination criterion matters practically: without it, reasoning systems either over-iterate (wasting compute and potentially accumulating bias) or under-iterate (missing important inferences).

Verifier-Guided Action Selection for Embodied Agents

Chain-of-thought reasoning doesn’t only apply to text QA — it’s also central to embodied AI agents that must plan and act in physical or simulated environments. A third paper (arXiv:2605.12620) proposes Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of multimodal LLM-based agents.

The core mechanism: rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions at inference time and uses a generative verifier to identify the most reliable choice — without modifying the underlying policy model.
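
The sample-then-verify pattern can be sketched as follows, with sample_action and verify as hypothetical callables standing in for the frozen policy model and the trained verifier; this is a schematic of the idea, not the paper's implementation.

```python
from typing import Callable

def select_action(
    observation: dict,
    sample_action: Callable[[dict], str],  # frozen policy model (hypothetical)
    verify: Callable[[dict, str], float],  # trained generative verifier (hypothetical)
    n_candidates: int = 8,
) -> str:
    """Sample an ensemble of candidate actions, score each with the
    verifier, and execute the most reliable one; the policy itself is
    never modified."""
    candidates = [sample_action(observation) for _ in range(n_candidates)]
    return max(candidates, key=lambda action: verify(observation, action))
```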

A key finding is that using an off-the-shelf MLLM as the verifier yields no improvement. The researchers instead built an LLM-driven data synthesis strategy that automatically constructs a curriculum of failure cases, exposing the verifier to a rich distribution of potential errors during training.

Across benchmarks in the Habitat and ALFRED environments, VeGAS achieved up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks. The gains were most pronounced in out-of-distribution scenarios — exactly the conditions where CoT-only agents tend to fail.

This suggests that CoT reasoning, while necessary, is not sufficient for robust embodied planning. An explicit verification layer trained on failure modes provides meaningful additional reliability.

Building LLM Systems That Reason Well in Practice

For engineers building on top of reasoning-capable models, Towards Data Science outlines the practical stack these systems sit within: tokenization converts text to numerical representations, transformer attention mechanisms allow models to relate tokens across context, and training strategies like RLHF shape reasoning behavior. Evaluation, the piece emphasizes, is where many teams underinvest — and the position bias findings above illustrate why that matters.

The TechCrunch glossary notes that AI agents — systems that use models to perform multi-step tasks autonomously — are increasingly built on CoT-capable models. The infrastructure for these systems is still being established, and design choices about state representation and termination (as addressed in the epistemic state graph paper) are not yet standardized across the industry.

Key engineering considerations when deploying reasoning models:

  • Benchmark evaluation should control for answer position, not just accuracy, particularly in multiple-choice formats (a randomization sketch follows this list)
  • Trajectory length should be monitored as a proxy for bias risk, not just as a compute cost
  • Termination criteria need explicit design — defaulting to fixed token budgets ignores whether additional reasoning is actually productive
  • Verification layers trained on failure distributions outperform generic verifiers when used with embodied or agentic systems
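
As a minimal sketch of the first two points, the snippet below shuffles answer positions per item and records both correctness and the chosen position, so a bias score can be tracked alongside accuracy. The ask callable is a hypothetical model interface.

```python
import random
from typing import Callable

def evaluate_item(
    question: str,
    options: list[str],
    correct_idx: int,
    ask: Callable[[str, list[str]], int],  # hypothetical: returns chosen index
) -> tuple[bool, int]:
    """Shuffle option order so positional preference can't masquerade as
    accuracy; return correctness plus the raw chosen position, which can
    feed a position-bias score alongside the accuracy tally."""
    order = list(range(len(options)))
    random.shuffle(order)
    shuffled = [options[i] for i in order]
    chosen = ask(question, shuffled)
    return order[chosen] == correct_idx, chosen
```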

What This Means

The research published this month collectively reframes how the field should think about chain-of-thought reasoning. CoT is not a bias-reduction technique — it’s a bias-substitution technique. It trades the shallow heuristic biases of direct-answer models for a different failure mode: position bias that accumulates as reasoning gets longer. At smaller model scales (7–8B parameters), this is a significant and measurable problem. At 671B, scale suppresses the aggregate effect but doesn’t eliminate it.

The practical implication for evaluation pipelines is immediate: multiple-choice benchmarks that don’t control for answer position are producing unreliable comparisons between reasoning models. Teams running MMLU or ARC-Challenge evaluations should add position randomization and PBS measurement to their standard tooling.

For system designers, the epistemic state graph framework and the VeGAS verifier approach point toward a more modular architecture for reasoning systems — one where the reasoning process, its termination, and the verification of its outputs are treated as separate, inspectable components rather than a single black-box generation pass.

The 36% gain from VeGAS on long-horizon embodied tasks is a concrete signal that verification trained on failure cases is worth the engineering investment, at least in agentic settings. Whether similar gains transfer to pure language reasoning tasks remains an open question the research doesn’t yet answer.

FAQ

What is chain-of-thought reasoning in LLMs?

Chain-of-thought reasoning is a technique where a language model generates intermediate reasoning steps — a visible “thought process” — before producing a final answer. It is used both as a prompting strategy and as a training objective, and underpins models like OpenAI’s o1 and DeepSeek-R1.

Does more reasoning always produce more accurate and less biased results?

Not necessarily. According to arXiv:2605.06672, longer reasoning trajectories are positively correlated with increased position bias in multiple-choice QA across twelve of thirteen tested configurations, after controlling for accuracy. More thinking can accumulate bias rather than reduce it, particularly in smaller models.

What is position bias in AI reasoning models?

Position bias refers to a model’s tendency to favor answer options based on their position in a list — for example, consistently preferring option A or option C — rather than their content. The arXiv study found that this bias grows with reasoning trajectory length in CoT-capable models and is distinct from the positional bias seen in direct-answer models.

Sources

  • arXiv:2605.06672 (position bias and reasoning trajectory length)
  • arXiv:2605.06690 (epistemic state graphs and the order-gap stopping criterion)
  • arXiv:2605.12620 (Verifier-Guided Action Selection for embodied agents)
  • TechCrunch AI glossary
  • Towards Data Science (core LLM engineering concepts)

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.