pith. machine review for the scientific record.

arXiv: 2605.06672 · v1 · submitted 2026-04-21 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords position bias · chain-of-thought reasoning · reasoning models · multiple-choice questions · trajectory length · language models · AI bias

The pith

Longer reasoning trajectories increase position bias in multiple-choice questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that chain-of-thought reasoning reduces shallow biases in language models. It shows instead that, within any reasoning-capable model, position bias grows with the length of the reasoning process. This scaling holds after controlling for accuracy and is supported by truncation experiments that resume reasoning from later points. The pattern appears across model sizes and datasets, with direct-answer modes showing a separate, length-unrelated bias pattern.

Core claim

Within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Twelve of thirteen tested configurations exhibit positive partial correlation between trajectory length and Position Bias Score after accuracy controls. Truncation interventions provide causal evidence that continuations from later points shift toward position-preferred options. At the largest scale, aggregate bias is low but the length effect persists in the longest quartile.
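The review does not reproduce the paper's exact PBS formula (a referee comment below flags the same gap). As a purely illustrative sketch, one plausible operationalization tallies how often the model's choice lands in each answer slot across option shufflings and reports the excess over the uniform rate; the paper's version additionally controls for accuracy, which this toy does not.

```python
from collections import Counter

def position_bias_score(chosen_slots, n_slots=4):
    """Hypothetical PBS sketch, not the paper's definition: the excess
    rate at which answers land in the most over-selected slot, relative
    to the 1/n_slots rate expected if slot position were irrelevant."""
    counts = Counter(chosen_slots)
    n = len(chosen_slots)
    return max(counts.get(s, 0) / n - 1.0 / n_slots for s in range(n_slots))
```

An unbiased model choosing each slot equally often scores 0; a model that always picks the first slot scores 0.75 on four-option MCQs.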

What carries the argument

Position Bias Score (PBS), which isolates position preference while controlling for accuracy, analyzed via partial correlations with trajectory length and truncation probes that resume reasoning from intermediate points.
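A minimal sketch of the length–PBS partial correlation, assuming per-question arrays of trajectory length, PBS, and accuracy. Pearson on residuals is used here for simplicity; the paper reports partial ρ, and its exact estimator is not specified in this review.

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation of x and y after regressing `control` out of both:
    residualize each variable on control (plus intercept), then take
    the Pearson correlation of the residuals."""
    def residuals(v, c):
        A = np.column_stack([c, np.ones_like(c)])
        coef, *_ = np.linalg.lstsq(A, v, rcond=None)
        return v - A @ coef
    x, y, control = (np.asarray(a, dtype=float) for a in (x, y, control))
    rx, ry = residuals(x, control), residuals(y, control)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Applied per (model, mode) configuration, a positive value after accuracy controls would mirror the 0.11–0.41 range the paper reports.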

If this is right

  • Reasoning models cannot be treated as order-robust by default when evaluating multiple-choice questions.
  • Chain-of-thought replaces baseline direct-answer bias with a distinct length-accumulated bias.
  • High overall accuracy can mask length-driven bias, but the mechanism remains active in the longest trajectories.
  • Diagnostic tools such as PBS, commitment change points, and truncation probes can audit position bias in reasoning systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that rely on long reasoning chains may need explicit length controls or post-hoc bias checks to maintain fairness.
  • The length-bias relationship could be tested in non-MCQ tasks where sequential ordering influences decisions.
  • Model training objectives might incorporate penalties on length-correlated biases to reduce the effect at the source.

Load-bearing premise

The Position Bias Score and partial correlation controls fully isolate length-driven bias from accuracy, model-specific effects, and other confounds.

What would settle it

A truncation experiment in which resuming from later points in the trajectory produces no increase in shifts toward position-preferred options would falsify the causal claim.
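The probe itself is simple to sketch. Assuming a hypothetical `resume(prompt, prefix)` callable that continues decoding from a truncated trajectory and returns the final answer letter (the real experiment would drive the model's decoder), the falsification test above amounts to checking whether these shift rates stay flat rather than rising.

```python
def truncation_probe(prompt, trajectory, resume, preferred="A",
                     fractions=(0.25, 0.5, 0.75), samples=8):
    """Resume decoding from progressively later cut points and measure
    the rate at which continuations land on the position-preferred
    option. Under the paper's claim the rates rise with the cut point;
    flat rates would falsify the causal story."""
    rates = []
    for f in fractions:
        prefix = trajectory[:int(len(trajectory) * f)]
        answers = [resume(prompt, prefix) for _ in range(samples)]
        rates.append(sum(a == preferred for a in answers) / samples)
    return rates
```
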

Figures

Figures reproduced from arXiv: 2605.06672 by Xiao Wang.

Figure 1. Length-quartile PBS is monotonic across scales. R1-Qwen-7B and R1-Llama-8B show PBS growing 3–4× from shortest to longest length quartile on MMLU; similar patterns hold on ARC and GPQA. At 671B (MMLU, green), PBS is essentially zero on the first three quartiles and 0.071 on the longest, showing that the length-driven mechanism persists at scale but is gated by question difficulty.
Figure 2. Per-configuration view of the length–PBS relationship for the four local R1-distilled panels. Points are quartile means with standard-error bars; dashed lines are linear fits on the quartile centers. Partial ρ (controlling for accuracy) is annotated in each panel. For R1-Qwen-7B, the directional shift toward position A (the position-preferred option in our dataset) increases monotonically from 16%…
Figure 3. Truncation intervention on MMLU for R1-Qwen-7B and Qwen-Instruct-CoT.
Figure 4. PBS vs mean trajectory length (log scale) for all six (model, mode) configurations per benchmark. Direct-mode points (grey) form a separate cluster with family-specific baseline bias. Reasoning-mode points (blue / red) fall along an upward length–PBS trajectory (light red trend line). Shading demarcates the two regimes.
Original abstract

Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that within reasoning-capable models, per-question position bias in multiple-choice QA scales with the length of the reasoning trajectory. Across thirteen configurations (R1-distilled models, base models with CoT, and DeepSeek-R1) on MMLU, ARC-Challenge, and GPQA, twelve exhibit positive partial correlations (0.11–0.41, p<0.05) between trajectory length and Position Bias Score (PBS) after controlling for accuracy; all twelve open-weight setups show monotonic PBS increases across length quartiles. A truncation intervention supplies causal evidence via increased shifts toward position-preferred options (16–32%) when resuming from later trajectory points. The work distinguishes this length-driven effect from direct-answer position bias and notes that at 671B scale the aggregate PBS is low but the length effect persists in the longest quartile.

Significance. If the central claim holds, the results are significant for MCQ evaluation pipelines, as they indicate that CoT reasoning can replace baseline position bias with a length-accumulated form rather than eliminating bias. The multi-model, multi-dataset design, accuracy controls via partial correlation, and proposed diagnostics (PBS, commitment change point, truncation probes) constitute clear strengths. The empirical measurement approach with falsifiable patterns across 13 setups adds value, though the causal isolation of length from other trajectory properties remains the key point requiring further support.

major comments (1)
  1. [Truncation intervention (results)] The truncation intervention (results describing the probe) is presented as causal evidence that longer trajectories drive position bias, with reported shifts of 16–32% toward position-preferred options. However, resumption from later points necessarily includes greater accumulated tokens, prior commitments, and semantic context; any bias increase could therefore arise from these state differences rather than token count alone. The manuscript should add controls that match resumption points on equivalent reasoning depth or content type to isolate length per se.
minor comments (2)
  1. [Abstract] The abstract reports that twelve of thirteen configurations show the positive partial correlation but does not identify the single exception; stating which configuration fails to show the effect would improve interpretability.
  2. [Methods / PBS definition] The exact operational definition of the Position Bias Score (PBS) and the precise covariates used in the partial-correlation analysis should be stated explicitly in the main text (or a dedicated methods subsection) rather than assumed from the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for your thorough review and constructive feedback on our manuscript. We appreciate the referee's careful reading of the truncation intervention and the opportunity to address concerns about causal isolation. We respond to the major comment below.

Point-by-point responses
  1. Referee: The truncation intervention (results describing the probe) is presented as causal evidence that longer trajectories drive position bias, with reported shifts of 16–32% toward position-preferred options. However, resumption from later points necessarily includes greater accumulated tokens, prior commitments, and semantic context; any bias increase could therefore arise from these state differences rather than token count alone. The manuscript should add controls that match resumption points on equivalent reasoning depth or content type to isolate length per se.

    Authors: We thank the referee for this insightful observation. The truncation probe was designed to test whether bias increases when generation resumes after more prior reasoning has occurred, producing the reported 16–32% shifts. However, we fully agree that resumption from later points confounds additional token count with accumulated semantic context, prior commitments, and reasoning depth, so the design does not isolate length per se. Adding controls that match resumption points on equivalent reasoning depth or content type would require new experiments (e.g., generating matched-length but semantically varied trajectories), which we did not conduct. We will therefore make a partial revision: (1) change phrasing in the abstract and results from “causal evidence” to “supporting evidence” or “suggestive of a length-driven mechanism”; (2) add an explicit Limitations paragraph acknowledging the confounding factors and the difficulty of cleanly separating length from content; and (3) highlight that the partial-correlation and quartile results provide independent, non-interventional support for the overall claim. This revision clarifies the strength of the evidence without overstating causality. revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurement study

full rationale

The paper defines Position Bias Score (PBS) as a metric, computes partial correlations between reasoning trajectory length and PBS (controlling for accuracy), and reports results from a truncation intervention across models and datasets. No derivation chain, first-principles predictions, or equations are present that could reduce to inputs by construction. There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that justify the central claims. The analysis relies on direct empirical measurements and interventions, which are self-contained and falsifiable against external data. Minor self-citations, if any, are not load-bearing for the reported correlations or intervention results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the validity of the PBS metric as a bias measure independent of accuracy and the assumption that trajectory length is a causal driver isolatable via truncation.

axioms (1)
  • domain assumption Position Bias Score (PBS) validly quantifies order preference after accuracy controls
    Central to all reported partial correlations and quartile trends.

pith-pipeline@v0.9.0 · 5641 in / 1147 out tokens · 34094 ms · 2026-05-11T01:20:12.541349+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC , the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633--638, 2025. doi:10.1038/s41586-025-09422-z. URL https://arxiv.org/abs/2501.12948. arXiv:2501.12948

  3. [3]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  4. [6]

    QwQ : Reflect deeply on the boundaries of the unknown

    Qwen Team . QwQ : Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/, 2024. Blog post

  5. [7]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-Proof Q&A benchmark. In Conference on Language Modeling (COLM), 2024. URL https://arxiv.org/abs/2311.12022

  6. [10]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.o...


  8. [12]

    Does Reasoning Introduce Bias?

    Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, and Yi Fang. Does reasoning introduce bias? In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. doi:10.18653/v1/2025.findings-emnlp.1006

  9. [13]

    Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

    Investigating thinking behaviours of reasoning-based language models for social bias mitigation. arXiv preprint arXiv:2510.17062, 2025


  11. [15]

    Large Language Models are not Fair Evaluators

    Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.acl-long.511

  12. [16]

    Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

    Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024. doi:10.18653/v1/2024.findings-naacl.130


