pith. machine review for the scientific record.

arxiv: 2605.07654 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.CL · cs.LG

Recognition: 2 Lean theorem links

Reliable Chain-of-Thought via Prefix Consistency

Junpei Komiyama, Mohammad Atif Quamar, Naoto Iwase, Yuki Ichihara

Pith reviewed 2026-05-11 02:20 UTC · model grok-4.3

classification 📊 stat.ML · cs.CL · cs.LG
keywords chain-of-thought · self-consistency · prefix consistency · majority voting · reasoning tasks · token efficiency · LLM inference
0 comments

The pith

Prefix consistency, measured by how often an answer reappears after truncating and regenerating CoT traces, weights votes to match full majority-voting accuracy with far fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that correct Chain-of-Thought traces reproduce their final answer more often than incorrect ones when the trace is truncated and the suffix regenerated from the prefix. This reproduction rate becomes a weighting signal for aggregating multiple sampled traces via majority voting. The signal requires no model probabilities or self-rating prompts. It serves as the strongest predictor of correctness in most settings across models and benchmarks. If the signal holds, self-consistency reaches its accuracy plateau using substantially less computation.

Core claim

Truncating a CoT trace partway through and regenerating the remainder shows that traces ending in the correct answer reproduce that answer more frequently than traces ending in a wrong answer. This difference, called prefix consistency, is used to weight each candidate answer when performing majority voting over sampled traces, yielding more reliable aggregation than uniform voting.

What carries the argument

Prefix consistency: the rate at which a candidate answer reappears when its original CoT trace is truncated at a random point and the suffix is regenerated.
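
As an illustrative sketch only (the helper names `prefix_consistency` and `regenerate` are hypothetical, not the paper's released code), the signal could be estimated per trace as:

```python
import random

def prefix_consistency(trace_tokens, answer, regenerate, k=8):
    """Estimate how often `answer` reappears when the trace is truncated
    at a random point and the suffix is regenerated.

    `regenerate(prefix)` is assumed to sample a fresh continuation from
    the prefix and return its final answer; `k` is the number of
    regenerations averaged over.
    """
    hits = 0
    for _ in range(k):
        cut = random.randint(1, len(trace_tokens) - 1)  # random truncation point
        prefix = trace_tokens[:cut]
        if regenerate(prefix) == answer:
            hits += 1
    return hits / k  # fraction of regenerations reproducing the answer
```

Note that the score is computed purely from sampled text: no token log-probabilities or self-rating prompts enter the loop, which is what makes the signal applicable to black-box inference.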

If this is right

  • Prefix consistency predicts correctness better than compared baselines in most settings.
  • Reweighting votes by prefix consistency reaches standard majority-voting accuracy with up to 21 times fewer tokens, median 4.6 times fewer.
  • The method needs no access to token log-probabilities or extra self-rating prompts.
  • It applies across five reasoning models and four math and science benchmarks.
  • Savings arise because fewer complete traces suffice to reach the same performance plateau.
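
The aggregation step itself is simple; a minimal sketch, assuming each sampled trace has already been scored (the toy weights below are illustrative, not from the paper):

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Aggregate (answer, weight) pairs, where each weight is that
    trace's prefix-consistency score. With all weights equal this
    reduces to standard majority voting."""
    totals = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    return max(totals, key=totals.get)

# Three traces answer "A" with low consistency, two answer "B" with high:
samples = [("A", 0.2), ("A", 0.3), ("A", 0.1), ("B", 0.9), ("B", 0.8)]
print(weighted_majority_vote(samples))  # prints "B" (1.7 vs 0.6 total weight)
```

Uniform weights would pick "A" here (three votes to two); the consistency weights flip the outcome, which is the mechanism behind the claimed token savings: fewer sampled traces are needed when unreliable votes are discounted.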

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prefix consistency may capture structural differences in how models generate correct versus incorrect reasoning paths.
  • The signal could be used during training to upweight or filter traces that show high internal consistency.
  • Combining prefix consistency with other low-cost signals might yield even larger token reductions in ensemble reasoning.
  • The approach suggests that partial-trace stability could generalize to other test-time inference techniques beyond voting.

Load-bearing premise

The observed difference in answer reproduction rates between correct and incorrect traces is stable enough across models, tasks, and truncation points to serve as a reliable weighting signal.

What would settle it

Observing no meaningful difference in reproduction rates between correct and incorrect traces on a new model or benchmark would falsify the reliability of the weighting signal.
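
Such a falsification test amounts to checking whether the gap between mean reproduction rates of correct and wrong traces stays positive on the new setting. A toy sketch (the record format is hypothetical):

```python
def reproduction_gap(records):
    """records: list of (reproduction_rate, is_correct) pairs, one per
    trace. Returns the mean rate for correct traces minus the mean rate
    for wrong traces; the weighting signal is informative only if this
    gap is clearly positive."""
    correct = [r for r, ok in records if ok]
    wrong = [r for r, ok in records if not ok]
    return sum(correct) / len(correct) - sum(wrong) / len(wrong)
```

A gap near zero (or negative) on a new model or benchmark would undermine the load-bearing premise; in practice one would test it per problem with confidence intervals, as the paper's GLM fits do.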

Figures

Figures reproduced from arXiv: 2605.07654 by Junpei Komiyama, Mohammad Atif Quamar, Naoto Iwase, Yuki Ichihara.

Figure 1
Figure 1. Overview of prefix-consistency-weighted majority voting (PC-WMV). Top: cost-equivalent accuracy on two models and benchmark settings. Shaded bands and error bars are ±2σ confidence intervals. Bottom: overview of the proposed method (prefix consistency). Correct reasoning traces exhibit greater reproducibility under regeneration.
Figure 2
Figure 2. Per-problem reliability signal distributions for correct vs. wrong traces on FrontierScience-Olympiad with GPT-OSS-120B. Each violin shows the distribution of per-problem mean signal values across the 73 problems that have at least one correct and one wrong trace. On this difficult benchmark (Pass@1 ≈ 0.34), the baseline methods DeepConf tail and P(True) place correct and wrong traces in nearly overlapping r…
Figure 3
Figure 3. Per-problem reproduction rates vs. Pass@1. One panel per model. Solid (•): rC. Dashed (×): rW. Both are per-category logistic-regression fits with shaded 2σ cluster-bootstrap CIs over problems and per-problem scatter overlays; curves are labeled with the GLM slope β. rC > rW holds across the full Pass@1 range, including below 50%. Science: FrontierScience-Olympiad. Math: HMMT Feb 2026 ∪ AIME 2025 ∪ Brumo…
Figure 4
Figure 4. Macro ROC curves for prefix consistency and the WMV baselines, all 20 (model, benchmark) cells. Rows are models, columns are benchmarks. Each curve is the per-problem ROC averaged on a common false-positive-rate grid, restricted to problems with at least one correct and one wrong initial sample. The dashed diagonal is the random-classifier baseline.
Figure 5
Figure 5. Extends the cost–accuracy analysis of …
Figure 6
Figure 6. Cost–accuracy curves with the full baseline set. Same layout as …
Figure 7
Figure 7. Oracle decomposition: pool coverage vs. reweighting. Same 5×4 layout as …
Figure 8
Figure 8. Per-problem π̂(a⋆) vs. π→(a⋆). Each point is one problem and the dashed line marks y = x. Red diamonds with dotted projection lines show each cell's mean and its coordinates. ∗ flags cells where a paired Wilcoxon test rejects equality (Bonferroni-corrected significance level αsig = 0.05/20, not to be confused with the target-accuracy ratio α of Section 4.3, with the panel-level p-value shown).
Figure 9
Figure 9. PC's AUROC advantage is preserved across judges. Each axis is ΔAUROC := AUROC_PC − AUROC_best baseline, under self-judge (x) and an external judge (Claude Sonnet 4.6, y); one point per cell (12 cells), and the dashed line is y = x. All 12 points lie in the top-right or bottom-left quadrant; no cell crosses an axis. WMV cost–accuracy curves under both judges nearly overlap.
Figure 10
Figure 10. WMV cost–accuracy curves are stable across judges. 4 models × 3 benchmarks; faded curves use the model's own judge, saturated curves use an external judge (Claude Sonnet 4.6). The four methods (PC-cubic, Standard MV, DeepConf tail, P(True)) match …
Figure 11
Figure 11. Adds GLM panels for the two models not shown in …
Figure 12
Figure 12. Plots the four per-benchmark curves per model in a single panel, and …
Figure 13
Figure 13. Binned view of …
Figure 14
Figure 14. LOWESS smoother view of …
Figure 15
Figure 15. DeepConf tail vs. Pass@1, all five models. Per-class Gaussian linear fits with 2σ cluster-bootstrap CIs over problems. Scatter overlays are per-problem mean confidence.
Figure 16
Figure 16. P(True) vs. Pass@1, all five models. Same protocol as …
Figure 17
Figure 17. Qualitative examples of CoT regeneration after truncation. A correct initial trace (green, top) and a wrong initial trace (red, bottom) on AIME 2025, each truncated at 25% of its CoT and continued from the prefix.
read the original abstract

Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes prefix consistency as a reliability signal for Chain-of-Thought (CoT) reasoning: truncate a sampled trace partway through and regenerate the suffix multiple times; weight each candidate answer by the fraction of regenerations that reproduce the original answer. This weighting is applied to self-consistency (majority voting) and is shown to outperform uniform voting as a correctness predictor. Across five models and four math/science benchmarks, the method reaches the accuracy plateau of standard majority voting at up to 21x fewer tokens (median 4.6x). The approach requires no token log-probabilities or self-rating prompts; code is released.

Significance. If the efficiency claims survive scrutiny of total token accounting, the work supplies a practical, zero-parameter, model-agnostic improvement to test-time scaling for reasoning tasks. It introduces a new, observable reliability signal grounded directly in regeneration behavior and demonstrates consistent gains over standard self-consistency without additional model access. The public code release supports reproducibility and further experimentation.

major comments (2)
  1. [Results section] Results section (token-efficiency claims): the headline result that prefix consistency reaches standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x) is load-bearing for the central value proposition. The manuscript must explicitly state whether the reported token budgets include the full regeneration overhead (multiple suffix samples per original trace) required to compute the consistency scores. If only initial full traces are counted, the net savings are not demonstrated and the accuracy-per-token claim cannot be evaluated.
  2. [Experimental setup and §4] Experimental setup and §4 (truncation and regeneration protocol): the difference in answer reproduction rates between correct and incorrect traces is treated as a stable weighting signal, yet no details are provided on truncation length distribution, number of regenerations per trace, or controls for prompt sensitivity and temperature. Without these, it is impossible to verify that the observed reliability gap is robust enough to support the reported gains across models and tasks.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods should list the exact five models (including sizes) and four benchmarks with their full names and versions for immediate clarity.
  2. [Figures] Figure captions and axis labels in the efficiency plots should explicitly note whether token counts are per-question averages or totals and whether they are normalized to the MV baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments raise important points about token accounting and experimental details that we address below. We will revise the manuscript to improve clarity on both issues.

read point-by-point responses
  1. Referee: [Results section] Results section (token-efficiency claims): the headline result that prefix consistency reaches standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x) is load-bearing for the central value proposition. The manuscript must explicitly state whether the reported token budgets include the full regeneration overhead (multiple suffix samples per original trace) required to compute the consistency scores. If only initial full traces are counted, the net savings are not demonstrated and the accuracy-per-token claim cannot be evaluated.

    Authors: We thank the referee for this observation. The reported token budgets for prefix consistency explicitly include the full regeneration overhead: for each initial trace, the total token count encompasses the k suffix regenerations performed to compute the consistency weight. This net accounting is what enables the comparison showing that prefix consistency reaches the standard MV accuracy plateau with substantially fewer total tokens (up to 21x, median 4.6x). We will add an explicit statement and, if space permits, a short breakdown of token components in the Results section to remove any ambiguity. revision: yes

  2. Referee: [Experimental setup and §4] Experimental setup and §4 (truncation and regeneration protocol): the difference in answer reproduction rates between correct and incorrect traces is treated as a stable weighting signal, yet no details are provided on truncation length distribution, number of regenerations per trace, or controls for prompt sensitivity and temperature. Without these, it is impossible to verify that the observed reliability gap is robust enough to support the reported gains across models and tasks.

    Authors: We agree that these protocol details should be stated more explicitly for reproducibility. In the revised §4 we will add: (i) truncation lengths are sampled uniformly from the interval [30%, 70%] of each trace length; (ii) we use 10 regenerations per trace in the main experiments (with ablation on 5 and 20); (iii) all sampling (initial and regeneration) uses the identical prompt template and temperature (0.7) with no additional self-rating or modified instructions. We will also note that prompt sensitivity was not exhaustively varied beyond the standard settings used for all baselines. revision: yes
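
The protocol stated in points (i)–(iii) can be sketched per trace as follows (the `sample_suffix` interface is hypothetical; the identical prompt template and temperature 0.7 are assumed to live inside it):

```python
import random

def score_trace(trace_tokens, answer, sample_suffix, k=10):
    """Prefix-consistency score under the stated protocol:
    (i) truncate uniformly within [30%, 70%] of the trace length,
    (ii) regenerate k suffixes (k=10 in the main experiments), and
    (iii) count how often the original answer is reproduced."""
    hits = 0
    for _ in range(k):
        frac = random.uniform(0.30, 0.70)              # (i) uniform truncation
        prefix = trace_tokens[:int(frac * len(trace_tokens))]
        if sample_suffix(prefix) == answer:            # (ii)/(iii) same prompt, T=0.7
            hits += 1
    return hits / k
```

This also makes the token accounting from the first response concrete: the regeneration cost per trace is roughly k partial suffixes, all of which must be charged against the reported budgets.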

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines prefix consistency directly from measured reproduction rates of truncated CoT traces, using the empirical difference between correct and incorrect answers as a weighting signal. No equations or steps reduce the method to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and the provided text contains no load-bearing self-citations or imported uniqueness results. The efficiency claims are presented as experimental outcomes rather than tautological derivations, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical method contribution. No free parameters, axioms, or invented entities are stated in the abstract; the central claim rests on the observed statistical difference in regeneration rates.

pith-pipeline@v0.9.0 · 5462 in / 1073 out tokens · 19602 ms · 2026-05-11T02:20:52.764350+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    Yichao Fu, Xuewei Wang, Hao Zhang, Yuandong Tian, and Jiawei Zhao

    Deep Think with Confidence. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8LqHs0KIM7.

  2. [2]

    Hasan Abed Al Kader Hammoud, Hani Itani, and Bernard Ghanem

    Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think. arXiv preprint arXiv:2504.20708, 2025. URL https://arxiv.org/abs/2504.20708.

  3. [3]

    Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, and Sachin Dev Sharma

    The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hrnSqERgPn.

  4. [4]

    Alexander von Recum, Leander Girrbach, and Zeynep Akata

    Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aQZIpELFwp.

  5. [5]

    Scalena et al. [2025]

    Used per-token entropy to allocate compute adaptively during generation, a token-level compute-allocation complement to the paper's sample-level reliability estimation.

  6. [6]

    von Recum et al. [2026]

    Systematically evaluated seven intervention types on open-weight reasoning LLMs and found that robustness degrades more when interventions occur early in the CoT. Jiang et al. [2025] showed that correct answers persist even when reasoning logic is pertu…