pith. machine review for the scientific record

arxiv: 2604.26644 · v1 · submitted 2026-04-29 · 💻 cs.AI

Recognition: unknown

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords test-time scaling · disagreement routing · mathematical reasoning · large reasoning models · majority voting · self-rewriting · strategy selection · inference optimization

The pith

Output disagreement routes test-time scaling among lightweight fixes, voting, and rewriting to raise accuracy and cut costs on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often fail on difficult math problems even when extra compute is applied at test time through sampling or correction. The work finds that disagreement among multiple outputs reliably signals both instance difficulty and likely errors. It therefore introduces a training-free router that assigns lightweight resolution to consistent cases, majority voting to moderate disagreement, and rewriting-based reformulation to highly ambiguous ones. On seven mathematical benchmarks and three models the routed approach delivers 3 to 7 percent higher accuracy while using less total sampling than fixed-strategy baselines.

Core claim

We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness. This correlation allows test-time scaling to be recast as an instance-level routing problem that selects among strategies—lightweight resolution for low disagreement, majority voting for medium disagreement, and rewriting for high disagreement—rather than applying one strategy uniformly.

What carries the argument

The disagreement-guided router that measures variance across sampled outputs and dispatches each instance to the matching scaling strategy.
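The routing logic described here reduces to a threshold rule over a disagreement score. A minimal Python sketch, with an illustrative modal-answer disagreement measure and made-up cutoffs (the paper's exact metric and thresholds are not stated in this review):

```python
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the modal answer.
    One plausible disagreement measure; the paper's exact metric may differ."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

def route(answers, low=0.2, high=0.6):
    """Dispatch an instance to a scaling strategy by disagreement level.
    The `low`/`high` thresholds are illustrative, not from the paper."""
    d = disagreement(answers)
    if d <= low:
        return "lightweight_resolution"  # consistent: accept cheaply
    if d <= high:
        return "majority_vote"           # moderate: aggregate samples
    return "rewrite"                     # ambiguous: reformulate the problem

# six sampled final answers for one instance, five of them agreeing
print(route(["42", "42", "42", "42", "42", "41"]))  # → lightweight_resolution
```

The point of the sketch is that the router itself is nearly free: all of the compute stays in the sampling and in whichever strategy the instance is routed to.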

If this is right

  • Low-disagreement instances are solved accurately with minimal sampling.
  • Moderate disagreement benefits from voting to aggregate multiple predictions.
  • High disagreement triggers rewriting to reformulate and resolve ambiguity.
  • Across models the method yields 3-7% accuracy gains at reduced total sampling cost.
  • No model retraining is required for the routing decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disagreement signal might adaptively allocate compute in code or planning domains where hardness is similarly reflected in output variance.
  • Learned rather than fixed disagreement thresholds could further optimize the routing boundaries per model.
  • Pairing the router with tree search inside the rewrite branch could compound gains on the hardest instances.

Load-bearing premise

Disagreement among outputs is a reliable indicator of both problem difficulty and whether the answer is correct.

What would settle it

If, on a new set of math problems, high-disagreement instances do not in fact show lower accuracy, or if the router fails to beat uniform majority voting on either accuracy or cost, the claimed routing benefit would be refuted.
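That refutation test can be run mechanically: bin instances by measured disagreement and check whether accuracy falls as disagreement rises. A hedged sketch, with illustrative bin edges:

```python
def accuracy_by_disagreement(records, bins=((0.0, 0.2), (0.2, 0.6), (0.6, 1.01))):
    """records: (disagreement, correct) pairs for a held-out problem set.
    Returns accuracy per disagreement bin; the load-bearing premise
    predicts a decreasing sequence. Bin edges here are illustrative."""
    accs = []
    for lo, hi in bins:
        hits = [correct for d, correct in records if lo <= d < hi]
        accs.append(sum(hits) / len(hits) if hits else None)
    return accs

# the premise holds when the returned accuracies decrease across bins
```

If the sequence is flat or non-monotone on fresh problems, disagreement is not carrying the signal the routing rule needs.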

Figures

Figures reproduced from arXiv: 2604.26644 by Dong Li, Jinpeng Li, Junhua Fang, Juntao Li, Min Zhang, Yixin Ji, Yu Luo, Zhimin Lin.

Figure 1: The effect of rewriting and majority voting (6 samplings) for Qwen3-8B across instances of …
Figure 2: This figure illustrates the overall workflow of our method, where MDD, NDS, MDS, and …
Figure 3: The number of samplings used by different methods. The dashed line indicates the …
Figure 4: Average tokens and wall-clock time per sample for Qwen3-8B on Math500, Gaokao En, …
Figure 5: (Left) Effective and harmful rewriting ratios across different error types. (Right) Comparison …
Figure 6: The figure shows the recall (%) of incorrect samples as a function of the number of iterations …
original abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free framework for test-time scaling in large reasoning models that treats scaling as an instance-level routing problem: it measures output disagreement across samples and routes low-disagreement instances to lightweight resolution, moderate-disagreement cases to majority voting, and high-disagreement instances to rewriting-based reformulation. Experiments on seven mathematical benchmarks and three models are reported to yield 3-7% accuracy gains with reduced sampling cost relative to existing uniform scaling approaches.

Significance. If the disagreement signal proves reliable for partitioning instances into regimes where each strategy is optimal, the approach could improve the efficiency of test-time compute by avoiding unnecessary application of expensive methods on easy cases. The training-free design and evaluation across multiple benchmarks and models are positive features that would support practical adoption if the routing decisions are shown to be robust.

major comments (2)
  1. [§4] §4 (Experimental Results): The central claim of 3-7% accuracy improvement and cost reduction depends on the superiority of disagreement-guided routing, yet the manuscript provides no explicit ablation that applies the three strategies (lightweight resolution, majority voting, rewriting) uniformly to all instances and compares against the routed version. Without this control, the gains could arise primarily from the voting component on moderate cases rather than from the routing decisions themselves.
  2. [§3] §3 (Method): The disagreement metric, its computation across samples, and the procedure for selecting routing thresholds are not described with sufficient detail to determine whether thresholds were fixed a priori or tuned post-hoc on the same benchmarks. This is load-bearing because post-hoc selection on the evaluation data would undermine the claim that disagreement provides a general, reliable signal for strategy selection.
minor comments (2)
  1. [Abstract] Abstract and §2: The statement that disagreement is 'strongly correlated with instance difficulty and prediction correctness' should be supported by a quantitative plot or table (e.g., correlation coefficient or accuracy-vs-disagreement curve) rather than left as a qualitative observation.
  2. [§4] §4: The manuscript should report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the accuracy improvements and include variance across random seeds for the sampling-based methods.
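The significance test asked for in the second minor comment is cheap to add. A sketch of a paired bootstrap confidence interval on the per-instance accuracy difference (function and parameter names are illustrative, not from the paper):

```python
import random

def paired_bootstrap_ci(base, routed, n_boot=2000, alpha=0.05, seed=0):
    """95% CI for mean(routed - base), where `base` and `routed` are
    per-instance 0/1 correctness under each method on the same problems.
    If the interval excludes 0, the routed gain is unlikely to be noise."""
    rng = random.Random(seed)
    diffs = [r - b for b, r in zip(base, routed)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling the paired differences, rather than each method's accuracy separately, keeps the per-instance correlation between the two methods in the interval.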

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and method.

point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The central claim of 3-7% accuracy improvement and cost reduction depends on the superiority of disagreement-guided routing, yet the manuscript provides no explicit ablation that applies the three strategies (lightweight resolution, majority voting, rewriting) uniformly to all instances and compares against the routed version. Without this control, the gains could arise primarily from the voting component on moderate cases rather than from the routing decisions themselves.

    Authors: We agree that an explicit ablation applying each strategy uniformly would provide stronger evidence isolating the benefit of routing. In the revised manuscript, we will add experiments that apply lightweight resolution, majority voting, and rewriting uniformly to every instance and directly compare accuracy and sampling cost against the disagreement-guided routed version. Our existing comparisons are to prior uniform scaling baselines, but we acknowledge the value of this additional control using the same strategy set. revision: yes

  2. Referee: [§3] §3 (Method): The disagreement metric, its computation across samples, and the procedure for selecting routing thresholds are not described with sufficient detail to determine whether thresholds were fixed a priori or tuned post-hoc on the same benchmarks. This is load-bearing because post-hoc selection on the evaluation data would undermine the claim that disagreement provides a general, reliable signal for strategy selection.

    Authors: We appreciate the request for greater clarity. The disagreement metric is the fraction of pairwise differing solutions among K samples drawn for an instance. Thresholds are set a priori as fixed quantiles of the disagreement distribution observed on a small held-out validation set of problems drawn from the same distribution as the benchmarks, without access to test labels or post-hoc adjustment on evaluation data. In the revision we will expand §3 with the precise formula, pseudocode for routing, and explicit description of the threshold selection process to confirm it is training-free and general. revision: yes
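Taking the rebuttal at its word, the metric and threshold procedure might look like the following sketch (the quantile values are placeholders, since the rebuttal does not state them):

```python
from itertools import combinations

def pairwise_disagreement(answers):
    """Fraction of sample pairs whose final answers differ, per the
    rebuttal's description of the disagreement metric."""
    pairs = list(combinations(answers, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

def quantile_thresholds(val_scores, q_low=0.5, q_high=0.85):
    """Fixed a-priori routing thresholds taken as quantiles of the
    disagreement distribution on a held-out validation set.
    The quantile choices here are illustrative."""
    s = sorted(val_scores)
    pick = lambda q: s[min(int(q * (len(s) - 1)), len(s) - 1)]
    return pick(q_low), pick(q_high)
```

Because the thresholds are quantiles of validation-set disagreement rather than quantities fit to test accuracy, the procedure stays training-free in the sense the authors claim, provided the validation problems never overlap the evaluation benchmarks.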

Circularity Check

0 steps flagged

No circularity: empirical routing from observed disagreement, training-free

full rationale

The paper's core contribution is a training-free instance-level routing rule that maps measured output disagreement (across samples) into one of three fixed strategies: lightweight resolution for low disagreement, majority voting for moderate, and rewriting for high. This mapping is presented as a direct consequence of an observed empirical correlation between disagreement, difficulty, and correctness; no parameters are fitted to the final accuracy metric, no equations are solved by construction, and no derivation reduces the reported gains to the input data by tautology. Experiments on held-out benchmarks simply measure the outcome of applying the rule. No load-bearing self-citations, ansatz smuggling, or uniqueness theorems appear in the abstract or described method. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that disagreement correlates with difficulty and correctness, plus the choice of three discrete routing tiers; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • disagreement thresholds
    Boundaries separating low, moderate, and high disagreement for strategy selection are not specified in the abstract and are presumed to be chosen or tuned.
axioms (1)
  • domain assumption Output disagreement correlates with instance difficulty and prediction correctness
    Stated directly in the abstract as the key insight enabling the routing framework.

pith-pipeline@v0.9.0 · 5483 in / 1180 out tokens · 42129 ms · 2026-05-07T10:55:04.877182+00:00 · methodology

discussion (0)



    Y . Zhou, Y . Zhu, D. Antognini, Y . Kim, and Y . Zhang. Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models. arXiv preprint arXiv:2404.11500, 2024. 12 A Limitations This work leaves room for further exploration in certain aspects. • Our framework relies on output disagreement as a p...