pith. machine review for the scientific record

arxiv: 2604.26644 · v1 · submitted 2026-04-29 · 💻 cs.AI

Recognition: unknown

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords test-time scaling · disagreement routing · mathematical reasoning · large reasoning models · majority voting · self-rewriting · strategy selection · inference optimization

The pith

Output disagreement routes test-time scaling among lightweight fixes, voting, and rewriting to raise accuracy and cut costs on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often fail on difficult math problems even when extra compute is applied at test time through sampling or correction. The work finds that disagreement among multiple outputs reliably signals both instance difficulty and likely errors. It therefore introduces a training-free router that assigns lightweight resolution to consistent cases, majority voting to moderate disagreement, and rewriting-based reformulation to highly ambiguous ones. On seven mathematical benchmarks and three models the routed approach delivers 3 to 7 percent higher accuracy while using less total sampling than fixed-strategy baselines.

Core claim

We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness. This correlation allows test-time scaling to be recast as an instance-level routing problem that selects among strategies—lightweight resolution for low disagreement, majority voting for medium disagreement, and rewriting for high disagreement—rather than applying one strategy uniformly.

What carries the argument

The disagreement-guided router that measures variance across sampled outputs and dispatches each instance to the matching scaling strategy.
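The routing logic described here reduces to a threshold rule over a disagreement score. A minimal Python sketch, with an illustrative modal-answer disagreement measure and made-up cutoffs (the paper's exact metric and thresholds are not stated in this review):

```python
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the modal answer.
    One plausible disagreement measure; the paper's exact metric may differ."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

def route(answers, low=0.2, high=0.6):
    """Dispatch an instance to a scaling strategy by disagreement level.
    The `low`/`high` thresholds are illustrative, not from the paper."""
    d = disagreement(answers)
    if d <= low:
        return "lightweight_resolution"  # consistent: accept cheaply
    if d <= high:
        return "majority_vote"           # moderate: aggregate samples
    return "rewrite"                     # ambiguous: reformulate the problem

# six sampled final answers for one instance, five of them agreeing
print(route(["42", "42", "42", "42", "42", "41"]))  # → lightweight_resolution
```

The point of the sketch is that the router itself is nearly free: all of the compute stays in the sampling and in whichever strategy the instance is routed to.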

If this is right

  • Low-disagreement instances are solved accurately with minimal sampling.
  • Moderate disagreement benefits from voting to aggregate multiple predictions.
  • High disagreement triggers rewriting to reformulate and resolve ambiguity.
  • Across models the method yields 3-7% accuracy gains at reduced total sampling cost.
  • No model retraining is required for the routing decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disagreement signal might adaptively allocate compute in code or planning domains where hardness is similarly reflected in output variance.
  • Learned rather than fixed disagreement thresholds could further optimize the routing boundaries per model.
  • Pairing the router with tree search inside the rewrite branch could compound gains on the hardest instances.

Load-bearing premise

Disagreement among outputs is a reliable indicator of both problem difficulty and whether the answer is correct.

What would settle it

If, on a new set of math problems, high-disagreement instances do not in fact show lower accuracy, or if the router fails to beat uniform majority voting on either accuracy or cost, the claimed routing benefit would be refuted.
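That refutation test can be run mechanically: bin instances by measured disagreement and check whether accuracy falls as disagreement rises. A hedged sketch, with illustrative bin edges:

```python
def accuracy_by_disagreement(records, bins=((0.0, 0.2), (0.2, 0.6), (0.6, 1.01))):
    """records: (disagreement, correct) pairs for a held-out problem set.
    Returns accuracy per disagreement bin; the load-bearing premise
    predicts a decreasing sequence. Bin edges here are illustrative."""
    accs = []
    for lo, hi in bins:
        hits = [correct for d, correct in records if lo <= d < hi]
        accs.append(sum(hits) / len(hits) if hits else None)
    return accs

# the premise holds when the returned accuracies decrease across bins
```

If the sequence is flat or non-monotone on fresh problems, disagreement is not carrying the signal the routing rule needs.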

Figures

Figures reproduced from arXiv: 2604.26644 by Dong Li, Jinpeng Li, Junhua Fang, Juntao Li, Min Zhang, Yixin Ji, Yu Luo, Zhimin Lin.

Figure 1: The effect of rewriting and majority voting (6 samplings) for Qwen3-8B across instances of …
Figure 2: This figure illustrates the overall workflow of our method, where MDD, NDS, MDS, and …
Figure 3: The number of samplings used by different methods. The dashed line indicates the …
Figure 4: Average tokens and wall-clock time per sample for Qwen3-8B on Math500, Gaokao En, …
Figure 5: (Left) Effective and harmful rewriting ratios across different error types. (Right) Comparison …
Figure 6: The figure shows the recall (%) of incorrect samples as a function of the number of iterations …
original abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free framework for test-time scaling in large reasoning models that treats scaling as an instance-level routing problem: it measures output disagreement across samples and routes low-disagreement instances to lightweight resolution, moderate-disagreement cases to majority voting, and high-disagreement instances to rewriting-based reformulation. Experiments on seven mathematical benchmarks and three models are reported to yield 3-7% accuracy gains with reduced sampling cost relative to existing uniform scaling approaches.

Significance. If the disagreement signal proves reliable for partitioning instances into regimes where each strategy is optimal, the approach could improve the efficiency of test-time compute by avoiding unnecessary application of expensive methods on easy cases. The training-free design and evaluation across multiple benchmarks and models are positive features that would support practical adoption if the routing decisions are shown to be robust.

major comments (2)
  1. [§4] §4 (Experimental Results): The central claim of 3-7% accuracy improvement and cost reduction depends on the superiority of disagreement-guided routing, yet the manuscript provides no explicit ablation that applies the three strategies (lightweight resolution, majority voting, rewriting) uniformly to all instances and compares against the routed version. Without this control, the gains could arise primarily from the voting component on moderate cases rather than from the routing decisions themselves.
  2. [§3] §3 (Method): The disagreement metric, its computation across samples, and the procedure for selecting routing thresholds are not described with sufficient detail to determine whether thresholds were fixed a priori or tuned post-hoc on the same benchmarks. This is load-bearing because post-hoc selection on the evaluation data would undermine the claim that disagreement provides a general, reliable signal for strategy selection.
minor comments (2)
  1. [Abstract] Abstract and §2: The statement that disagreement is 'strongly correlated with instance difficulty and prediction correctness' should be supported by a quantitative plot or table (e.g., correlation coefficient or accuracy-vs-disagreement curve) rather than left as a qualitative observation.
  2. [§4] §4: The manuscript should report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the accuracy improvements and include variance across random seeds for the sampling-based methods.
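The significance test asked for in the second minor comment is cheap to add. A sketch of a paired bootstrap confidence interval on the per-instance accuracy difference (function and parameter names are illustrative, not from the paper):

```python
import random

def paired_bootstrap_ci(base, routed, n_boot=2000, alpha=0.05, seed=0):
    """95% CI for mean(routed - base), where `base` and `routed` are
    per-instance 0/1 correctness under each method on the same problems.
    If the interval excludes 0, the routed gain is unlikely to be noise."""
    rng = random.Random(seed)
    diffs = [r - b for b, r in zip(base, routed)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling the paired differences, rather than each method's accuracy separately, keeps the per-instance correlation between the two methods in the interval.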

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and method.

point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The central claim of 3-7% accuracy improvement and cost reduction depends on the superiority of disagreement-guided routing, yet the manuscript provides no explicit ablation that applies the three strategies (lightweight resolution, majority voting, rewriting) uniformly to all instances and compares against the routed version. Without this control, the gains could arise primarily from the voting component on moderate cases rather than from the routing decisions themselves.

    Authors: We agree that an explicit ablation applying each strategy uniformly would provide stronger evidence isolating the benefit of routing. In the revised manuscript, we will add experiments that apply lightweight resolution, majority voting, and rewriting uniformly to every instance and directly compare accuracy and sampling cost against the disagreement-guided routed version. Our existing comparisons are to prior uniform scaling baselines, but we acknowledge the value of this additional control using the same strategy set. revision: yes

  2. Referee: [§3] §3 (Method): The disagreement metric, its computation across samples, and the procedure for selecting routing thresholds are not described with sufficient detail to determine whether thresholds were fixed a priori or tuned post-hoc on the same benchmarks. This is load-bearing because post-hoc selection on the evaluation data would undermine the claim that disagreement provides a general, reliable signal for strategy selection.

    Authors: We appreciate the request for greater clarity. The disagreement metric is the fraction of pairwise differing solutions among K samples drawn for an instance. Thresholds are set a priori as fixed quantiles of the disagreement distribution observed on a small held-out validation set of problems drawn from the same distribution as the benchmarks, without access to test labels or post-hoc adjustment on evaluation data. In the revision we will expand §3 with the precise formula, pseudocode for routing, and explicit description of the threshold selection process to confirm it is training-free and general. revision: yes
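Taking the rebuttal at its word, the metric and threshold procedure might look like the following sketch (the quantile values are placeholders, since the rebuttal does not state them):

```python
from itertools import combinations

def pairwise_disagreement(answers):
    """Fraction of sample pairs whose final answers differ, per the
    rebuttal's description of the disagreement metric."""
    pairs = list(combinations(answers, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

def quantile_thresholds(val_scores, q_low=0.5, q_high=0.85):
    """Fixed a-priori routing thresholds taken as quantiles of the
    disagreement distribution on a held-out validation set.
    The quantile choices here are illustrative."""
    s = sorted(val_scores)
    pick = lambda q: s[min(int(q * (len(s) - 1)), len(s) - 1)]
    return pick(q_low), pick(q_high)
```

Because the thresholds are quantiles of validation-set disagreement rather than quantities fit to test accuracy, the procedure stays training-free in the sense the authors claim, provided the validation problems never overlap the evaluation benchmarks.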

Circularity Check

0 steps flagged

No circularity: empirical routing from observed disagreement, training-free

full rationale

The paper's core contribution is a training-free instance-level routing rule that maps measured output disagreement (across samples) into one of three fixed strategies: lightweight resolution for low disagreement, majority voting for moderate, and rewriting for high. This mapping is presented as a direct consequence of an observed empirical correlation between disagreement, difficulty, and correctness; no parameters are fitted to the final accuracy metric, no equations are solved by construction, and no derivation reduces the reported gains to the input data by tautology. Experiments on held-out benchmarks simply measure the outcome of applying the rule. No load-bearing self-citations, ansatz smuggling, or uniqueness theorems appear in the abstract or described method. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that disagreement correlates with difficulty and correctness, plus the choice of three discrete routing tiers; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • disagreement thresholds
    Boundaries separating low, moderate, and high disagreement for strategy selection are not specified in the abstract and are presumed to be chosen or tuned.
axioms (1)
  • domain assumption Output disagreement correlates with instance difficulty and prediction correctness
    Stated directly in the abstract as the key insight enabling the routing framework.

pith-pipeline@v0.9.0 · 5483 in / 1180 out tokens · 42129 ms · 2026-05-07T10:55:04.877182+00:00 · methodology

discussion (0)



    Y . Zhou, Y . Zhu, D. Antognini, Y . Kim, and Y . Zhang. Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models. arXiv preprint arXiv:2404.11500, 2024. 12 A Limitations This work leaves room for further exploration in certain aspects. • Our framework relies on output disagreement as a p...