pith. machine review for the scientific record.

arxiv: 2601.21619 · v2 · submitted 2026-01-29 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 10:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords overscaling curse · parallel thinking · LLM reasoning · budget prediction · latent representations · sample efficiency · decoding optimization

The pith

Sample-specific budget prediction from latent states resolves the overscaling curse in parallel LLM thinking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Parallel thinking improves LLM reasoning by sampling multiple paths and aggregating results, yet current practice assigns every input the same global budget sized for peak dataset accuracy. Many samples reach their best performance with far fewer paths, so the shared budget wastes computation on the majority of cases. The paper identifies this mismatch as the overscaling curse and demonstrates that model latent representations already encode enough signal to forecast the minimal budget each sample needs. Predicting these budgets in advance raises overall utilization while holding dataset accuracy steady and enables pre-decoding allocation that preserves parallel decoding speed.
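
To make the waste concrete, here is a toy calculation (all numbers invented for illustration, not drawn from the paper). Utilization, read as the ratio of paths actually needed to paths allocated, collapses once the shared budget is sized for the hardest few samples:

```python
# Toy illustration of the overscaling curse (hypothetical numbers, not from the paper).
# Each sample saturates at some minimal budget N*_x; the global budget N_D must be
# sized for the hardest samples if dataset accuracy is to be maximized.

optimal_budgets = [2, 1, 4, 2, 64, 3, 2, 128, 1, 2]   # per-sample N*_x (invented)
global_budget = max(optimal_budgets)                   # N_D sized for the worst case

paths_needed = sum(optimal_budgets)                    # 209
paths_spent = global_budget * len(optimal_budgets)     # 128 * 10 = 1280
utilization = paths_needed / paths_spent               # ≈ 0.16

print(f"N_D = {global_budget}, spent = {paths_spent}, needed = {paths_needed}")
print(f"budget utilization = {utilization:.1%}")       # most compute is wasted
```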

Core claim

The central claim is that the overscaling curse arises because a single global sampling budget, chosen to maximize dataset accuracy, necessarily over-allocates paths to the many individual samples whose accuracy saturates earlier. This contradiction between system efficacy and sample efficiency can be broken by a Latent Budget Predictor (LanBo) that reads latent representations to assign sample-specific budgets, improving utilization without accuracy loss and supporting a Pre-decoding Budget Adaptation (PreAda) scheme that allocates budgets before decoding begins.

What carries the argument

Latent Budget Predictor (LanBo), a module that probes internal model representations to forecast the smallest number of parallel paths required for each input to reach its individual accuracy peak.
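
As a concrete sketch of what such a probe could look like: a small MLP head over a frozen model's final-token hidden state, scaled into [1, N_max]. The sizes and layer choices below are assumptions for illustration, not the paper's released implementation, though the sigmoid-output-times-N_max form echoes the estimate N̂* = ϕ_θ(h)·N_max that surfaces in the paper's Figure 5 caption, and the ⌊r·d⌋ hidden width mirrors the ablation in Figures 21–22.

```python
import torch
import torch.nn as nn

class LatentBudgetProbe(nn.Module):
    """Illustrative sketch of a latent budget predictor: an MLP probe over the
    final-token hidden state, scaled into (0, N_max]. Sizes and layers are
    assumptions, not the paper's code; the sigmoid-times-N_max output follows
    the estimate N-hat* = phi_theta(h) * N_max in the paper's Figure 5 caption."""

    def __init__(self, hidden_dim: int, n_max: int = 128, r: float = 0.25):
        super().__init__()
        self.n_max = n_max
        probe_dim = max(1, int(r * hidden_dim))   # MLP hidden size ⌊r·d⌋
        self.phi = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
            nn.Sigmoid(),                          # normalized budget in (0, 1)
        )

    def forward(self, final_token_hidden: torch.Tensor) -> torch.Tensor:
        # final_token_hidden: (batch, hidden_dim) state of the prompt's last token
        return self.phi(final_token_hidden).squeeze(-1) * self.n_max

# Usage with stand-in tensors (a real probe would read a frozen base model's states):
probe = LatentBudgetProbe(hidden_dim=4096)
h = torch.randn(8, 4096)                           # hypothetical layer-L states
n_hat = probe(h).ceil().clamp(min=1)               # integer budgets in [1, N_max]
```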

If this is right

  • Overall budget utilization rises while dataset-level accuracy stays constant.
  • Pre-decoding allocation becomes possible, preserving full parallelization during the generation phase (see the sketch after this list).
  • Hardware metrics improve in both end-to-end latency and peak memory consumption.
  • The same predictor can be dropped into existing multi-path decoding pipelines without retraining the base model.
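
A minimal sketch of that pre-decoding flow, assuming a generic multi-path sampler and majority-vote aggregation; none of the function names are from the paper, and the paper's actual PreAda integration may differ:

```python
from collections import Counter
from typing import Callable, List

def pre_decoding_budget_adaptation(
    questions: List[str],
    predict_budget: Callable[[str], int],           # e.g. a LanBo-style latent probe
    sample_paths: Callable[[str, int], List[str]],  # draws N reasoning paths at once
) -> List[str]:
    """Hypothetical PreAda-style flow: each budget is fixed from the prompt's
    latent state before any token is generated, so all N paths for a question
    launch together, with no mid-generation stopping decisions. A production
    system would batch paths across questions; the loop keeps the sketch simple."""
    answers = []
    for q in questions:
        n = max(1, predict_budget(q))               # sample-specific budget, up front
        paths = sample_paths(q, n)                  # all n paths decode in parallel
        answers.append(Counter(paths).most_common(1)[0][0])  # self-consistency vote
    return answers
```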

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internal states appear to encode per-input reasoning difficulty or convergence rate, which could be exploited by other adaptive sampling schemes.
  • The method may generalize to chain-of-thought or tree-of-thought variants where path count also trades off against accuracy.
  • Production deployments could realize direct cost savings by avoiding over-sampling on the large fraction of easy inputs.
  • Combining the predictor with early-stopping rules during generation might yield further efficiency gains.

Load-bearing premise

Model latent representations already contain enough information to predict the sample-specific optimal budget accurately without any extra labeled data or tuning steps that would themselves consume the budget being saved.

What would settle it

Measure the correlation between LanBo-predicted budgets and the true minimal budgets found by exhaustive per-sample search on a held-out test set; if accuracy falls when the predicted budgets replace the oracle budgets, the central claim is falsified.
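
The oracle in this test is the sample-optimal parallelism level the paper defines through the per-sample cost-accuracy function A_x(N); the definition below is reconstructed in LaTeX from text accompanying Figure 3:

```latex
% Sample-optimal parallelism (the paper's Eq. 3, reconstructed):
% the smallest N in [1, N_max] that attains the sample's maximum accuracy,
% together with its dataset-level expectation.
N^{*}_{x} = \min\Bigl(\operatorname*{arg\,max}_{N \in [1,\, N_{\max}]} A_x(N)\Bigr),
\qquad
N^{*}_{D} = \mathbb{E}_{(x,y)\sim P_D}\bigl[N^{*}_{x}\bigr].
```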

Figures

Figures reproduced from arXiv: 2601.21619 by Rui Wang, Yiming Wang, Zhuosheng Zhang.

Figure 1: The Overscaling Curse of Parallel Thinking. When evaluating an entire dataset D in a model, a large global parallelism level N_D is often used to maximize overall performance, as in Episode (ii). Under this, as in Episode (i), only type-(4) samples truly benefit, since they indeed require large N to realize substantial gains. In contrast, the other sample types do not benefit because they reach their best p…
Figure 2: T2: Thinking Parallelism Before Parallel Thinking. We introduce trainable layer-wise estimators that predict the optimal parallelism level for each input from its final-token representations. These estimators are first trained, and each is assigned a weight based on its layer-wise validation error. During inference, after encoding the input, the layer-weighted parallelism estimate N̂* is obtained, and the…
Figure 3: OverScaling Index M_D across models and datasets, with detailed (N*_D / N_D) labeled below each value.
Figure 4: Proportion of the five sample types across datasets in Qwen3-4B. Results for other models are shown in Appendix C.3.
Figure 5: Estimation Results of Layer-wise Estimators. Each estimator is trained over 8 runs. Points indicate the mean, while the shaded areas indicate the standard deviation. Blue lines denote in-domain datasets, red lines out-of-domain datasets.
Figure 6: Examples of the "cost-accuracy" function A_x(N) for Type-(3) samples from Qwen2.5-7B on the MATH500 dataset. (Axes here and in Figures 8–17: Parallelism Level N, the computational cost, vs. sample accuracy.)
Figure 7.
Figure 8: Examples of the "cost-accuracy" function A_x(N) for Type-(5) samples from Qwen2.5-7B on the MATH500 dataset.
Figure 9.
Figure 10: Examples of the "cost-accuracy" function A_x(N) for Type-(4) samples from Qwen2.5-7B on the AIME24 dataset.
Figure 11: Examples of the "cost-accuracy" function A_x(N) for Type-(3) samples from Qwen2.5-7B on the AIME25 dataset.
Figure 12: Examples of the "cost-accuracy" function A_x(N) for Type-(4) samples from Qwen2.5-7B on the AIME25 dataset.
Figure 13: Examples of the "cost-accuracy" function A_x(N) for Type-(3) samples from Qwen3-4B on the AIME24 dataset.
Figure 14: Examples of the "cost-accuracy" function A_x(N) for Type-(4) samples from Qwen3-4B on the AIME24 dataset.
Figure 15: Examples of the "cost-accuracy" function A_x(N) for Type-(3) samples from Qwen3-4B on the AIME25 dataset.
Figure 16: Examples of the "cost-accuracy" function A_x(N) for Type-(4) samples from Qwen3-4B on the AIME25 dataset.
Figure 17: Examples of the "cost-accuracy" function A_x(N) for Type-(5) samples from Qwen3-4B on the AIME25 dataset.
Figure 18: Proportion of the five sample types across datasets in Qwen2.5-7B.
Figure 19: Proportion of the five sample types across datasets in Llama3.1-8B.
Figure 20: Proportion of the five sample types across datasets in Deepseek-R1-Distill-Qwen-7B.
Figure 21: Results under varying scaling factor r of MLP hidden size ⌊rd⌋ on the MATH500 dataset.
Figure 22: Results under varying scaling factor r of MLP hidden size ⌊rd⌋ on the AIME25 dataset.
Figure 23: Results under varying training data sizes for estimators on the MATH500 dataset.
Figure 24: Results under varying training data sizes for estimators on the AIME25 dataset.
original abstract

Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, due to a lack of sample-specific priors, all samples share a global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, causing low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes model latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation (PreAda), a paradigm that allocates budgets before decoding to preserve decoding-time parallelization. LanBo substantially improves hardware-aware efficiency in latency and memory, demonstrating both its practical value and the promise of LanBo for efficient parallel decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies the 'Overscaling Curse' in parallel thinking for LLMs, where a global budget for multi-path sampling maximizes dataset accuracy but yields low utilization because many samples saturate at much smaller per-sample budgets. It proposes the Latent Budget Predictor (LanBo) that probes model latent representations to predict sample-specific optimal budgets, integrates this into Pre-decoding Budget Adaptation (PreAda) to allocate budgets before decoding, and reports gains in budget utilization, latency, and memory while preserving dataset accuracy.

Significance. If the empirical claims hold, the work offers a practical route to reconcile system-level and sample-level efficiency in parallel reasoning methods such as self-consistency. The formal analysis of the curse, the pre-decoding paradigm, and hardware-aware metrics constitute clear strengths that could influence efficient deployment of multi-path techniques.

major comments (2)
  1. [§3.2] LanBo training protocol: the supervision signal for optimal per-sample budgets is not specified. If labels are obtained via exhaustive per-sample sweeps over budget values, the pre-computation cost must be measured and shown to be amortized or negligible; otherwise the net efficiency gain is unclear.
  2. [§5.1, Table 2] Utilization results: the reported utilization improvement depends on LanBo prediction accuracy, yet no error analysis (e.g., fraction of over- or under-predictions, MAE on budget) is provided. Without this, it is impossible to determine whether the gains arise from latent information or from test-set characteristics.
minor comments (2)
  1. [Figure 3] Latency/memory plots: axis labels and legend entries are too small for readability; enlarge fonts and add error bars if multiple runs were performed.
  2. [§2.1] Notation: the symbol B* for optimal budget is introduced without an explicit equation; add a short definition in §2.1 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our work. We address the major comments point by point below, and we will incorporate the suggested clarifications and additional analyses in the revised manuscript.

point-by-point responses
  1. Referee: [§3.2] LanBo training protocol: the supervision signal for optimal per-sample budgets is not specified. If labels are obtained via exhaustive per-sample sweeps over budget values, the pre-computation cost must be measured and shown to be amortized or negligible; otherwise the net efficiency gain is unclear.

    Authors: We appreciate this observation. The supervision signal for LanBo is derived from per-sample sweeps on a validation set to determine the smallest budget achieving maximum accuracy for each sample. We acknowledge that the pre-computation cost was not explicitly quantified in the original submission. In the revision, we will report the time required for these sweeps and demonstrate that it is amortized over repeated use of the model on similar data distributions, leading to net efficiency gains. We will also discuss how this cost compares to the savings in inference time. revision: yes

  2. Referee: [§5.1, Table 2] Utilization results: the reported utilization improvement depends on LanBo prediction accuracy, yet no error analysis (e.g., fraction of over- or under-predictions, MAE on budget) is provided. Without this, it is impossible to determine whether the gains arise from latent information or from test-set characteristics.

    Authors: We agree that providing an error analysis is important to substantiate the source of the gains. In the revised manuscript, we will add an analysis including the Mean Absolute Error (MAE) between predicted and optimal budgets, the percentages of over-predictions and under-predictions, and an ablation study comparing LanBo to random budget assignment. This will clarify that the improvements are due to the predictive power of the latent representations rather than inherent properties of the test set. revision: yes
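
For concreteness, the promised error analysis could be computed along these lines; the paired-list layout and the uniform random baseline below are illustrative assumptions, not the authors' code:

```python
import random
import statistics
from typing import Dict, List

def budget_error_analysis(predicted: List[int], oracle: List[int],
                          n_max: int = 128) -> Dict[str, float]:
    """Sketch of the rebuttal's promised analysis: MAE between predicted and
    oracle budgets, over-/under-prediction fractions, and a random-assignment
    baseline. Assumes paired per-sample budget lists."""
    n = len(oracle)
    mae = statistics.mean(abs(p - o) for p, o in zip(predicted, oracle))
    over = sum(p > o for p, o in zip(predicted, oracle)) / n
    under = sum(p < o for p, o in zip(predicted, oracle)) / n
    random_mae = statistics.mean(
        abs(random.randint(1, n_max) - o) for o in oracle)
    return {"mae": mae, "over_frac": over, "under_frac": under,
            "random_mae": random_mae}
```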

Circularity Check

0 steps flagged

No circularity: LanBo prediction from latents is independent of the accuracy metric it targets

full rationale

The paper's core claim is a formal analysis of the overscaling curse followed by a proposal to predict per-sample budgets from existing model latent representations. No equations in the provided text define the predictor output in terms of the accuracy or utilization it is meant to improve, nor do any steps reduce a 'prediction' to a fitted input by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled through citations. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review limits visibility into parameters and assumptions. No explicit free parameters, axioms, or invented entities beyond the two named methods are described.

invented entities (2)
  • Latent Budget Predictor (LanBo) no independent evidence
    purpose: Predict sample-specific optimal sampling budgets from model latent representations
    New component introduced to address the overscaling issue
  • Pre-decoding Budget Adaptation (PreAda) no independent evidence
    purpose: Allocate per-sample budgets before decoding begins to preserve parallelization
    New paradigm that integrates LanBo into the decoding pipeline

pith-pipeline@v0.9.0 · 5474 in / 1366 out tokens · 18655 ms · 2026-05-16T10:35:00.539571+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 18 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2] Aggarwal, P., Madaan, A., Yang, Y., et al. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396.

  3. [3] Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

  4. [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

  5. [5] Chen, M., Hui, B., Cui, Z., Yang, J., Liu, D., Sun, J., Lin, J., and Liu, Z. Parallel scaling law for language models. arXiv preprint arXiv:2505.10475, 2025.

  6. [6] Dong, H., Brandfonbrener, D., Helenowski, E., He, Y., Kumar, M., Fang, H., Chi, Y., and Sankararaman, K. A. Generalized parallel scaling with interdependent generations. arXiv preprint arXiv:2510.01143, 2025.

  7. [7] Fu, Y., Wang, X., Tian, Y., and Zhao, J. Deep think with confidence. arXiv preprint arXiv:2508.15260.

  8. [8] Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., and Goodman, N. D. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. arXiv preprint arXiv:2503.01307.

  9. [9] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  10. [10] Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M. rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519.

  11. [11] He, Z., Liang, T., Xu, J., Liu, Q., Chen, X., Wang, Y., Song, L., Yu, D., Liang, Z., Wang, W., et al. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.

  12. [12] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  13. [13] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  14. [14] Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

  15. [15] Hong, C., Guo, X., Singh, A. C., Choukse, E., and Ustiugov, D. Slim-sc: Thought pruning for efficient scaling with self-consistency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 34488–34505.

  16. [16] Kang, Z., Zhao, X., and Song, D. Scalable best-of-N selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.

  17. [17] Li, B., Zhang, D., Wu, J., Yin, W., Tao, Z., Zhao, Y., Zhang, L., Shen, H., Fang, R., Xie, P., et al. ParallelMuse: Agentic parallel thinking for deep information seeking. arXiv preprint arXiv:2510.24698, 2025.

  18. [18] Li, Y., Gu, Q., Wen, Z., Li, Z., Xing, T., Guo, S., Zheng, T., Zhou, X., Qu, X., Zhou, W., et al. Treepo: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling. arXiv preprint arXiv:2508.17445, 2025.

  19. [19] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332.

  20. [20] Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Prompting LLMs for efficient parallel generation. arXiv preprint arXiv:2307.15337.

  21. [21] Rodionov, G., Garipov, R., Shutova, A., Yakushev, G., Schultheis, E., Egiazarian, V., Sinitsin, A., Kuznedelev, D., and Alistarh, D. Hogwild! inference: Parallel LLM generation via concurrent attention. arXiv preprint arXiv:2504.06261.

  22. [22] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  23. [23] Skean, O., Arefin, M. R., Zhao, D., Patel, N., Naghiyev, J., LeCun, Y., and Shwartz-Ziv, R. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

  24. [24] Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  25. [25] Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  26. [26] Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845, 2024.

  27. [27] Wang, X., Feng, S., Li, Y., Yuan, P., Zhang, Y., Tan, C., Pan, B., Hu, Y., and Li, K. Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 6904–6917, 2025.

  28. [28] Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.

  29. [29] Wu, T., Liu, Y., Bai, J., Jia, Z., Zhang, S., Lin, Z., Wang, Y., Zhu, S.-C., and Zheng, Z. Native parallel reasoner: Reasoning in parallelism via self-distilled reinforcement learning. arXiv preprint arXiv:2512.07461.

  30. [30] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

  31. [31] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

  32. [32] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  33. [33] Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387.

  34. [34] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

  35. [35] Zeng, W., He, K., Kuang, C., Li, X., and He, J. Pushing test-time scaling limits of deep search with asymmetric verification. arXiv preprint arXiv:2510.06135.

  36. [36] Zhang, D., Huang, X., Zhou, D., Li, Y., and Ouyang, W. Accessing GPT-4 level mathematical olympiad solutions via Monte Carlo tree self-refine with Llama-3 8B. arXiv preprint arXiv:2406.07394.

  37. [37] Zheng, T., Zhang, H., Yu, W., Wang, X., Dai, R., Liu, R., Bao, H., Huang, C., Huang, H., and Yu, D. Parallel-R1: Towards parallel thinking via reinforcement learning. arXiv preprint arXiv:2509.07980.