Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning

Hung Le; Manh Nguyen; Sunil Gupta

arxiv: 2605.26849 · v1 · pith:PU5DYDXEnew · submitted 2026-05-26 · 💻 cs.CL

Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning

Manh Nguyen , Sunil Gupta , Hung Le This is my paper

Pith reviewed 2026-06-29 18:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords uncertainty-aware budget allocationtest-time reasoningsampling budget allocationlanguage model reasoningnegative log-likelihoodmarginal-greedy algorithmadaptive compute allocationconcave optimization

0 comments

The pith

Allocating extra samples to high-uncertainty questions via marginal-greedy on single-generation negative log-likelihood raises reasoning accuracy over uniform sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Uncertainty-Aware Budget Allocation to fix the inefficiency of giving every question the same number of samples during test-time reasoning. It runs one generation on each question, extracts average negative log-likelihood directly from the model's output probabilities as a difficulty signal, and then uses a marginal-greedy solver on a concave coverage objective to hand the remaining budget to the most uncertain questions. Because the first sample already contributes to the final answer vote, the method adds no auxiliary models or extra forward passes. Experiments across six models and five benchmarks show consistent gains, largest when total samples per question are small. The central mechanism therefore turns a fixed compute budget into higher accuracy by matching sample count to per-question uncertainty.

Core claim

UAB is a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1 every question receives one generation whose average negative log-likelihood serves as the difficulty signal while also counting toward the final vote. In Phase 2 the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly, giving more samples to uncertain questions and fewer to confident ones.

What carries the argument

Marginal-greedy algorithm solving the concave coverage-maximization surrogate for budget allocation driven by average negative log-likelihood signals from the first generation.

If this is right

Accuracy improves by up to 3 percent on average and 5 percent on individual benchmarks relative to uniform allocation.
Gains are largest in low-resource regimes where total samples per question are small.
The approach applies to both open-weight and black-box models from 1.5 B to 27 B parameters.
No auxiliary model or second LLM call is required because the uncertainty signal comes from the first generation itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ANLL signal could be used to decide when to stop sampling early for very confident questions, freeing budget for other questions.
Combining UAB with existing chain-of-thought or self-consistency variants might compound the accuracy lift.
If the concave surrogate is a good proxy for actual accuracy gain, similar marginal-greedy allocation could apply to other adaptive test-time techniques such as tree search or tool use.

Load-bearing premise

The average negative log-likelihood from one generation is a reliable signal of how much additional sampling that question still needs.

What would settle it

On a new reasoning benchmark, run uniform sampling versus UAB with the same total budget per question and observe whether the accuracy ordering reverses or stays flat.

Figures

Figures reproduced from arXiv: 2605.26849 by Hung Le, Manh Nguyen, Sunil Gupta.

**Figure 1.** Figure 1: Uniform vs. Uncertainty-Aware Budget (UAB) allocation. Under total budget B=12 across three questions, Uniform allocates N=4 per question, wasting samples on the easy question (67% accuracy). UAB reallocates the same budget by per-question confidence, achieving 100% accuracy at no extra cost. yields a smaller accuracy gain than the previous one, and this marginal value varies sharply across questions (Bro… view at source ↗

**Figure 2.** Figure 2: Average accuracy (%) vs. per-question budget [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Minimum budget N to first reach a given accuracy target (Qwen2.5-1.5B); lower is better. Curves invert a PCHIP interpolation of the accuracy–budget relationship. Results for Llama3.2-3B are in Appendix [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Budget allocation and ANLL–correctness correlation on MATH-500. Left: average allocated samples per ANLL decile. Right: average correctness per ANLL decile (Pearson r ≈ −0.29). L1 L2 L3 L4 L5 Difficulty level 0 1 2 3 4 5 Avg. allocated samples Qwen2.5-1.5B L1 L2 L3 L4 L5 Difficulty level 0 1 2 3 4 5 6 Llama3.2-3B LLM-Judge Uniform UAB (Ours) [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Average allocated samples vs. MATH-500 an [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Accuracy vs. temperature T ∈ {0.1, 0.2, 0.5, 1, 2} across all benchmarks and both models [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Budget N required to reach a given accuracy target (Llama3.2-3B). Same format as [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UAB is a straightforward two-phase allocator that spends one sample to get ANLL then uses marginal-greedy on the rest, and the reported gains look real on the tested setups, but the paper never shows that ANLL actually ranks questions by expected accuracy lift from extra samples.

read the letter

The core idea is simple and practical. Run one generation per question, pull ANLL from its log-probs as a difficulty score, then solve a concave coverage problem with marginal-greedy to hand out the remaining budget. No extra model, no extra calls, works on black-box APIs. They show +3% average accuracy lift and up to +5% on single benchmarks across six models and five tasks, with bigger wins when the total budget is tight.

What stands out is that the allocation step is presented as exactly solvable rather than approximated, and the code is released. That makes the method easy to reproduce and test. The experiments cover both open-weight and closed models from 1.5B to 27B, which is better than the usual single-model setup.

The soft spot is the missing link between ANLL and actual marginal benefit. The method assumes the first-sample ANLL tells you which questions will improve most from extra votes, but nothing in the reported results checks that correlation directly. If ANLL mostly reflects surface fluency instead of reasoning error, the reallocation collapses toward uniform and the gains should shrink or vanish. The abstract gives aggregate numbers but no per-question breakdown, no ablation swapping ANLL for another signal, and no error analysis on the difficulty ranking itself.

This is useful for anyone already doing majority-vote or self-consistency at test time and looking for a cheap way to tilt the budget. It is not a theoretical advance and does not change how we think about uncertainty in LLMs, but the implementation is clean enough that a referee could check the claims without much trouble. I would send it out for review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Uncertainty-Aware Budget Allocation (UAB), a two-phase procedure for adaptive test-time sampling in LLM reasoning. Phase 1 assigns one generation to each question and extracts its average negative log-likelihood (ANLL) directly from output log-probabilities as a difficulty signal; Phase 2 reallocates the remaining fixed budget via a marginal-greedy algorithm that exactly solves a concave coverage-maximization surrogate. Across six models (1.5B–27B parameters) and five reasoning benchmarks, UAB is reported to improve average accuracy by up to +3% and individual-benchmark accuracy by up to +5%, with largest gains in low-resource regimes and no auxiliary models or extra LLM calls required. Code is released publicly.

Significance. If the empirical claims hold, the work supplies a practical, zero-overhead technique for improving reasoning performance by reallocating a fixed sampling budget according to intrinsic model uncertainty. The public code release supports reproducibility and enables direct follow-up experiments.

major comments (2)

[Abstract, Phase 1 description] Abstract, Phase 1 description: the central claim that ANLL from a single generation serves as a reliable per-question difficulty signal for the marginal-greedy allocator rests on an unverified assumption that ANLL ranks questions by expected coverage improvement from additional samples. No correlation analysis, ablation against alternative signals, or direct validation against per-question accuracy gains is described, leaving the reported accuracy improvements dependent on an unexamined experimental choice.
[Abstract, Evaluation paragraph] Abstract, Evaluation paragraph: the stated gains (+3% average accuracy, +5% on individual benchmarks) are presented without accompanying experimental details such as standard deviations across runs, number of trials, statistical significance tests, or ablations that isolate the marginal-greedy allocator from uniform sampling or the initial sample, making it impossible to assess whether the performance delta is robust or attributable to the proposed method.

minor comments (1)

[Abstract] The abstract refers to a 'concave integer optimization framework' and 'concave coverage-maximization surrogate' without stating the precise objective function or the exact form of the marginal-greedy update rule; a short equation or pseudocode excerpt would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the validation of the ANLL signal and the reporting of experimental results. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract, Phase 1 description] Abstract, Phase 1 description: the central claim that ANLL from a single generation serves as a reliable per-question difficulty signal for the marginal-greedy allocator rests on an unverified assumption that ANLL ranks questions by expected coverage improvement from additional samples. No correlation analysis, ablation against alternative signals, or direct validation against per-question accuracy gains is described, leaving the reported accuracy improvements dependent on an unexamined experimental choice.

Authors: We agree that the manuscript does not present correlation analysis between ANLL and per-question accuracy gains, nor ablations against alternative signals such as output entropy or self-consistency variance. The choice of ANLL is motivated by its zero-cost extraction from the single generation's log-probabilities and its direct relation to model uncertainty. In revision we will add a dedicated analysis section reporting Pearson/Spearman correlations of ANLL with observed accuracy improvements from additional samples, plus ablations replacing ANLL with entropy-based and variance-based signals while keeping the marginal-greedy allocator fixed. revision: yes
Referee: [Abstract, Evaluation paragraph] Abstract, Evaluation paragraph: the stated gains (+3% average accuracy, +5% on individual benchmarks) are presented without accompanying experimental details such as standard deviations across runs, number of trials, statistical significance tests, or ablations that isolate the marginal-greedy allocator from uniform sampling or the initial sample, making it impossible to assess whether the performance delta is robust or attributable to the proposed method.

Authors: The referee correctly notes the absence of these details. The current manuscript reports point estimates only. In the revision we will expand the evaluation section to report mean accuracy and standard deviation over five independent runs, include paired t-test p-values against baselines, and add explicit ablations that (i) replace the marginal-greedy allocator with uniform allocation of the remaining budget and (ii) compare full UAB against using only the Phase-1 sample. These changes will allow readers to assess robustness and isolate the contribution of the allocator. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper extracts ANLL directly from the single generation's output log-probabilities in Phase 1 and applies a marginal-greedy algorithm that solves the concave coverage-maximization surrogate exactly in Phase 2. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work; the surrogate objective is independent of the final accuracy metric, and gains are measured on external benchmarks. The derivation is therefore self-contained against the stated inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that ANLL from one generation is a valid difficulty proxy and that the marginal-greedy algorithm exactly solves the concave surrogate; both are introduced in the paper rather than derived from prior literature.

axioms (1)

domain assumption The marginal-greedy algorithm solves the concave coverage-maximization surrogate exactly
Stated in the abstract description of Phase 2

pith-pipeline@v0.9.1-grok · 5736 in / 1319 out tokens · 31432 ms · 2026-06-29T18:26:48.062768+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Bennett Fox

Optimal self-consistency for efficient rea- soning with large language models.arXiv preprint arXiv:2511.12309. Bennett Fox. 1966. Discrete optimization via marginal analysis.Management science, 13(3):210–216. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

work page internal anchor Pith review arXiv 1966
[2]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Ja- cob Steinhardt. 2021. Measuring mathematical prob- lem solving with the MATH dataset.arXiv preprint arXiv:2103.03874. Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, and ...

work page internal anchor Pith review Pith/arXiv arXiv 2009
[3]

Let's Verify Step by Step

Let’s verify step by step.arXiv preprint arXiv:2305.20050. Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. Generating with confidence: Uncertainty quantifica- tion for black-box large language models.Transac- tions on Machine Learning Research, 2024. Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucina- ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

For any feasible N and each question i, the telescoping identity (empty sums equal zero) gives: fi(Ni)−f i(N ∗ i ) = Ni−1X n=N ∗ i ∆i(n)− N ∗ i −1X n=Ni ∆i(n). (9) To see this: if Ni > N ∗ i , the first sum telescopes fi(N ∗ i )→f i(Ni) and the second is empty; if Ni < N ∗ i , the second sum telescopes fi(Ni)→ fi(N ∗ i ) with a sign flip and the first is ...

1966
[5]

Task” describes the evaluation format; “Size

quantifies disagreement among multiple stochastic samples from the model. Given a ques- tion xi, we draw K independent Phase-1 samples, parse each into a final answer, and compute the Shannon entropy of the empirical answer distri- bution: H(x i) =− P v∈Vi qv lnq v, where qv is 12 Table 5: Wall-clock time (s) per full-dataset run at bud- getN=4.Bold= fast...

2023

[1] [1]

Bennett Fox

Optimal self-consistency for efficient rea- soning with large language models.arXiv preprint arXiv:2511.12309. Bennett Fox. 1966. Discrete optimization via marginal analysis.Management science, 13(3):210–216. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

work page internal anchor Pith review arXiv 1966

[2] [2]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Ja- cob Steinhardt. 2021. Measuring mathematical prob- lem solving with the MATH dataset.arXiv preprint arXiv:2103.03874. Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, and ...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[3] [3]

Let's Verify Step by Step

Let’s verify step by step.arXiv preprint arXiv:2305.20050. Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. Generating with confidence: Uncertainty quantifica- tion for black-box large language models.Transac- tions on Machine Learning Research, 2024. Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucina- ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

For any feasible N and each question i, the telescoping identity (empty sums equal zero) gives: fi(Ni)−f i(N ∗ i ) = Ni−1X n=N ∗ i ∆i(n)− N ∗ i −1X n=Ni ∆i(n). (9) To see this: if Ni > N ∗ i , the first sum telescopes fi(N ∗ i )→f i(Ni) and the second is empty; if Ni < N ∗ i , the second sum telescopes fi(Ni)→ fi(N ∗ i ) with a sign flip and the first is ...

1966

[5] [5]

Task” describes the evaluation format; “Size

quantifies disagreement among multiple stochastic samples from the model. Given a ques- tion xi, we draw K independent Phase-1 samples, parse each into a final answer, and compute the Shannon entropy of the empirical answer distri- bution: H(x i) =− P v∈Vi qv lnq v, where qv is 12 Table 5: Wall-clock time (s) per full-dataset run at bud- getN=4.Bold= fast...

2023