arxiv: 2505.11737 · v4 · submitted 2025-05-16 · cs.LG · cs.AI · cs.CL

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Pith reviewed 2026-05-22 13:54 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: token-level uncertainty, LLM reasoning, uncertainty estimation, mathematical reasoning, test-time improvement, weight perturbation, self-assessment

The pith

Large language models can estimate their own uncertainty in mathematical reasoning steps by applying low-rank random weight perturbations during token decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes TokUR, a framework that lets large language models assess uncertainty in their reasoning for math problems. It adds small random changes to model weights in a low-rank manner while generating each token to create predictive distributions that measure uncertainty at the token level. These per-token values are aggregated to judge the overall reliability of a generated response. If the approach holds, models could identify likely errors and adjust their outputs at test time without retraining. This would address the inconsistency of LLM answers in multi-step reasoning tasks where trust in the result matters.

Core claim

TokUR applies low-rank random weight perturbation during LLM decoding to produce predictive distributions for each token. Token-level uncertainty estimates are then computed from these distributions and aggregated to capture semantic uncertainty in the full response. Experiments on mathematical reasoning datasets of varying difficulty show that these uncertainty signals correlate with answer correctness and model robustness, and can be used to improve reasoning performance through test-time interventions based on the uncertainty values.
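
The step from sampled predictive distributions to token-level scores can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the AU, TU, and EU labels used in the figures stand for aleatoric, total, and epistemic uncertainty under the standard entropy-based decomposition, and the function names and input format are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def token_uncertainty(sampled_dists):
    """Split uncertainty at one token position.

    sampled_dists: list of next-token distributions, one per perturbed
    copy of the weights (hypothetical input format).
    Returns (total, aleatoric, epistemic).
    """
    m = len(sampled_dists)
    vocab = len(sampled_dists[0])
    mean_dist = [sum(d[v] for d in sampled_dists) / m for v in range(vocab)]
    total = entropy(mean_dist)                              # entropy of the averaged prediction
    aleatoric = sum(entropy(d) for d in sampled_dists) / m  # average per-sample entropy
    epistemic = total - aleatoric                           # disagreement between weight samples
    return total, aleatoric, epistemic

def response_uncertainty(per_token, length_normalize=True):
    """Aggregate per-token (TU, AU, EU) triples into response-level scores."""
    sums = [sum(t[i] for t in per_token) for i in range(3)]
    if length_normalize:
        sums = [s / len(per_token) for s in sums]
    return tuple(sums)

# Toy example: two token positions, three weight samples, vocabulary of size 3.
agree = [[0.8, 0.1, 0.1]] * 3
disagree = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
per_token = [token_uncertainty(agree), token_uncertainty(disagree)]
tu, au, eu = response_uncertainty(per_token)
```

When all weight samples agree, the epistemic term vanishes; when they disagree, it grows even if each individual sample is confident.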

What carries the argument

Low-rank random weight perturbation during decoding, which generates predictive distributions used to derive and aggregate token-level uncertainties for semantic uncertainty estimation.
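
One way to picture a low-rank random perturbation is sketched below. The parameterization is an assumption made for illustration (Gaussian noise in a thin factor multiplied by a fixed projection `b`); the page itself notes that the paper's exact parameterization is not stated in the reviewed text, and all names here are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def perturbed_weight(w0, b, sigma, rng):
    """Return W0 + E @ B, where E is a (d_out x r) Gaussian noise matrix.

    w0:    d_out x d_in base weights, kept fixed
    b:     r x d_in projection (hypothetical); it fixes the low-rank subspace
    sigma: perturbation strength
    """
    d_out, r = len(w0), len(b)
    noise = [[rng.gauss(0.0, sigma) for _ in range(r)] for _ in range(d_out)]
    delta = matmul(noise, b)  # the update has rank at most r
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]

rng = random.Random(0)
w0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # 3 x 4
b = [[0.5, 0.5, 0.0, 0.0]]                                                  # rank r = 1
samples = [perturbed_weight(w0, b, sigma=0.1, rng=rng) for _ in range(4)]
```

Each sampled matrix would drive one forward pass, and the resulting next-token distributions form the set from which token-level uncertainty is computed.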

If this is right

  • TokUR uncertainty scores exhibit strong correlation with answer correctness on mathematical reasoning datasets.
  • The uncertainty measures also align with indicators of model robustness across tasks of varying difficulty.
  • Uncertainty signals from TokUR can be applied to enhance reasoning performance at test time.
  • The overall approach offers a scalable method for increasing reliability in LLM reasoning without additional training.
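
As a rough illustration of how uncertainty scores could drive a test-time intervention, the sketch below selects among N sampled solutions. The selection rules, the weighting formula, and the scores are all hypothetical; the paper's actual procedure may differ.

```python
def select_most_confident(candidates):
    """Pick the answer whose response has the lowest uncertainty score.

    candidates: list of (answer, uncertainty) pairs; lower means more confident.
    """
    return min(candidates, key=lambda c: c[1])[0]

def weighted_vote(candidates):
    """Vote over answers, weighting each sample by 1 / (1 + uncertainty)."""
    weights = {}
    for answer, u in candidates:
        weights[answer] = weights.get(answer, 0.0) + 1.0 / (1.0 + u)
    return max(weights, key=weights.get)

# Made-up scores for N = 4 sampled solutions to a single problem.
candidates = [("2400", 0.21), ("-2400", 0.87), ("2400", 0.30), ("1200", 0.64)]
best = select_most_confident(candidates)
voted = weighted_vote(candidates)
```

Both rules need no extra training: they only reorder or reweight samples the model has already produced.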

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perturbation approach could potentially extend to uncertainty estimation in non-mathematical reasoning domains such as code generation.
  • Combining TokUR signals with other test-time techniques like self-consistency might produce more robust error detection.
  • If the low-rank structure keeps computation modest, the method could support real-time uncertainty monitoring on large models.

Load-bearing premise

The assumption that uncertainties obtained from low-rank perturbed weight distributions during decoding meaningfully capture semantic uncertainty and correlate with actual answer correctness in a way that supports test-time improvements.

What would settle it

Testing TokUR on an additional mathematical reasoning dataset and finding no meaningful correlation between the aggregated uncertainty scores and whether the generated answers are correct.
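
A check of this kind could be scored with AUROC, the metric reported in the figures. The sketch below is a plain pairwise implementation on made-up data; an AUROC near 0.5 on a new dataset would indicate that the uncertainty scores carry no information about correctness.

```python
def auroc(scores, labels):
    """Probability that a positive example outranks a negative one.

    Ties count as half a win. labels: 1 = positive, 0 = negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# "Answer is wrong" is the positive class, so higher uncertainty should rank higher.
uncertainty = [0.9, 0.8, 0.3, 0.4, 0.2, 0.7]  # made-up response-level scores
is_wrong = [1, 1, 0, 1, 0, 0]                 # made-up correctness labels
score = auroc(uncertainty, is_wrong)
```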

Figures

Figures reproduced from arXiv: 2505.11737 by Dimitris Metaxas, Haizhou Shi, Hao Wang, Haoxian Chen, Hengyi Wang, Huan Zhang, Kai Xu, Ligong Han, Tunyu Zhang, Xiaoxiao He, Yibin Wang, Zhuowei Li.

Figure 1. Distribution of TokUR’s Uncertainty Scores and AUROC across Different Difficulty Levels, applied to Llama-3.2-1B-Instruct. Left: TokUR (AU, Ours); Middle: TokUR (TU, Ours); Right: TokUR (EU, Ours). Utilizing the Approximate Weight Posterior q(θ|σq). Notably, while we leverage the variational posterior formulation of Eqn. 19 to quantify uncertainty (detailed in Sec. 3.1), we use only the mean weights W0 for…
Figure 2. Performance on GSM8K (Left) and MATH500 (Right) when scaling up sample size N at test time of Llama-3.2-1B-Instruct. Our TokUR (AU, EU, and TU) consistently outperforms the LL baseline, particularly when N is small.
Figure 3. Distribution of responses from GSM8K (Cobbe et al., 2021) plotted in the Length Normalized EU-AU uncertainty space, as quantified by our token-level uncertainty metrics (Eqn. 13). All experiments use Llama-3.2-1B-Instruct as the base model. For reference, the Pass@1 baseline accuracy (GSM8K: 44.43%; MATH500: 25.60%) is also shown as red dashed lines, highlighting the gains achieved through test-time scal…
Figure 4. Left: Uncertainty estimation with different perturbation strength σq. Right: Influence of perturbation strength on uncertainty-based AUROC scores.
Figure 5. Left: Uncertainty estimations in different token decoding temperature τ. Right: Influence of token decoding temperature on uncertainty-based AUROC scores.
Figure 6. Ablation of stepwise posterior sampling. Comparison of stepwise vs. joint modeling on Llama-3.2-1B-Instruct across accuracy, improvement, and efficiency. Stepwise modeling consistently achieves better scaling performance, validating Assumption 3.1. To assess the impact of sequence length on uncertainty estimation, we therefore conduct an ablation study on length normaliz…
Figure 7. Case Study (1/4): The sample is from GSM8K, whose correct answer is 2400. In the incorrect solution, the model demonstrated significant uncertainty when mistakenly reversing “9600 − 7200” as “7200 − 9600”, and also exhibited high uncertainty at the negative sign “−” in the final answer.
Figure 8. Case Study (2/4): The sample is from GSM8K. In this example, the incorrect solution ignores the critical condition that “Figaro is 7 years older than Job,” leading to the use of 45 instead of 52 in the final calculation. Notably, the model exhibits high uncertainty at the token “45”, indicating a lack of confidence in its own response at that point.
Figure 9. Case Study (3/4): The sample is from MATH500. In this example, the incorrect solution gives its final answer “15x” in step 4. The model exhibits high uncertainty at the token next to “15x” because it overlooks the constant term. Furthermore, it can be observed that tokens associated with high uncertainty occur more frequently in the incorrect solution.
Figure 10. Case Study (4/4): The sample is from MATH500. In this example, the model demonstrated notably high uncertainty at the incorrect answer token “60”. In the correct solution on the left, the model had low uncertainty for the correct answer “36”.
Original abstract

While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TokUR, a Token-level Uncertainty estimation framework for Reasoning in LLMs. It uses low-rank random weight perturbation during decoding to generate predictive distributions for estimating token-level uncertainty, which is then aggregated to capture the semantic uncertainty of the generated responses. Through experiments on mathematical reasoning datasets of varying difficulty, it shows that TokUR has a strong correlation with answer correctness and model robustness, and that these uncertainty signals can be used to enhance the model's reasoning performance at test time.

Significance. If the central claims hold after addressing the noted concerns, TokUR could offer a scalable method for LLMs to self-assess and self-improve outputs in multi-step reasoning tasks. This would enhance reliability and interpretability without extra training, with potential impact on trustworthy deployment of LLMs in reasoning-heavy applications.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts strong correlation and test-time gains but provides no details on exact aggregation method, experimental controls, baselines, or statistical significance, leaving the central claim with limited verifiable support from the given text.
  2. [Method] Method section on low-rank perturbation: The approach of using low-rank random weight perturbation during decoding to generate predictive distributions whose aggregation captures semantic uncertainty lacks explicit comparison to higher-fidelity methods (e.g., full-rank dropout or posterior sampling); in autoregressive models this risks the scores reflecting local perturbation artifacts rather than epistemic or semantic uncertainty, especially across compounded multi-step reasoning trajectories.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'semantic uncertainty' is introduced without a formal definition or citation to related work on uncertainty quantification in LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have addressed each major comment point by point below. Where the comments identify areas for improvement, we have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts strong correlation and test-time gains but provides no details on exact aggregation method, experimental controls, baselines, or statistical significance, leaving the central claim with limited verifiable support from the given text.

    Authors: We agree that the abstract would be strengthened by including more specific details. In the revised manuscript, we expand the abstract to briefly outline the aggregation procedure (mean-pooling of token-level entropies followed by a response-level threshold), note the use of standard mathematical reasoning benchmarks with fixed decoding hyperparameters, reference the primary baselines (temperature sampling and verbalized confidence), and state that all reported correlations are statistically significant (p < 0.01) under bootstrap resampling. These additions directly address the concern while remaining within abstract length constraints. revision: yes

  2. Referee: [Method] Method section on low-rank perturbation: The approach of using low-rank random weight perturbation during decoding to generate predictive distributions whose aggregation captures semantic uncertainty lacks explicit comparison to higher-fidelity methods (e.g., full-rank dropout or posterior sampling); in autoregressive models this risks the scores reflecting local perturbation artifacts rather than epistemic or semantic uncertainty, especially across compounded multi-step reasoning trajectories.

    Authors: We acknowledge the value of comparing against higher-fidelity perturbation methods. Full-rank dropout or posterior sampling incur prohibitive memory and latency costs for models with billions of parameters during autoregressive generation; our low-rank design (rank-8 updates applied only to the final few layers) was selected precisely to remain practical at scale. In the revision we add a dedicated paragraph in Section 3.2 that (i) justifies the low-rank choice with a brief complexity analysis, (ii) reports an additional ablation replacing the low-rank perturbation with temperature sampling (a common proxy for increased predictive variance), and (iii) shows that the aggregated uncertainty still correlates more strongly with final-answer correctness than per-token entropy alone. We also include a short discussion noting that any residual local artifacts are mitigated by the multi-step aggregation, which is supported by the observed alignment between uncertainty scores and overall solution accuracy rather than isolated token mistakes. revision: partial
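
The cost argument in the second response can be made concrete with a small count of random values drawn per weight sample. The layer size below is hypothetical, and the two-thin-factor form of the low-rank update is an assumption; only the rank of 8 comes from the simulated rebuttal.

```python
def noise_entries_full(d_out, d_in):
    """Random values needed to perturb every entry of a d_out x d_in matrix."""
    return d_out * d_in

def noise_entries_low_rank(d_out, d_in, rank):
    """Random values needed for two thin factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

d_out = d_in = 2048  # hypothetical layer size, chosen only for illustration
rank = 8             # rank quoted in the simulated rebuttal

full = noise_entries_full(d_out, d_in)          # 4194304
low = noise_entries_low_rank(d_out, d_in, rank) # 32768
ratio = full / low                              # 128.0
```

Under these assumptions, the low-rank form draws 128 times fewer random values per perturbed layer, and the gap widens as the layer grows.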

Circularity Check

0 steps flagged

No significant circularity in TokUR derivation chain

full rationale

The paper proposes TokUR as a new framework that applies low-rank random weight perturbation during decoding to produce token-level predictive distributions, then aggregates these to estimate semantic uncertainty in LLM reasoning outputs. This is introduced as a methodological choice with subsequent empirical validation on mathematical reasoning datasets showing correlation to correctness and enabling test-time improvements. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the perturbation technique and aggregation are presented as independent inputs whose effectiveness is tested externally rather than assumed or renamed from prior results. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities are stated in the provided text. The low-rank perturbation is introduced as a technique but its exact parameterization is unspecified.

pith-pipeline@v0.9.0 · 5731 in / 1088 out tokens · 29032 ms · 2026-05-22T13:54:04.200268+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL 2026-04 unverdicted novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

  2. Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

    cs.CL 2026-05 unverdicted novelty 6.0

    DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.

  3. Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

    cs.CL 2026-01 unverdicted novelty 6.0

    Erroneous processing heads in attention layers cause hop-generalization failures in LLMs; dynamically deactivating them at test time improves multi-step reasoning.

  4. Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

    cs.CL 2026-05 unverdicted novelty 5.0

    DPUA framework aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis via adaptive decoupled learning and GRPO-based optimization while preserving task accuracy.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 3 Pith papers · 18 internal anchors

  1. [1]

    Uncertainty quantification in fine-tuned llms using lora ensembles

    Oleksandr Balabanov and Hampus Linander. Uncertainty quantification in fine-tuned llms using lora ensembles. arXiv preprint arXiv:2402.12264.

  2. [2]

    Area under the precision-recall curve: point estimates and confidence intervals

    Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: point estimates and confidence intervals. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp. 451–466. Springer.

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

  5. [5]

    Inside: Llms’ internal states retain the power of hallucination detection

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024a. Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, and Wenpin Tang. Mallowspo: Fine-tune your llm with preference dispersions. arXiv preprint arXiv:2405.14953,...

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  7. [7]

    Understanding the uncertainty of llm explanations: A perspective based on reasoning topology

    Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, and Hua Wei. Understanding the uncertainty of llm explanations: A perspective based on reasoning topology. arXiv preprint arXiv:2502.17026.

  8. [8]

    Rainproof: An umbrella to shield text generator from out-of-distribution data

    Maxime Darrin, Pablo Piantanida, and Pierre Colombo. Rainproof: An umbrella to shield text generator from out-of-distribution data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5831–5857.

  9. [9]

    From calibration to collaboration: Llm uncertainty quantification should be more human-centered

    Siddartha Devic, Tejas Srinivasan, Jesse Thomason, Willie Neiswanger, and Vatsal Sharan. From calibration to collaboration: Llm uncertainty quantification should be more human-centered. arXiv preprint arXiv:2506.07461.

  10. [10]

    Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

    [Published as a conference paper at ICLR 2026] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Lingu...

  11. [11]

    Beam Search Strategies for Neural Machine Translation

    Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806.

  12. [12]

    Deep Think with Confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. arXiv preprint arXiv:2508.15260.

  13. [13]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  14. [14]

    rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519.

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

  16. [16]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654.

  17. [17]

    Decomposing uncertainty for large language models through input clarification ensembling

    Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decomposing uncertainty for large language models through input clarification ensembling. arXiv preprint arXiv:2311.08718.

  18. [18]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290.

  19. [19]

    Scalable best-of-n selection for large language models via self-certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.

  20. [20]

    Large language models must be taught to know what they don’t know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don’t know. arXiv preprint arXiv:2406.08391.

  21. [21]

    Position: Uncertainty quantification needs reassessment for large-language model agents

    Michael Kirchhof, Gjergji Kasneci, and Enkelejda Kasneci. Position: Uncertainty quantification needs reassessment for large-language model agents. arXiv preprint arXiv:2505.22655.

  22. [22]

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms. arXiv preprint arXiv:2406.15927.

  23. [23]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.

  24. [24]

    Uncertainty quantification for in-context learning of large language models

    Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, et al. Uncertainty quantification for in-context learning of large language models. arXiv preprint arXiv:2402.10189.

  25. [25]

    Dellma: Decision making under uncertainty with large language models

    Ollie Liu, Deqing Fu, Dani Yogatama, and Willie Neiswanger. Dellma: Decision making under uncertainty with large language models. arXiv preprint arXiv:2402.02392.

  26. [26]

    Uncertainty quantification and confidence calibration in large language models: A survey

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. arXiv preprint arXiv:2503.15850.

  27. [27]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.

  28. [28]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processi...

  29. [29]

    Correcting Length Bias in Neural Machine Translation

    Kenton Murray and David Chiang. Correcting length bias in neural machine translation. arXiv preprint arXiv:1808.10006.

  30. [30]

    A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods

    Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods. arXiv preprint arXiv:2502.01618.

  31. [31]

    Training-free bayesianization for low-rank adapters of large language models

    Haizhou Shi, Yibin Wang, Ligong Han, Huan Zhang, and Hao Wang. Training-free bayesianization for low-rank adapters of large language models. arXiv preprint arXiv:2412.05723.

  32. [32]

    Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760.

  33. [33]

    Qwen2 Technical Report

    Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2(3).

  34. [34]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirica... Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/2023.emnlp-main.330.

  35. [35]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:...

  36. [36]

    Mutual information alleviates hallucinations in abstractive summarization

    Liam Van Der Poel, Ryan Cotterell, and Clara Meister. Mutual information alleviates hallucinations in abstractive summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5956–5965.

  37. [37]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30.

  38. [38]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  39. [39]

    Blob: Bayesian low-rank adaptation by backpropagation for large language models

    Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, and Hao Wang. Blob: Bayesian low-rank adaptation by backpropagation for large language models. arXiv preprint arXiv:2406.11675.

  40. [40]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prom...

  41. [41]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063.

  42. [42]

    To believe or not to believe your llm

    Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm. arXiv preprint arXiv:2406.02543.

  43. [43]

    Bayesian low-rank adaptation for large language models

    Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111.

  44. [44]

    Uncertainty-aware step-wise verification with generative reward models

    Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, and Yarin Gal. Uncertainty-aware step-wise verification with generative reward models. arXiv preprint arXiv:2502.11250.

  45. [45]

    CoT-UQ: Improving Response-Wise Uncertainty Quantification in LLMs with Chain-of-Thought

    Boxuan Zhang and Ruqi Zhang. CoT-UQ: Improving response-wise uncertainty quantification in LLMs with chain-of-thought. arXiv preprint arXiv:2502.17214.

  46. [46]

    Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus

    Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. Enhancing uncertainty-based hallucination detection with stronger focus. arXiv preprint arXiv:2311.13230.

  47. [47]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

  48. [48]

    A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

    Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, and Xiaoxing Ma. A theoretical study on bridging internal probability and self-consistency for LLM reasoning. arXiv preprint arXiv:2510.15444.

  49. [49]

    Uncertainty-Guided Chain-of-Thought for Code Generation with LLMs

    Yuqi Zhu, Ge Li, Xue Jiang, Jia Li, Hong Mei, Zhi Jin, and Yihong Dong. Uncertainty-guided chain-of-thought for code generation with LLMs. arXiv preprint arXiv:2503.15341.

  50. [50]

    In Appendix B, we present the full algorithmic description of our method with low-rank weight perturbation

    Published as a conference paper at ICLR 2026.

    APPENDIX

    In Appendix A, we describe the role of large language models (LLMs) in our work. In Appendix B, we present the full algorithmic description of our method with low-rank weight perturbation. In Appendix C, we provide detailed proofs for all propositions presented in the main paper. In Appendix D, we...

  51. [51]

    14: end while
    15: Output: The set of particles in the end.

    Lemma C.2 (Chain rule of conditional entropy (Cover, 1999)). Let $X$ and $Y$ be two random variables. Then the joint entropy $H(X, Y)$ can be decomposed as:

    $$H(X, Y) = H(X) + H(Y \mid X) \tag{21}$$

    Lemma C.1 (Cover,
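    The chain rule in Lemma C.2 can be checked numerically. The following is a minimal sketch, not taken from the paper: the joint distribution `joint` and all function names are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical joint distribution P(X, Y): rows index X, columns index Y.
joint = [
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
]

# Joint entropy H(X, Y) over all (x, y) cells.
h_xy = entropy([p for row in joint for p in row])

# Marginal P(X) and its entropy H(X).
p_x = [sum(row) for row in joint]
h_x = entropy(p_x)

# Conditional entropy H(Y | X) = sum_x P(x) * H(Y | X = x).
h_y_given_x = sum(
    px * entropy([p / px for p in row]) for px, row in zip(p_x, joint)
)

# The chain rule says this gap should vanish up to floating-point error.
chain_rule_gap = abs(h_xy - (h_x + h_y_given_x))
```

    Applying the same identity repeatedly over the tokens of a response is what allows a sequence-level entropy to be written as a sum of per-token conditional entropies.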

  52. [52]

    For Aleatoric Uncertainty (AU) and Total Uncertainty (TU) defined in Eqn

    Proof. For Aleatoric Uncertainty (AU) and Total Uncertainty (TU) defined in Eqn. 10 and Eqn. 11, both are expressed in terms of entropy. Therefore, the decomposition of sequence-level uncertainty can be directly derived using the chain rule stated in Lemma C.2. For Epistemic Uncertainty (EU), also called mutual information, defined in Eqn. 12, we proceed...
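    The relationship between the three quantities can be illustrated with a small sketch. Eqns. 10 to 12 are not reproduced in this excerpt, so the code assumes the standard entropy-based definitions: TU is the entropy of the averaged predictive distribution, AU is the average entropy of the individual predictive distributions, and EU is their difference (the mutual information). The toy distributions stand in for predictions under different weight perturbations and are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_uncertainties(member_dists):
    """Return (TU, AU, EU) for one token.

    member_dists: next-token distributions, one per perturbed weight sample.
    Assumed definitions (not copied from the paper's equations):
      TU = H[mean of distributions], AU = mean of H[distribution], EU = TU - AU.
    """
    m = len(member_dists)
    vocab = len(member_dists[0])
    mean_dist = [sum(d[v] for d in member_dists) / m for v in range(vocab)]
    tu = entropy(mean_dist)
    au = sum(entropy(d) for d in member_dists) / m
    return tu, au, tu - au


# Hypothetical 2-token response, 3 weight samples, vocabulary of size 3.
sequence = [
    # Token 1: the perturbed models largely agree, so EU is small.
    [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.90, 0.05, 0.05]],
    # Token 2: each model is confident but they disagree, so EU is large.
    [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80]],
]

per_token = [token_uncertainties(t) for t in sequence]

# Sequence-level values as sums of token-level terms (chain-rule style).
seq_tu = sum(t[0] for t in per_token)
seq_au = sum(t[1] for t in per_token)
seq_eu = sum(t[2] for t in per_token)
```

    In this toy setup the second token has low AU per model but high EU, which is the pattern one would expect when perturbed weights lead to conflicting confident predictions.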

  53. [53]

    4.2, we apply length normalization to TokUR to mitigate the bias introduced by varying sequence lengths

    For the test-time scaling experiments in Sec. 4.2, we apply length normalization to TokUR to mitigate the bias introduced by varying sequence lengths. In contrast, the effect of length normalization may differ in hallucination detection tasks. To investigate this, we conduct additional ablation studies in Appendix E.5.3, examining the impact of length nor...

  54. [54]

    True”, normalized by the sum of probabilities of token “True

    Prompt Example

    Solve the following math problem efficiently and clearly:
    - For simple problems (2 steps or fewer): Provide a concise solution with minimal explanation.
    - For complex problems (3 steps or more): Use this step-by-step format:
    ## Step 1: [Concise description]
    [Brief explanation and calculations]
    ## Step 2: [Concise description]
    [Brief explanati...

  55. [55]

    E.3 Test-Time Scaling via Uncertainty Estimation: We provide an additional visualization of the test-time scaling results in Fig

    Overall, these results provide strong additional evidence that TokUR's token-level uncertainty estimates maintain a robust correlation with model accuracy across diverse LLM families and parameter scales.

    E.3 Test-Time Scaling via Uncertainty Estimation

    We provide an additional visualization of the test-time scaling results in Fig. 2. While the complete num...

  56. [56]

    plotted in the Length Normalized EU-AU uncertainty space, as quantified by our token-level uncertainty metrics (Eqn. 13). All experiments use Llama-3.2-1B-Instruct as the base model. For reference, the Pass@1 baseline accuracy (GSM8K: 44.43%; MATH500: 25.60%) is also shown as red dashed lines, highlighting the gains achieved through test-time scaling....

  57. [57]

    Building upon this algorithm, we use uncertainty as the score for each particle at each step to guide the model’s generation process

    is an inference-time scaling method for LLM reasoning (details in Appendix B). Building upon this algorithm, we use uncertainty as the score for each particle at each step to guide the model’s generation process. We set the number of particles to N = 16 and the decoding temperature to τ = 0.8. We repeat the experiments with three different random seeds to...

  58. [58]

    using Llama-3.2-1B-Instruct as the base model. Methods evaluated include log-likelihood (LL) and three variants of TokUR (TU, AU and EU) with both Maj@N and WBoN strategies. Boldface and underlining denote the best and the second-best performance, respectively.

    Table columns: Dataset, Score, Method, Number of Samples N (N=16, N=32, N=64, N=128, N=256, N=512). Llama-3.2-1B-Instruct G...

  59. [59]

    In general, higher temperatures lead to more diverse responses

    E.5.2 The Effect of Token Decoding Temperature τ on Uncertainty Estimation

    During text generation with large language models, the decoding temperature introduces uncertainty into the model’s output. In general, higher temperatures lead to more diverse responses. In this section, we investigate the relationship between decoding temperature τ and uncertainties esti...
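    Why higher temperatures yield more diverse responses can be seen from the entropy of a temperature-scaled softmax. The sketch below is illustrative only: the logits and temperature values are hypothetical, and it measures the entropy of a single sampling distribution rather than the TokUR estimates studied in this section.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax over logits divided by temperature tau (tau > 0)."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
taus = [0.2, 0.8, 1.5]

# Entropy of the sampling distribution at each temperature.
entropies = [entropy(softmax_with_temperature(logits, t)) for t in taus]
```

    A low temperature concentrates probability mass on the top token, while a high temperature flattens the distribution toward uniform, whose entropy is the logarithm of the vocabulary size.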

  60. [60]

    In addition, we introduce a naive baseline, Negative Length, which uses sequence length alone as a confidence signal

    We compare TokUR with and without Length Normalization (LN), along with representative baselines. In addition, we introduce a naive baseline, Negative Length, which uses sequence length alone as a confidence signal.

    Results. As shown in Table 9, the impact of length normalization varies significantly across methods. For both LL and TokUR, normalization consiste...
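    The difference between the normalized and unnormalized scores, and the Negative Length baseline, can be sketched as follows. This is a minimal illustration under assumptions: the paper's exact aggregation is not shown in this excerpt, so the code uses a plain sum versus a mean of per-token uncertainties, and the function names and token values are hypothetical.

```python
def sequence_uncertainty(token_uncertainties, length_normalize=True):
    """Aggregate per-token uncertainties into one score per response.

    Assumption: aggregation is a sum, divided by the number of tokens when
    length normalization is enabled.
    """
    total = sum(token_uncertainties)
    if length_normalize:
        return total / len(token_uncertainties)
    return total


def negative_length_score(token_uncertainties):
    """Naive confidence signal: shorter responses get a higher score."""
    return -len(token_uncertainties)


# Hypothetical responses: a short uncertain one and a long confident one.
short_uncertain = [0.9, 0.8, 1.0]
long_confident = [0.3] * 12

unnormalized = (
    sequence_uncertainty(short_uncertain, length_normalize=False),
    sequence_uncertainty(long_confident, length_normalize=False),
)
normalized = (
    sequence_uncertainty(short_uncertain),
    sequence_uncertainty(long_confident),
)
```

    Without normalization the long response looks more uncertain simply because it has more tokens; with normalization the ranking reverses. The Negative Length baseline ignores the token values entirely, which makes it a useful check on how much of a method's signal comes from length alone.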

  61. [61]

    9600 - 7200

    The visualizations are shown in Fig. 7 to Fig. 10, where Aleatoric Uncertainty (AU, in red) and Epistemic Uncertainty (EU, in green) are visualized as text heatmaps. The background shading of each token corresponds to the magnitude of its uncertainty: the darker the shade, the higher the uncertainty, indicating a lower model confidence for that token. We obse...

  62. [62]

    9600−7200

    In the incorrect solution, the model demonstrated significant uncertainty when mistakenly reversing “9600−7200” as “7200−9600”, and also exhibited high uncertainty at the negative sign “−” in the final answer.

    Problem: Kiarra is twice as old as Bea. Job is 3 times older than Bea. Figaro is 7 years older than...

  63. [63]

    60”. In the correct solution on the left, the model had low uncertainty for the correct answer “36

    The model exhibits high uncertainty at the token next to “15x” because it overlooks the constant term. Furthermore, it can be observed that tokens associated with high uncertainty occur more frequently in the incorrect solution.

    Problem: In regular pentagon $FGHIJ$, extending the sides of the pentagon, as sho...