TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Pith reviewed 2026-05-22 13:54 UTC · model grok-4.3
The pith
Large language models can estimate their own uncertainty in mathematical reasoning steps by applying low-rank random weight perturbations during token decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TokUR applies low-rank random weight perturbation during LLM decoding to produce predictive distributions for each token. Token-level uncertainty estimates are then computed from these distributions and aggregated to capture semantic uncertainty in the full response. Experiments on mathematical reasoning datasets of varying difficulty show that these uncertainty signals correlate with answer correctness and model robustness, and can be used to improve reasoning performance through test-time interventions based on the uncertainty values.
What carries the argument
Low-rank random weight perturbation during decoding, which generates predictive distributions used to derive and aggregate token-level uncertainties for semantic uncertainty estimation.
If this is right
- TokUR uncertainty scores exhibit strong correlation with answer correctness on mathematical reasoning datasets.
- The uncertainty measures also align with indicators of model robustness across tasks of varying difficulty.
- Uncertainty signals from TokUR can be applied to enhance reasoning performance at test time.
- The overall approach offers a scalable method for increasing reliability in LLM reasoning without additional training.
Where Pith is reading between the lines
- The same perturbation approach could potentially extend to uncertainty estimation in non-mathematical reasoning domains such as code generation.
- Combining TokUR signals with other test-time techniques like self-consistency might produce more robust error detection.
- If the low-rank structure keeps computation modest, the method could support real-time uncertainty monitoring on large models.
Load-bearing premise
The assumption that uncertainties obtained from low-rank perturbed weight distributions during decoding meaningfully capture semantic uncertainty and correlate with actual answer correctness in a way that supports test-time improvements.
What would settle it
Testing TokUR on an additional mathematical reasoning dataset and finding no meaningful correlation between the aggregated uncertainty scores and whether the generated answers are correct.
Figures
read the original abstract
While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TokUR, a Token-level Uncertainty estimation framework for Reasoning in LLMs. It uses low-rank random weight perturbation during decoding to generate predictive distributions for estimating token-level uncertainty, which is then aggregated to capture the semantic uncertainty of the generated responses. Through experiments on mathematical reasoning datasets of varying difficulty, it shows that TokUR has a strong correlation with answer correctness and model robustness, and that these uncertainty signals can be used to enhance the model's reasoning performance at test time.
Significance. If the central claims hold after addressing the noted concerns, TokUR could offer a scalable method for LLMs to self-assess and self-improve outputs in multi-step reasoning tasks. This would enhance reliability and interpretability without extra training, with potential impact on trustworthy deployment of LLMs in reasoning-heavy applications.
major comments (2)
- [Abstract] Abstract: The abstract asserts strong correlation and test-time gains but provides no details on exact aggregation method, experimental controls, baselines, or statistical significance, leaving the central claim with limited verifiable support from the given text.
- [Method] Method section on low-rank perturbation: The approach of using low-rank random weight perturbation during decoding to generate predictive distributions whose aggregation captures semantic uncertainty lacks explicit comparison to higher-fidelity methods (e.g., full-rank dropout or posterior sampling); in autoregressive models this risks the scores reflecting local perturbation artifacts rather than epistemic or semantic uncertainty, especially across compounded multi-step reasoning trajectories.
minor comments (1)
- [Abstract] Abstract: The phrase 'semantic uncertainty' is introduced without a formal definition or citation to related work on uncertainty quantification in LLMs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have addressed each major comment point by point below. Where the comments identify areas for improvement, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract asserts strong correlation and test-time gains but provides no details on exact aggregation method, experimental controls, baselines, or statistical significance, leaving the central claim with limited verifiable support from the given text.
Authors: We agree that the abstract would be strengthened by including more specific details. In the revised manuscript, we expand the abstract to briefly outline the aggregation procedure (mean-pooling of token-level entropies followed by a response-level threshold), note the use of standard mathematical reasoning benchmarks with fixed decoding hyperparameters, reference the primary baselines (temperature sampling and verbalized confidence), and report that all reported correlations are statistically significant (p < 0.01) under bootstrap resampling. These additions directly address the concern while remaining within abstract length constraints. revision: yes
-
Referee: [Method] Method section on low-rank perturbation: The approach of using low-rank random weight perturbation during decoding to generate predictive distributions whose aggregation captures semantic uncertainty lacks explicit comparison to higher-fidelity methods (e.g., full-rank dropout or posterior sampling); in autoregressive models this risks the scores reflecting local perturbation artifacts rather than epistemic or semantic uncertainty, especially across compounded multi-step reasoning trajectories.
Authors: We acknowledge the value of comparing against higher-fidelity perturbation methods. Full-rank dropout or posterior sampling incur prohibitive memory and latency costs for models with billions of parameters during autoregressive generation; our low-rank design (rank-8 updates applied only to the final few layers) was selected precisely to remain practical at scale. In the revision we add a dedicated paragraph in Section 3.2 that (i) justifies the low-rank choice with a brief complexity analysis, (ii) reports an additional ablation replacing the low-rank perturbation with temperature sampling (a common proxy for increased predictive variance), and (iii) shows that the aggregated uncertainty still correlates more strongly with final-answer correctness than per-token entropy alone. We also include a short discussion noting that any residual local artifacts are mitigated by the multi-step aggregation, which is supported by the observed alignment between uncertainty scores and overall solution accuracy rather than isolated token mistakes. revision: partial
Circularity Check
No significant circularity in TokUR derivation chain
full rationale
The paper proposes TokUR as a new framework that applies low-rank random weight perturbation during decoding to produce token-level predictive distributions, then aggregates these to estimate semantic uncertainty in LLM reasoning outputs. This is introduced as a methodological choice with subsequent empirical validation on mathematical reasoning datasets showing correlation to correctness and enabling test-time improvements. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the perturbation technique and aggregation are presented as independent inputs whose effectiveness is tested externally rather than assumed or renamed from prior results. The derivation remains self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Epistemic Uncertainty (EU) is the difference between TU and AU: EU(yt|y<t,x) ≜ TU(yt|y<t,x) − AU(yt|y<t,x) = I(yt;θ|y<t,x)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.
-
Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
Erroneous processing heads in attention layers cause hop-generalization failures in LLMs; dynamically deactivating them at test time improves multi-step reasoning.
-
Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
DPUA framework aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis via adaptive decoupled learning and GRPO-based optimization while preserving task accuracy.
Reference graph
Works this paper leans on
-
[1]
Uncertainty quantification in fine-tuned llms using lora ensembles.arXiv preprint arXiv:2402.12264,
Oleksandr Balabanov and Hampus Linander. Uncertainty quantification in fine-tuned llms using lora ensembles.arXiv preprint arXiv:2402.12264,
-
[2]
Area under the precision-recall curve: point estimates and confidence intervals
Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: point estimates and confidence intervals. InMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp. 451–466. Springer,
work page 2013
-
[3]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[5]
arXiv preprint arXiv:2402.03744 (2024)
Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection.arXiv preprint arXiv:2402.03744, 2024a. Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, and Wenpin Tang. Mallowspo: Fine-tune your llm with preference dispersions.arXiv preprint arXiv:2405.14953,...
-
[6]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, and Hua Wei. Understanding the uncertainty of llm explanations: A perspective based on reasoning topology.arXiv preprint arXiv:2502.17026,
-
[8]
Rainproof: An umbrella to shield text generator from out-of-distribution data
Maxime Darrin, Pablo Piantanida, and Pierre Colombo. Rainproof: An umbrella to shield text generator from out-of-distribution data. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5831–5857,
work page 2023
-
[9]
Siddartha Devic, Tejas Srinivasan, Jesse Thomason, Willie Neiswanger, and Vatsal Sharan. From calibration to collaboration: Llm uncertainty quantification should be more human-centered.arXiv preprint arXiv:2506.07461,
-
[10]
11 Published as a conference paper at ICLR 2026 Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Lingu...
work page 2026
-
[11]
Beam Search Strategies for Neural Machine Translation
Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation.arXiv preprint arXiv:1702.01806,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv preprint arXiv:2508.15260,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654,
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[17]
Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decomposing uncertainty for large language models through input clarification ensembling.arXiv preprint arXiv:2311.08718,
-
[18]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
URL https://openreview.net/forum? id=nZeVKeeFYf9. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
12 Published as a conference paper at ICLR 2026 Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty.arXiv preprint arXiv:2502.18581,
-
[20]
Large language models must be taught to know what they don’t know,
Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don’t know.arXiv preprint arXiv:2406.08391,
-
[21]
Position: Uncertainty quantification needs reassessment for large-language model agents,
Michael Kirchhof, Gjergji Kasneci, and Enkelejda Kasneci. Position: Uncertainty quantification needs reassessment for large-language model agents.arXiv preprint arXiv:2505.22655,
-
[22]
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
arXiv preprint arXiv:2305.19187 , year=
Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantifica- tion for black-box large language models.arXiv preprint arXiv:2305.19187,
-
[24]
Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, et al. Uncertainty quantification for in-context learning of large language models.arXiv preprint arXiv:2402.10189,
-
[25]
arXiv preprint arXiv:2402.02392 (2024)
Ollie Liu, Deqing Fu, Dani Yogatama, and Willie Neiswanger. Dellma: Decision making under uncertainty with large language models.arXiv preprint arXiv:2402.02392,
-
[26]
Uncertainty quantification and confidence calibration in large language models: A survey
Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey.arXiv preprint arXiv:2503.15850,
-
[27]
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint arXiv:2303.08896,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Factscore: Fine-grained atomic evaluation of factual precision in long form text generation
13 Published as a conference paper at ICLR 2026 Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processi...
work page 2026
-
[29]
Correcting Length Bias in Neural Machine Translation
Kenton Murray and David Chiang. Correcting length bias in neural machine translation.arXiv preprint arXiv:1808.10006,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods. arXiv preprint arXiv:2502.01618,
-
[31]
Haizhou Shi, Yibin Wang, Ligong Han, Huan Zhang, and Hao Wang. Training-free bayesianization for low-rank adapters of large language models.arXiv preprint arXiv:2412.05723,
-
[32]
Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760,
-
[33]
Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3),
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirica...
work page 2023
-
[35]
Solving math word problems with process- and outcome-based feedback
Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.330. URL https://aclanthology.org/ 2023.emnlp-main.330. Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.emnlp-main.330 2023
-
[36]
Mutual information alleviates hallucinations in abstractive summarization
Liam Van Der Poel, Ryan Cotterell, and Clara Meister. Mutual information alleviates hallucinations in abstractive summarization. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5956–5965,
work page 2022
-
[37]
Attention is all you need.Advances in neural information processing systems, 30,
14 Published as a conference paper at ICLR 2026 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30,
work page 2026
-
[38]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh- ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,
work page internal anchor Pith review Pith/arXiv arXiv
-
[39]
Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, and Hao Wang. Blob: Bayesian low-rank adaptation by backpropagation for large language models.arXiv preprint arXiv:2406.11675,
-
[40]
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prom...
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.arXiv preprint arXiv:2306.13063,
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,
Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,
-
[43]
Bayesian low-rank adaptation for large language models.arXiv preprint arXiv:2308.13111,
Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models.arXiv preprint arXiv:2308.13111,
-
[44]
Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, and Yarin Gal. Uncertainty-aware step-wise verification with generative reward models.arXiv preprint arXiv:2502.11250,
-
[45]
Boxuan Zhang and Ruqi Zhang. Cot-uq: Improving response-wise uncertainty quantification in llms with chain-of-thought.arXiv preprint arXiv:2502.17214,
-
[46]
Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. Enhancing uncertainty-based hallucination detection with stronger focus.arXiv preprint arXiv:2311.13230,
-
[47]
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
15 Published as a conference paper at ICLR 2026 Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[48]
Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, and Xiaoxing Ma. A theoretical study on bridging internal probability and self-consistency for llm reasoning.arXiv preprint arXiv:2510.15444,
-
[49]
Uncertainty-guided chain-of-thought for code generation with llms.arXiv preprint arXiv:2503.15341,
Yuqi Zhu, Ge Li, Xue Jiang, Jia Li, Hong Mei, Zhi Jin, and Yihong Dong. Uncertainty-guided chain-of-thought for code generation with llms.arXiv preprint arXiv:2503.15341,
-
[50]
16 Published as a conference paper at ICLR 2026 APPENDIX In Appendix A, we describe the role of large language models (LLMs) in our work. In Appendix B, we present the full algorithmic description of our method with low-rank weight perturbation. In Ap- pendix C, we provide detailed proofs for all propositions presented in the main paper. In Appendix D, we...
work page 2026
-
[51]
14:end while 15:Output:The set of particles in the end. Lemma C.2(Chain rule of Conditional Entropy (Cover, 1999)).Let X and Y be two random variables, then the conditional entropy of the joint distributionH(X,Y)can be decomposed as: H(X,Y) =H(X) +H(Y|X)(21) Lemma C.1 (Cover,
work page 1999
-
[52]
For Aleatoric Uncertainty (AU) and Total Uncertainty (TU) defined in Eqn
Proof. For Aleatoric Uncertainty (AU) and Total Uncertainty (TU) defined in Eqn. 10 and Eqn. 11, both are expressed in terms of entropy. Therefore, the decomposition of sequence-level uncertainty can be directly derived using the chain rule stated in the Lemma C.2. For Epistemic Uncertainty (EU), also calledmutual informationdefined in Eqn. 12, we proceed...
work page 2026
-
[53]
For the test-time scaling experiments in Sec. 4.2, we apply length normalization to TokUR to mitigate the bias introduced by varying sequence lengths. In contrast, the effect of length normalization may differ in hallucination detection tasks. To investigate this, we conduct additional ablation studies in Appendix E.5.3, examining the impact of length nor...
work page 2023
-
[54]
True”, normalized by the sum of probabilities of token “True
Prompt Example Solve the following math problem efficiently and clearly: -For simple problems (2 steps or fewer): Provide a concise solution with minimal explanation. -For complex problems (3 steps or more): Use this step-by-step format: ## Step 1: [Concise description] [Brief explanation and calculations] ## Step 2: [Concise description] [Brief explanati...
work page 2018
-
[55]
Overall, these results provide strong additional evidence that TokUR ’s token-level uncertainty estimates maintain a robust correlation with model accuracy across diverse LLM families and parameter scales. E.3 TEST-TIMESCALING VIAUNCERTAINTYESTIMATION We provide an additional visualization of the test-time scaling results in Fig. 2 .While the complete num...
work page 2026
-
[56]
plotted in the Length Normal- ized EU-AU uncertainty space, as quantified by our token-level uncertainty metrics (Eqn. 13). All experiments use Llama-3.2-1B-Instruct as the base model. For reference, the Pass@1 baseline accuracy (GSM8K: 44.43%; MATH500: 25.60%) is also shown as red dashed lines, high- lighting the gains achieved through test-time scaling....
work page 2025
-
[57]
is an inference-time scaling method for LLM reasoning (details in Appendix B). Building upon this algorithm, we use uncertainty as the score for each particle at each step to guide the model’s generation process. We set the number of particles to N = 16 and the decoding temperature to τ = 0.8. We repeat the experiments with three different random seeds to...
work page 2026
-
[58]
using Llama-3.2-1B-Instruct as the base model. Methods evaluated include log-likelihood (LL) and three variants of TokUR (TU, AU and EU) with both Maj@N and WBoN strategies.Boldfaceand underlining denote the best and the second-best performance, respectively. Dataset Score Method Number of Samples N N=16 N=32 N=64 N=128 N=256 N=512 Llama-3.2-1B-Instruct G...
work page 2036
-
[59]
In general, higher temperatures lead to more diverse responses
E.5.2 THEEFFECT OFTOKENDECODINGTEMPERATUREτONUNCERTAINTYESTIMATION During text generation with large language models, the decoding temperature introduces uncertainty into the model’s output. In general, higher temperatures lead to more diverse responses. In this section, we investigate the relationship between decoding temperature τ and uncertainties esti...
work page 2025
-
[60]
We compare TokUR with and withoutLengthNormalization (LN), along with representative baselines. In addition, we introduce a naive baseline,Negative Length, which uses sequence length alone as a confidence signal. Results.As shown in Table 9, the impact of length normalization varies significantly across methods. For bothLLand TokUR, normalization consiste...
work page 2026
-
[61]
The visualizations are shown in Fig. 7~Fig. 10, where Aleatoric Uncertainty (AU, in RED) and Epistemic Uncertainty (EU, in GREEN) are visualized as text-heatmap. The background shading of each token corresponds to the magnitude of its uncertainty: the darker the shade, the higher the uncertainty, indicating a lower model confidence for that token. We obse...
work page 2026
-
[62]
In the incorrect solution, the model demonstrated significant uncertainty when mistakenly reversing “9600−7200 ” as “7200−9600 ”, and also exhibited high uncertainty at the negative sign “−” in the final answer. 30 Published as a conference paper at ICLR 2026 Problem:Kiarrais twice as old as Bea. Job is 3 times older than Bea. Figaro is 7 years older than...
work page 2026
-
[63]
60”. In the correct solution on the left, the model had low uncertainty for the correct answer “36
The model exhibits high uncertainty at the token next to “15x” because it overlooks the constant term. Furthermore, it can be observed that tokens associated with high uncertainty occur more frequently in the incorrect solution. 32 Published as a conference paper at ICLR 2026 Problem:In regular pentagon $FGHIJ$, extending the sides of the pentagon, as sho...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.