pith. machine review for the scientific record.

arxiv: 2505.11737 · v4 · submitted 2025-05-16 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Dimitris Metaxas, Haizhou Shi, Hao Wang, Haoxian Chen, Hengyi Wang, Huan Zhang, Kai Xu, Ligong Han, Tunyu Zhang, Xiaoxiao He, Yibin Wang, Zhuowei Li

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · cs.CL
keywords reasoning · uncertainty · tokur · estimation · llms · model · responses · token-level
0 comments
abstract

While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
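
The abstract suggests a simple mechanical core: run several decoding passes under random low-rank perturbations of the model weights, measure per-token predictive uncertainty, and aggregate it over the response. The toy sketch below illustrates that idea only; the single linear output layer, the perturbation scale, the rank r, the number of passes K, the use of predictive entropy, and the mean aggregation are all assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch of TokUR-style token-level uncertainty, assuming the low-rank
# perturbation takes the form W + A @ B with A, B resampled on each pass.
# Toy model, noise scale, rank, pass count, and aggregation are all assumed.
import torch

torch.manual_seed(0)
vocab, d, r, K = 50, 16, 4, 8          # vocab size, hidden dim, rank, #passes
W = torch.randn(vocab, d) * 0.1        # frozen output-projection weight
h = torch.randn(10, d)                 # hidden states for 10 generated tokens

token_entropies = []
for _ in range(K):                      # K stochastic decoding passes
    A = torch.randn(vocab, r) * 0.02    # low-rank noise factors (assumed scale)
    B = torch.randn(r, d) * 0.02
    logits = h @ (W + A @ B).T          # perturbed logits, one row per token
    p = logits.softmax(-1)
    token_entropies.append(-(p * p.clamp_min(1e-12).log()).sum(-1))

# Token-level uncertainty: mean predictive entropy across perturbed passes.
u_token = torch.stack(token_entropies).mean(0)   # shape: (10,)
# Response-level ("semantic") uncertainty: aggregate over tokens (mean here).
u_response = u_token.mean()
print(u_token, u_response)
```

In a real setup the perturbation would be applied to a pretrained LLM during decoding, and the response-level score could gate test-time self-improvement (for instance, resampling high-uncertainty responses; that use is one plausible reading of the abstract, not a confirmed detail).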

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates (see the sketch after this list).

  2. Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.

  3. Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    DPUA framework aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis via adaptive decoupled learning and GRPO-based optimization while preserving task accuracy.
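
The BAS blurb above describes a threshold-aggregated answer-or-abstain utility that truthful confidence uniquely maximizes. The sketch below shows the general construction behind such scores (a Schervish-style mixture of threshold decision problems); the threshold grid and the +1 / -t/(1-t) payoffs are illustrative assumptions, not BAS's actual definition.

```python
# Hedged sketch: at risk level t the model answers iff its reported confidence
# c >= t, earning +1 if correct and -t/(1-t) if wrong (0 for abstaining), and
# the utility is averaged over a grid of thresholds. Grid and payoffs assumed.
import numpy as np

def aggregate_utility(c, correct, ts=np.linspace(0.05, 0.95, 19)):
    payoff = np.where(correct, 1.0, -ts / (1.0 - ts))  # answer payoff at each t
    return float(np.mean(np.where(c >= ts, payoff, 0.0)))

# Truthfulness check: if the true correctness probability is p, then answering
# at threshold t pays p - (1-p) * t/(1-t), which is >= 0 exactly when p >= t,
# so reporting c = p makes the optimal call at every threshold simultaneously.
p = 0.7
expected = lambda c: p * aggregate_utility(c, True) + (1 - p) * aggregate_utility(c, False)
print(max(np.linspace(0, 1, 101), key=expected))  # ≈ 0.7
```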