pith. machine review for the scientific record. sign in

arxiv: 2604.23333 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CL

Recognition: unknown

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

Anqi Liu, Benjamin Van Durme, Chunsheng Zuo, Liaoyaqi Wang, William Jurayj

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM reasoningconfidence calibrationprocess supervisionreinforcement learningconformal predictionmargin reward
0
0 comments X

The pith

Reinforcement learning with widened confidence margins between correct and incorrect steps calibrates LLM reasoning while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RLCM, a reinforcement learning approach that adds a margin term to process rewards so models learn to assign higher confidence to correct intermediate reasoning steps than to incorrect ones. This differs from standard outcome supervision by focusing on confidence gaps within single trajectories rather than final-answer likelihoods. If the method works, calibrated confidence signals become usable for downstream controls like conformal prediction and weighted aggregation without accuracy loss. The authors demonstrate gains across math, code, logic, and science tasks.

Core claim

RLCM encourages models to widen the confidence margin between correct and incorrect steps within a reasoning trajectory by applying a margin-enhanced process reward over intermediate-budget completions, jointly optimizing for both correctness and calibration reliability.

What carries the argument

Margin-enhanced process reward that penalizes narrow confidence gaps between correct and incorrect steps inside a single trajectory during RL optimization.

If this is right

  • Calibration improves on mathematical, code, logic, and science benchmarks while accuracy is maintained or increased.
  • Calibrated confidence enables more efficient conformal risk control.
  • Calibrated confidence supports effective confidence-weighted answer aggregation.
  • Process-level supervision over intermediate steps outperforms outcome-only rewards for calibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained this way could allocate less unnecessary test-time compute by stopping early when confidence margins are wide.
  • The same margin signal might improve safety filters that reject low-margin outputs before they reach users.
  • Extending the margin idea to multi-step planning agents could reduce error propagation in long chains.

Load-bearing premise

A margin-enhanced process reward can be stably optimized by RL without introducing new biases or needing extensive per-task tuning.

What would settle it

Running the method on a held-out reasoning benchmark where expected calibration error fails to drop below the baseline while accuracy stays flat or declines.

Figures

Figures reproduced from arXiv: 2604.23333 by Anqi Liu, Benjamin Van Durme, Chunsheng Zuo, Liaoyaqi Wang, William Jurayj.

Figure 1
Figure 1. Figure 1: Accuracy v.s. ECE at the 8k token budget on AIME24. RLCM preserves high ac￾curacy while substantially improving calibra￾tion, achieving the strongest Pareto trade-off among the compared methods. This limitation is partly rooted in outcome￾driven training objectives, which optimize final answer correctness but do not ex￾plicitly encourage the model’s confidence to reflect its true probability of being cor￾r… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of RLCM. We select intermediate compute budgets b1, . . . , bn along a reasoning trajectory. At each budget, Monte Carlo sampling estimates intermediate correctness, while a lightweight probe predicts confidence from the hidden state. Instead of directly aligning confidence with correctness, RLCM encourages a larger confidence margin between correct and incorrect intermediate states within the sam… view at source ↗
Figure 3
Figure 3. Figure 3: Trade-off between accuracy and ex￾pected calibration error (ECE) across Brier reward weights; λ = 0 is the GRPO fallback. Calibration measures how well a model’s confidence aligns with its empirical accu￾racy. A model is perfectly calibrated if, for any confidence level p, the probability that its prediction is correct equals p, which can be written as: P(Yˆ = Y | confidence = p) = p. One natural approach … view at source ↗
Figure 5
Figure 5. Figure 5: Frequency of three types of epis￾temic uncertainty markers in AIME25 reason￾ing traces, normalized by token count. Following Qian et al. (2026); Kim et al. (2026), we analyze verbalized uncertainty expressions in reasoning traces to charac￾terize model behavior at the semantic level. We group these expressions into three cat￾egories: self-correction (e.g., wait, let me re￾consider), confidence hedging (e.g… view at source ↗
Figure 4
Figure 4. Figure 4: Average accuracy (left) and ECE (right) over five in-domain datasets across token view at source ↗
Figure 6
Figure 6. Figure 6: Risk control on AMC with LTT. Compared with GRPO and RLCR, RLCM yields a smooth, calibrated target-to-realized accuracy curve (Left), uses fewer tokens at comparable target accuracies (Right), and indicating more useful confidence for early exiting. 5.1 Calibrated Reasoning for Risk Control Conformal Risk Control (Angelopoulos et al., 2024) extends conformal prediction (Shafer & Vovk, 2008) to control the … view at source ↗
Figure 7
Figure 7. Figure 7: RLCM’s margin reward encourages it to expose useful uncertainty scores to the view at source ↗
Figure 8
Figure 8. Figure 8: Per-benchmark aggregation performance across token budgets for view at source ↗
Figure 9
Figure 9. Figure 9: Confidence distributions for RLCR, C2GSPG, and RLCM. RLCR is highly quan￾tized, C2GSPG is compressed in a narrow high-confidence range, and RLCM uses a broader portion of the confidence scale. Red dashed lines mark the mean; black dotted lines mark the median. C.1 Observation view at source ↗
Figure 10
Figure 10. Figure 10: Frequency of uncertainty indicators across three semantic categories, measured as view at source ↗
read the original abstract

Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improve large language models (LLM) reasoning ability. Yet, outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (\textbf{RLCM}), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework for LLMs. It employs a margin-enhanced process reward over intermediate-budget completions to jointly optimize correctness and confidence reliability by widening the confidence margin between correct and incorrect steps within reasoning trajectories. The authors claim that this yields substantial improvements in calibration across mathematical, code, logic, and science benchmarks while maintaining or improving accuracy, and that the resulting calibrated confidence enables more efficient conformal risk control and effective confidence-weighted aggregation.

Significance. If the empirical claims hold with robust evidence, the work addresses a timely and important limitation in scaling test-time compute for LLM reasoning: outcome-based RL often produces overconfident models that undermine reliability in downstream control and aggregation tasks. The process-supervision approach focused on confidence margins offers a distinct alternative to likelihood alignment and could improve trustworthiness in reasoning systems if the gains prove stable and generalizable.

major comments (2)
  1. Abstract: The central claim of 'substantially improves calibration while maintaining or improving accuracy' is stated without any quantitative results, specific baselines, ablation details, error bars, or statistical significance tests. This absence makes it impossible to determine whether the data support the claim or to evaluate effect sizes relative to prior calibration methods.
  2. Method (reward formulation): The margin-enhanced process reward is described only at a high level as widening the confidence margin between correct and incorrect steps. Without the exact mathematical definition, how it is incorporated into the RL objective, or training hyperparameters, it is impossible to verify whether the method is independent of quantities already fitted in the model or whether it can be stably optimized without introducing new biases or requiring per-task tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of clarity in our presentation. We address each major comment below and outline the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: The central claim of 'substantially improves calibration while maintaining or improving accuracy' is stated without any quantitative results, specific baselines, ablation details, error bars, or statistical significance tests. This absence makes it impossible to determine whether the data support the claim or to evaluate effect sizes relative to prior calibration methods.

    Authors: We agree that the abstract would be strengthened by including quantitative highlights to support the central claim. In the revised manuscript, we will update the abstract to report key results, such as average ECE reductions and accuracy maintenance (or gains) across the math, code, logic, and science benchmarks, with reference to the primary baselines and tables in the main text. This will allow readers to immediately assess effect sizes. revision: yes

  2. Referee: Method (reward formulation): The margin-enhanced process reward is described only at a high level as widening the confidence margin between correct and incorrect steps. Without the exact mathematical definition, how it is incorporated into the RL objective, or training hyperparameters, it is impossible to verify whether the method is independent of quantities already fitted in the model or whether it can be stably optimized without introducing new biases or requiring per-task tuning.

    Authors: We acknowledge that the main-text description of the reward is concise. The exact formulation appears in Equation (3) of Section 3.2, where the margin-enhanced process reward is defined as r_t = m * I(correct) with m computed as the difference in per-step confidence between correct and incorrect reasoning steps within the same trajectory; this replaces the standard outcome reward in the policy-gradient objective. Hyperparameters (including margin coefficient, RL learning rate, and batch size) are listed in Appendix A. To improve accessibility, we will expand the main-text exposition of the reward definition and its integration into the RL objective, and we will add a brief discussion of why the margin term is independent of the base model's fitted likelihoods. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces RLCM as a new calibration-aware RL framework using a margin-enhanced process reward over intermediate steps. The abstract and available description frame this as an independent training objective that jointly optimizes correctness and confidence reliability, without any quoted equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs by construction. No self-definitional loops, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the provided text. The claimed improvements in calibration and downstream uses (conformal risk control, aggregation) are presented as empirical outcomes of the new objective rather than tautological restatements of fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard RL assumptions plus the unstated premise that widening per-step margins produces better-calibrated final outputs; no explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5448 in / 1041 out tokens · 94347 ms · 2026-05-08T08:16:14.143812+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

94 extracted references · 46 canonical work pages · 11 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW

  2. [2]

    Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani

    Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Cand \`e s, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control . The Annals of Applied Statistics, 19 0 (2): 0 1641 -- 1662, 2025. doi:10.1214/24-AOAS1998. URL https://doi.org/10.1214/24-AOAS1998

  3. [3]

    Conformal risk control

    Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=33XGfHLtZg

  4. [4]

    Reconsidering LLM uncertainty estimation methods in the wild

    Yavuz Faruk Bakman, Duygu Nur Yaldiz, Sungmin Kang, Tuo Zhang, Baturalp Buyukates, Salman Avestimehr, and Sai Praneeth Karimireddy. Reconsidering LLM uncertainty estimation methods in the wild. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational ...

  5. [5]

    Linguistic calibration of long-form generations

    Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

  6. [6]

    Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models

    David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege \"O zsoy, Kamilia Zaripova, Matthias Keicher, and Nassir Navab. Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=yResLmrVO1

  7. [7]

    Bereket and J

    J. Bereket and J. Leskovec. The calibration-performance trade-off in reasoning models. arXiv preprint arXiv:2502.04112, 2025

  8. [8]

    Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. Do androids know they`re only dreaming of electric sheep? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 4401--4420, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/...

  9. [9]

    Uncertain natural language inference

    Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. Uncertain natural language inference. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 8772--8779, Online, July 2020. Association for Computational Lin...

  10. [10]

    A close look into the calibration of pre-trained language models

    Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, and Heng Ji. A close look into the calibration of pre-trained language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1343--1367, Toronto, Canada, July 2023. Associat...

  11. [11]

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models

    Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=lyaHnHDdZl

  12. [12]

    Evaluating language models as risk scores

    Andr \'e F Cruz, Moritz Hardt, and Celestine Mendler-D \"u nner. Evaluating language models as risk scores. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=qrZxL3Bto9

  13. [13]

    arXiv preprint arXiv:2603.09309 (2026)

    Yuyang Dai. Rescaling confidence: What scale design reveals about llm metacognition. arXiv preprint arXiv:2603.09309, 2026

  14. [14]

    Beyond binary rewards: Training lms to reason about their uncertainty.arXiv preprint arXiv:2507.16806, 2025

    Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, and Jacob Andreas. Beyond binary rewards: Training lms to reason about their uncertainty, 2025. URL https://arxiv.org/abs/2507.16806

  15. [15]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  16. [16]

    Calibration of pre-trained transformers

    Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 295--302, 2020

  17. [17]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630 0 (8017): 0 625--630, 2024

  18. [18]

    Deep think with confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence, 2025. URL https://arxiv.org/abs/2508.15260

  19. [19]

    Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D. Manning. Synthetic Data Generation & Multi - Step RL for Reasoning & Tool Use , April 2025. URL http://arxiv.org/abs/2504.04736. arXiv:2504.04736 [cs]

  20. [20]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp.\ 1321--1330. PMLR, 2017

  21. [21]

    Language model cascades: Token-level uncertainty and beyond

    Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language model cascades: Token-level uncertainty and beyond. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KgaBScZ4VI

  22. [22]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. O lympiad B ench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings...

  23. [23]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe

  24. [24]

    Efficient test-time scaling via self-calibration

    Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient Test - Time Scaling via Self - Calibration , February 2025 a . URL http://arxiv.org/abs/2503.00031. arXiv:2503.00031 [cs]

  25. [25]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43 0 (2), January 2025 b . ISSN 1046-8188. doi:10.1145/3703155. URL https://doi....

  26. [26]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement Learning via Self - Distillation , January 2026. URL http://arxiv.org/abs/2601.20802. arXiv:2601.20802 [cs] version: 1

  27. [27]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  28. [28]

    Calibrating zero-shot cross-lingual (un-) structured predictions

    Zheng Ping Jiang, Anqi Liu, and Benjamin Van Durme. Calibrating zero-shot cross-lingual (un-) structured predictions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 2648--2674, 2022

  29. [29]

    Addressing the Binning Problem in Calibration Assessment through Scalar Annotations

    Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Addressing the binning problem in calibration assessment through scalar annotations. Transactions of the Association for Computational Linguistics, 12: 0 120--136, 2024. doi:10.1162/tacl_a_00636. URL https://aclanthology.org/2024.tacl-1.7/

  30. [30]

    Conformal linguistic calibration: Trading-off between factuality and specificity

    Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Conformal linguistic calibration: Trading-off between factuality and specificity. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=MWF1ZzYnxJ

  31. [31]

    Is that your final answer? test-time scaling improves selective question answering

    William Jurayj, Jeffrey Cheng, and Benjamin Van Durme. Is that your final answer? test-time scaling improves selective question answering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 636--644, Vi...

  32. [32]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  33. [33]

    Scalable best-of-n selection for large language models via self-certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=nddwJseiiy

  34. [34]

    Large language models must be taught to know what they don't know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don't know. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA, 2024. Cur...

  35. [35]

    Understanding reasoning in llms through strategic information allocation under uncertainty, 2026

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, and Yuqing Yang. Understanding reasoning in llms through strategic information allocation under uncertainty, 2026. URL https://arxiv.org/abs/2603.15500

  36. [36]

    Semantic

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024. URL https://arxiv.org/abs/2406.15927

  37. [37]

    Semantic entropy probes: Robust and cheap hallucination detection in LLM s, 2025

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth A Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLM s, 2025. URL https://openreview.net/forum?id=YQvvJjLWX0

  38. [38]

    Think with moderation: Reasoning models and confidence calibration in the climate domain

    Romain Lacombe, Kerrie Wu, and Eddie Dilworth. Think with moderation: Reasoning models and confidence calibration in the climate domain. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. URL https://openreview.net/forum?id=e7G5aeMOUP

  39. [39]

    Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh...

  40. [40]

    Taming overconfidence in LLM s: Reward calibration in RLHF

    Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. Taming overconfidence in LLM s: Reward calibration in RLHF . In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=l0tg0jzsdL

  41. [41]

    Enhancing reasoning through process supervision with monte carlo tree search, 2025 a

    Shuangtao Li, Shuaihao Dong, Kexin Luan, Xinhan Di, and Chaofan Ding. Enhancing reasoning through process supervision with monte carlo tree search, 2025 a . URL https://arxiv.org/abs/2501.01478

  42. [42]

    Conftuner: Training large language models to express their confidence verbally

    Yibo Li, Miao Xiong, Jiaying Wu, and Bryan Hooi. Conftuner: Training large language models to express their confidence verbally. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 b . URL https://openreview.net/forum?id=VZQ04Ojhu5

  43. [43]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  44. [44]

    C ^2 gspg: Confidence-calibrated group sequence policy gradient towards self-aware reasoning, 2025 a

    Haotian Liu, Shuo Wang, and Hongteng Xu. C ^2 gspg: Confidence-calibrated group sequence policy gradient towards self-aware reasoning, 2025 a . URL https://arxiv.org/abs/2509.23129

  45. [45]

    C\ 2\ GSPG : Confidence-calibrated group sequence policy gradient towards self-aware reasoning, 2026

    Haotian Liu, Shuo Wang, and Hongteng Xu. C\ 2\ GSPG : Confidence-calibrated group sequence policy gradient towards self-aware reasoning, 2026. URL https://openreview.net/forum?id=lgjwLQ85vm

  46. [46]

    Logiqa: a challenge dataset for machine reading comprehension with logical reasoning

    Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20, 2021. ISBN 9780999241165

  47. [47]

    Wong, Lidia S

    Shudong Liu, Zhaocong Li, Xuebo Liu, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, and Min Zhang. Can LLM s learn uncertainty on their own? expressing uncertainty effectively in a self-training manner. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 2163...

  48. [48]

    Uncertainty quantification and confidence calibration in large language models: A survey

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD '25, pp.\ 6107–6117, New York, NY, USA, 2025 b . Association for Computing Machinery. ISBN 97984...

  49. [49]

    Your pre-trained LLM is secretly an unsupervised confidence calibrator

    Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. Your pre-trained LLM is secretly an unsupervised confidence calibrator. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 a . URL https://openreview.net/forum?id=I4PJYZvfW5

  50. [50]

    Improve mathematical reasoning in language models with automated process supervision, 2025 b

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models with automated process supervision, 2025 b . URL https://openreview.net/forum?id=KwPUQOQIKt

  51. [51]

    Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

    Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, and Le Sun. Decoupling reasoning and confidence: Resurrecting calibration in reinforcement learning from verifiable rewards, 2026. URL https://arxiv.org/abs/2603.09117

  52. [52]

    Reasoning about uncertainty: Do reasoning models know when they don’t know? In Findings of the Association for Computational Linguistics: EACL 2026, pp.\ 3408--3458, 2026

    Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Sho, and Anirudha Majumdar. Reasoning about uncertainty: Do reasoning models know when they don’t know? In Findings of the Association for Computational Linguistics: EACL 2026, pp.\ 3408--3458, 2026

  53. [53]

    Closing the confidence-faithfulness gap in large language models.arXiv preprint arXiv:2603.25052, 2026

    Miranda Muqing Miao and Lyle Ungar. Closing the confidence-faithfulness gap in large language models. arXiv preprint arXiv:2603.25052, 2026

  54. [54]

    Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, and SACHIN DEV SHARMA

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Lang...

  55. [55]

    When do LLM s need retrieval augmentation? mitigating LLM s' overconfidence helps retrieval augmentation

    Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. When do LLM s need retrieval augmentation? mitigating LLM s' overconfidence helps retrieval augmentation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 11375--11388, Bangkok, Thailand, August 2024. Association for Computa...

  56. [56]

    OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  57. [57]

    Proceedings of the AAAI Conference on Artificial Intelligence29(1) (2015) https://doi.org/10.1609/aaai.v29i1.9602

    Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29 0 (1), Feb. 2015. doi:10.1609/aaai.v29i1.9602. URL https://ojs.aaai.org/index.php/AAAI/article/view/9602

  58. [58]

    Optimizing anytime reasoning via budget relative policy optimization

    Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Optimizing anytime reasoning via budget relative policy optimization. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=kMGc0yvM61

  59. [59]

    Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in LLM reasoning

    Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=E1FrjgaG1J

  60. [60]

    Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement finetuning. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=TqODUDsU4u

  61. [61]

    Jaakkola, and Regina Barzilay

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pzUhfQ74c5

  62. [62]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  63. [63]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  64. [64]

    Rewarding progress: Scaling automated process verifiers for LLM reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=A6Y7AqlzLW

  65. [65]

    A tutorial on conformal prediction

    Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. J. Mach. Learn. Res., 9: 0 371–421, June 2008. ISSN 1532-4435

  66. [66]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  67. [67]

    Wornell, and Soumya Ghosh

    Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory W. Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference o...

  68. [68]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  69. [69]

    Alison Noble

    Joshua Strong, Qianhui Men, and J. Alison Noble. Trustworthy and practical ai for healthcare: a guided deferral system with large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances ...

  70. [70]

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empiric...

  71. [71]

    Calibrating verbalized probabilities for large language models, 2024 a

    Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, and Patrick Ernst. Calibrating verbalized probabilities for large language models, 2024 a . URL https://arxiv.org/abs/2410.06707

  72. [72]

    Always tell me the odds: Fine-grained conditional probability estimation

    Liaoyaqi Wang, Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Always tell me the odds: Fine-grained conditional probability estimation. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=xhDcG8qtw9

  73. [73]

    Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math- Shepherd : Verify and Reinforce LLMs Step -by-step without Human Annotations , February 2024 b . URL http://arxiv.org/abs/2312.08935. arXiv:2312.08935 [cs]

  74. [74]

    Calibrating verbalized confidence with self-generated distractors

    Victor Wang and Elias Stengel-Eskin. Calibrating verbalized confidence with self-generated distractors. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pZs4hhemXc

  75. [75]

    Conformal thinking: Risk control for reasoning on a compute budget, 2026

    Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, and Eric Nalisnick. Conformal thinking: Risk control for reasoning on a compute budget, 2026. URL https://arxiv.org/abs/2602.03814

  76. [76]

    C on U : Conformal uncertainty in large language models with correctness coverage guarantees

    Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Xiaoshuang Shi, Kaidi Xu, Heng Tao Shen, and Xiaofeng Zhu. C on U : Conformal uncertainty in large language models with correctness coverage guarantees. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 6886--...

  77. [77]

    Thought calibration: Efficient and confident test-time scaling

    Menghua Wu, Cai Zhou, Stephen Bates, and Tommi Jaakkola. Thought calibration: Efficient and confident test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 14302--14316, Suzhou, China, November 2025. Associati...

  78. [78]

    Can LLM s express their uncertainty? an empirical evaluation of confidence elicitation in LLM s

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLM s express their uncertainty? an empirical evaluation of confidence elicitation in LLM s. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjeQKFxFpZ

  79. [79]

    Beyond correctness: Harmonizing process and outcome rewards through rl training

    Yifan Xu et al. Beyond correctness: Harmonizing process and outcome rewards through rl training. OpenReview preprint, 2025

  80. [80]

    Llm probability concentration: How alignment shrinks the generative horizon.arXiv preprint arXiv:2506.17871,

    Chenghao Yang, Sida Li, and Ari Holtzman. Llm probability concentration: How alignment shrinks the generative horizon. arXiv preprint arXiv:2506.17871, 2025

Showing first 80 references.