pith. machine review for the scientific record.

arxiv: 2604.03216 · v1 · submitted 2026-04-03 · 💻 cs.CL

Recognition: 2 theorem links


BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Anshul Thakur, Boyan Gao, David A. Clifton, Edward Phillips, Fredrik K. Gustafsson, Sean Wu

Pith reviewed 2026-05-13 19:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords Behavioral Alignment Score · LLM confidence · abstention · decision theory · calibration · overconfidence · proper scoring rules · evaluation metrics

The pith

Truthful confidence estimates uniquely maximize expected utility for LLMs deciding when to answer or abstain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often give confident but wrong answers when abstaining would reduce risk. The paper introduces the Behavioral Alignment Score to evaluate how well confidence guides answer-or-abstain choices under different risk preferences. BAS comes from an explicit utility model that rewards correct answers and penalizes mistakes asymmetrically based on reported confidence. The central theoretical result is that only truthful confidence achieves the highest possible expected BAS across risk thresholds. This matters because common metrics such as expected calibration error treat over- and under-confidence symmetrically and miss the practical cost of overconfident errors.

Core claim

The Behavioral Alignment Score aggregates realized utility from an answer-or-abstain model across a continuum of risk thresholds. Truthful confidence estimates uniquely maximize expected BAS utility. Unlike symmetric proper scoring rules such as log loss, BAS imposes a stronger penalty on overconfident errors than on underconfident ones. Empirical results show that models with similar ECE or AURC can differ markedly in BAS due to highly overconfident mistakes, and that interventions such as top-k elicitation and post-hoc calibration raise BAS values.

What carries the argument

The Behavioral Alignment Score (BAS), computed by integrating utility over risk thresholds from an answer-or-abstain decision model.
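The threshold-integrated construction can be sketched directly. The answer-iff-q-at-least-t rule and the (1 - t, -t) payoff pair below are assumptions for illustration, not the paper's exact utility parameters:

```python
def bas(confidences, correct, n_thresholds=1000):
    """Sketch of a Behavioral Alignment Score under an assumed utility:
    at each risk threshold t in [0, 1] the model answers iff its reported
    confidence q >= t, earning 1 - t for a correct answer, -t for a wrong
    one, and 0 for abstaining. Realized utility is averaged over items
    and integrated over t."""
    total = 0.0
    for i in range(n_thresholds):
        t = i / (n_thresholds - 1)           # uniform grid over [0, 1]
        u = 0.0
        for q, y in zip(confidences, correct):
            if q >= t:                       # answer; otherwise abstain (utility 0)
                u += (1.0 - t) if y else -t  # asymmetric, risk-indexed payoff
        total += u / len(confidences)
    return total / n_thresholds              # Riemann average over thresholds

# An overconfident error (q = 0.99, wrong) costs far more under this score
# than an underconfident correct answer (q = 0.2, right) forgoes.
print(bas([0.9, 0.2], [True, True]))
print(bas([0.9, 0.99], [True, False]))
```

Under this instantiation the per-item score has a closed form (q - q²/2 if correct, -q²/2 if wrong), so the wrong-answer penalty is bounded by -0.5 but grows quadratically with the reported confidence.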

If this is right

  • Larger and more accurate models tend to achieve higher BAS.
  • Models with similar ECE or AURC can exhibit very different BAS because of highly overconfident errors.
  • Even frontier models remain prone to severe overconfidence on some tasks.
  • Top-k confidence elicitation and post-hoc calibration can meaningfully improve BAS.
  • BAS reveals decision-useful differences in confidence that standard metrics miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • BAS could be adapted to set automatic abstention thresholds in high-stakes domains where error costs are known in advance.
  • The asymmetric penalty structure suggests that loss functions emphasizing overconfidence reduction may improve downstream decision performance more than symmetric calibration losses.
  • Extending the utility model to sequential or multi-step tasks would allow BAS-style evaluation of confidence in agentic settings.

Load-bearing premise

The chosen answer-or-abstain utility model accurately reflects the real costs and risk preferences that matter in downstream applications.

What would settle it

A model with systematically non-truthful confidence estimates that nonetheless achieved strictly higher expected BAS than a truthful model under the same utility function would falsify the uniqueness claim.
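That test can be run numerically under one assumed instantiation of the answer-or-abstain utility (answer iff the reported confidence q is at least the risk threshold t; payoff 1 - t if correct, -t if wrong, 0 on abstention; t uniform on [0, 1]); the payoffs are illustrative assumptions, not the paper's stated parameters. Truthful reporting should come out strictly ahead of any distorted report:

```python
import random

def expected_bas(p_true, q_report, n=200_000, seed=0):
    """Monte Carlo estimate of expected per-item BAS under the assumed
    utility. Closed form in this instantiation: E[BAS] = p*q - q**2/2,
    uniquely maximized at q = p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        correct = rng.random() < p_true
        t = rng.random()                  # risk threshold ~ Uniform[0, 1]
        if q_report >= t:                 # answer; otherwise abstain (0)
            total += (1.0 - t) if correct else -t
    return total / n

p = 0.7
truthful = expected_bas(p, 0.70)
over     = expected_bas(p, 0.95)   # overconfident report
under    = expected_bas(p, 0.40)   # underconfident report
print(truthful, over, under)       # truthful should dominate both
```

A non-truthful reporter beating the truthful one in this simulation would be exactly the falsifying evidence described above.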

Figures

Figures reproduced from arXiv: 2604.03216 by Anshul Thakur, Boyan Gao, David A. Clifton, Edward Phillips, Fredrik K. Gustafsson, Sean Wu.

Figure 1. We introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric […]
Figure 2. Relationship between model scale, predictive performance, and confidence reliability […]
Original abstract

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Behavioral Alignment Score (BAS), a decision-theoretic metric derived from an explicit answer-or-abstain utility model that aggregates realized utility across a continuum of risk thresholds. It claims to prove that truthful confidence estimates uniquely maximize expected BAS utility, relates BAS to proper scoring rules while highlighting its asymmetric penalty for overconfident errors, and reports a benchmark across LLMs and tasks showing that BAS distinguishes models with similar ECE/AURC values, that larger models tend to score higher, and that interventions like top-k elicitation improve reliability.

Significance. If the uniqueness result can be shown to hold beyond the specific parametric utility family, BAS would offer a principled link between calibration and downstream decision utility that standard symmetric metrics lack. The benchmark approach could help identify practically relevant differences in confidence reliability for abstention-aware applications.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.
  2. [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.
  3. [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.
minor comments (2)
  1. [§2] Notation for the risk-threshold continuum and the integration limits should be defined explicitly in the main text rather than deferred to an appendix.
  2. [Abstract, §4] The abstract mentions 'frontier models' without naming them; the benchmark section should list the exact models and tasks evaluated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of BAS as a decision-theoretic metric linking calibration to abstention decisions. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.

    Authors: We agree that explicit derivation steps and parameter details will improve accessibility. The uniqueness follows because any deviation from truthful confidence p* creates a positive-measure set of thresholds where the model answers when it should abstain (or vice versa), incurring negative expected utility. In the revision we will add a full step-by-step derivation in §3, defining the per-threshold utility as U(answer|correct)=+1, U(answer|incorrect)=-c (c>0), U(abstain)=0, then showing that the integral of expected utility over thresholds [0,1] is strictly maximized only at p*. We will also include a short sensitivity analysis demonstrating that uniqueness is preserved under any c>0 and under small additive perturbations to the utility values. revision: yes

  2. Referee: [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.

    Authors: We acknowledge the need for quantitative support. The revised §4 will include a main results table reporting exact BAS values (with 95% bootstrap confidence intervals) for every model-task pair, alongside ECE and AURC. We will add two new analyses: (i) stratification by task difficulty (binned by model accuracy) and (ii) regression controls for model size (log parameters) to isolate the contribution of confidence reliability. These additions will make the claim that BAS distinguishes models with similar ECE/AURC directly verifiable from the reported numbers. revision: yes

  3. Referee: [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.

    Authors: We will expand §3 with the requested explicit comparison. We will first write the BAS integrand as a threshold-dependent 0-1 loss with asymmetric cost, contrasting it with the symmetric -log(p) and -(1-p) terms of log loss. We will then supply a short proof that BAS and log-loss rankings can differ: for any confidence distribution containing overconfident errors above a critical threshold, the integral penalizes those errors more heavily than log loss, producing a strict ranking inversion. A small synthetic counter-example (two-point confidence distribution) will be added to illustrate the divergence numerically. revision: yes
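The promised two-point illustration can be sketched under an assumed instantiation of the BAS utility (answer iff q >= t; payoff 1 - t if correct, -t if wrong, integrated over t in [0, 1]), which yields per-item scores q - q²/2 for a correct answer and -q²/2 for a wrong one. One provable direction of divergence in this instantiation: log loss punishes an underconfident correct answer heavily, while this BAS form leaves it unpenalized, so the two scores can rank a pair of reports in opposite orders:

```python
import math

def bas_item(q, correct):
    """Per-item BAS under the assumed threshold-integrated utility:
    correct -> q - q**2/2, wrong -> -q**2/2 (bounded below by -0.5)."""
    return q - q**2 / 2 if correct else -q**2 / 2

def log_loss_item(q, correct):
    """Symmetric proper score: -ln(q) if correct, -ln(1-q) if wrong."""
    return -math.log(q) if correct else -math.log(1.0 - q)

# Report A: underconfident but right. Report B: moderately confident and wrong.
a = (0.1, True)
b = (0.5, False)

# Log loss prefers B (~0.693 < ~2.303); BAS prefers A (0.095 > -0.125):
# the two scores rank the same pair of reports in opposite orders.
print(log_loss_item(*a), log_loss_item(*b))
print(bas_item(*a), bas_item(*b))
```

Whether this matches the paper's own synthetic counter-example cannot be checked from the given text; it only demonstrates that a ranking inversion is possible for some utility in the family.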

Circularity Check

1 step flagged

BAS uniqueness result reduces to definitional property of the chosen utility model

specific steps
  1. self-definitional [Abstract]
    "BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds... We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior."

    BAS is constructed by integrating utility under the fixed answer-or-abstain model; therefore the statement that the confidence matching true probabilities (truthful) uniquely maximizes expected BAS is true by definition of the score, not an independent theoretical finding. Any other confidence estimator would yield lower expected utility inside this exact utility family by construction.

full rationale

The paper defines BAS directly from an explicit answer-or-abstain utility model (correct answer positive utility, incorrect negative, abstain zero) and then claims a theoretical result that truthful confidence uniquely maximizes expected BAS. This uniqueness holds by construction inside the chosen parametric family, as BAS is the integral of realized utility under that exact model. The abstract provides no independent derivation, external benchmark, or robustness check that would make the result non-tautological. No self-citations, fitted predictions, or imported uniqueness theorems are load-bearing in the given text.
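One way to probe whether the uniqueness result is substantive rather than definitional is to write out a concrete instantiation. The per-threshold payoffs below (1 - t for answering correctly, -t for answering wrongly, 0 for abstaining, with the model answering iff its report q >= t) are assumptions, not the paper's stated parameters:

```latex
% Expected BAS for true correctness probability p* and report q:
\[
G(q) \;=\; \int_0^1 \mathbf{1}[q \ge t]\,
  \bigl(p^*(1-t) - (1-p^*)\,t\bigr)\,\mathrm{d}t
\;=\; \int_0^q (p^* - t)\,\mathrm{d}t
\;=\; p^* q - \tfrac{1}{2}q^2 .
\]
% G'(q) = p^* - q and G''(q) = -1 < 0, so G is strictly concave with a
% unique maximizer at q = p^*: truthful reporting is strictly optimal.
```

Note that if the wrong-answer cost were instead a fixed -c independent of t, the integral would reduce to q(p* - c(1 - p*)), which is linear in q and maximized at a corner rather than at q = p*; some threshold dependence in the payoffs (or an equivalent reweighting of thresholds) appears to be load-bearing for the uniqueness claim, which sharpens rather than dissolves the circularity question.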

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central construction rests on one domain assumption (the answer-or-abstain utility model) and introduces one new entity (BAS itself). No free parameters are mentioned.

axioms (1)
  • domain assumption An explicit answer-or-abstain utility model exists that captures the relevant decision risks and preferences.
    BAS is derived from this model; the uniqueness result depends on it.
invented entities (1)
  • Behavioral Alignment Score (BAS) no independent evidence
    purpose: Aggregates realized utility across risk thresholds to measure decision-level reliability of confidence.
    Newly defined metric introduced in the paper.

pith-pipeline@v0.9.0 · 5608 in / 1180 out tokens · 37542 ms · 2026-05-13T19:44:24.946873+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  2. Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

    cs.CL 2026-04 conditional novelty 7.0

    Validity indices adapted from clinical assessment classify four frontier LLMs as construct-level invalid on metacognitive probes, with valid models showing positive item-sensitive confidence (r=.18) while invalid ones...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  5. [5]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  6. [6]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  7. [7]

    Rethinking the uncertainty: A critical review and analysis in the era of large language models, 2024

    Mohammad Beigi, Sijia Wang, Ying Shen, Zihao Lin, Adithya Kulkarni, Jianfeng He, Feng Chen, Ming Jin, Jin-Hee Cho, Dawei Zhou, Chang-Tien Lu, and Lifu Huang. Rethinking the uncertainty: A critical review and analysis in the era of large language models, 2024. URL https://arxiv.org/abs/2410.20199

  8. [8]

    Meditron-70b: Scaling medical pretraining for large language models,

    Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023

  9. [9]

    On optimum recognition error and reject tradeoff

    C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970

  10. [10]

    Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior

    Hoang Anh Dang, Vu Tran, and Le-Minh Nguyen. Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence, 8:1622292, 2025

  11. [11]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pp.\ arXiv--2407, 2024

  12. [12]

    On the foundations of noise-free selective classification

    Ran El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010

  13. [13]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  14. [14]

    Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis

    Farieda Gaber, Maqsood Shaik, Fabio Allega, Agnes Julia Bilecz, Felix Busch, Kelsey Goon, Vedran Franke, and Altuna Akalin. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 8(1):263, 2025

  15. [15]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017

  16. [16]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

  17. [17]

    Assessment of large language models (llms) in decision-making support for gynecologic oncology

    Khanisyah Erza Gumilar, Birama R Indraprasta, Ach Salman Faridzi, Bagus M Wibowo, Aditya Herlambang, Eccita Rahestyningtyas, Budi Irawan, Zulkarnain Tambunan, Ahmad Fadhli Bustomi, Bagus Ngurah Brahmantara, et al. Assessment of large language models (llms) in decision-making support for gynecologic oncology. Computational and Structural Biotechnology Jour...

  18. [18]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  19. [19]

    Evaluation and mitigation of the limitations of large language models in clinical decision-making

    Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9):2613–2622, 2024

  20. [20]

    Investigating uncertainty calibration of aligned language models under the multiple-choice setting

    Guande He, Peng Cui, Jianfei Chen, Wenbo Hu, and Jun Zhu. Investigating uncertainty calibration of aligned language models under the multiple-choice setting. arXiv preprint arXiv:2310.11732, 2023

  21. [21]

    Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning

    Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yuxuan Gu, Yangfan Ye, Liang Zhao, Weihong Zhong, Baoxin Wang, et al. Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

  22. [22]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025b. ISSN 1558-2868. doi:10.1145...

  23. [23]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  24. [24]

    Refusal tokens: A simple way to calibrate refusals in large language models

    Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, and Tom Goldstein. Refusal tokens: A simple way to calibrate refusals in large language models. arXiv preprint arXiv:2412.06748, 2024

  25. [25]

    Ai hallucinations can't be stopped—but these techniques can limit their damage

    Nicola Jones. Ai hallucinations can't be stopped—but these techniques can limit their damage. Nature, 637(8047):778–780, 2025

  26. [26]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  27. [27]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  28. [28]

    Large language models must be taught to know what they don’t know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know. Advances in Neural Information Processing Systems, 37:85932–85972, 2024

  29. [29]

    Abstentionbench: Reasoning llms fail on unanswerable questions

    Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J Bell. Abstentionbench: Reasoning llms fail on unanswerable questions. arXiv preprint arXiv:2506.09038, 2025

  30. [30]

    Semantic volume: Quantifying and detecting both external and internal uncertainty in llms, 2025

    Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, and Anurag Beniwal. Semantic volume: Quantifying and detecting both external and internal uncertainty in llms, 2025. URL https://arxiv.org/abs/2502.21239

  31. [31]

    Teaching models to express their uncertainty in words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022

  32. [32]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  33. [33]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026

  34. [34]

    Estimating llm uncertainty with logits

    Huan Ma, Jingdong Chen, Guangyu Wang, and Changqing Zhang. Estimating llm uncertainty with logits. arXiv e-prints, pp.\ arXiv--2502, 2025

  35. [35]

    Do llms know when to not answer? investigating abstention abilities of large language models

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to not answer? investigating abstention abilities of large language models. arXiv preprint arXiv:2407.16221, 2024

  36. [36]

    Reducing conversational agents’ overconfidence through linguistic calibration

    Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022

  37. [37]

    Proof or bluff? evaluating llms on 2025 usa math olympiad

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? evaluating llms on 2025 usa math olympiad. arXiv preprint arXiv:2503.21934, 2025

  38. [38]

    Geometric uncertainty for detecting and correcting hallucinations in llms

    Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, and David Clifton. Geometric uncertainty for detecting and correcting hallucinations in llms. arXiv preprint arXiv:2509.13813, 2025

  39. [39]

    Entropy alone is insufficient for safe selective prediction in LLMs

    Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, and David A. Clifton. Entropy alone is insufficient for safe selective prediction in llms, 2026 a . URL https://arxiv.org/abs/2603.21172

  40. [40]

    Semantic self-distillation for language model uncertainty

    Edward Phillips, Sean Wu, Boyan Gao, and David A Clifton. Semantic self-distillation for language model uncertainty. arXiv preprint arXiv:2602.04577, 2026 b

  41. [41]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  42. [42]

    Trust me, i'm wrong: High-certainty hallucinations in llms

    Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, i'm wrong: High-certainty hallucinations in llms. arXiv preprint arXiv:2502.12964, 2025

  43. [43]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025

  44. [44]

    Ai hallucination: towards a comprehensive classification of distorted information in artificial intelligence-generated content

    Yujie Sun, Dongfang Sheng, Zihan Zhou, and Yifei Wu. Ai hallucination: towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanities and Social Sciences Communications, 11(1):1–14, 2024

  45. [45]

    Confidence improves self-consistency in llms

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025

  46. [46]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

  47. [47]

    Towards generalist biomedical ai

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024

  48. [48]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

  49. [49]

    Truthrl: Incentivizing truthful llms via reinforcement learning

    Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, et al. Truthrl: Incentivizing truthful llms via reinforcement learning. arXiv preprint arXiv:2509.25760, 2025

  50. [50]

    Mitigating llm hallucination via behaviorally calibrated reinforcement learning

    Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, and Wenhao Huang. Mitigating llm hallucination via behaviorally calibrated reinforcement learning. arXiv preprint arXiv:2512.19920, 2025

  51. [51]

    Benchmarking open-source large language models, gpt-4 and claude 2 on multiple-choice questions in nephrology

    Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Zhe Fei, Fabien Scalzo, and Ira Kurtz. Benchmarking open-source large language models, gpt-4 and claude 2 on multiple-choice questions in nephrology. NEJM AI, 1(2):AIdbp2300092, 2024

  52. [52]

    Editing factual knowledge and explanatory ability of medical large language models

    Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Wanyu Wang, Yuyang Ye, Xiangyu Zhao, et al. Editing factual knowledge and explanatory ability of medical large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 2660–2670, 2024

  53. [53]

    Are reasoning models more prone to hallucination?

    Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. Are reasoning models more prone to hallucination? arXiv preprint arXiv:2505.23646, 2025

  54. [54]

    Do large language models know what they don’t know?

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuan-Jing Huang. Do large language models know what they don’t know? In Findings of the association for Computational Linguistics: ACL 2023, pp.\ 8653--8665, 2023

  55. [55]

    Cost-saving llm cascades with early abstention

    Michael J Zellinger, Rex Liu, and Matt Thomson. Cost-saving llm cascades with early abstention. arXiv preprint arXiv:2502.09054, 2025

  56. [56]

    A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need?

    Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need? arXiv preprint arXiv:2303.11717, 2023

  57. [57]

    Reasoning with reinforced functional token tuning

    Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, and Shunyu Liu. Reasoning with reinforced functional token tuning. arXiv preprint arXiv:2502.13389, 2025 a

  58. [58]

    TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

    Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, et al. Token-level uncertainty estimation for large language model reasoning. arXiv preprint arXiv:2505.11737, 2025 b