Recognition: 2 theorem links
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
Pith reviewed 2026-05-13 19:44 UTC · model grok-4.3
The pith
Truthful confidence estimates uniquely maximize expected utility for LLMs deciding when to answer or abstain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Behavioral Alignment Score aggregates realized utility from an answer-or-abstain model across a continuum of risk thresholds. Truthful confidence estimates uniquely maximize expected BAS utility. Unlike symmetric proper scoring rules such as log loss, BAS imposes a stronger penalty on overconfident errors than on underconfident ones. Empirical results show that models with similar ECE or AURC can differ markedly in BAS due to highly overconfident mistakes, and that interventions such as top-k elicitation and post-hoc calibration raise BAS values.
What carries the argument
The Behavioral Alignment Score (BAS), computed by integrating utility over risk thresholds from an answer-or-abstain decision model.
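A minimal sketch of how such a score could be computed, assuming the decision rule "answer when reported confidence ≥ t" and the threshold-indexed utility quoted later on this page (+1 for a correct answer, −t/(1−t) for a wrong one, 0 for abstention); the function names and the numeric grid are illustrative, not the paper's implementation.

```python
import numpy as np

def bas_per_item(conf: float, correct: bool, n_grid: int = 10_000) -> float:
    """Realized utility for one item, integrated over risk thresholds t in [0, 1).

    Assumed decision rule: answer iff reported confidence >= t.
    Assumed per-threshold utility (as quoted on this page):
    +1 if answered and correct, -t/(1-t) if answered and wrong, 0 if abstaining.
    """
    t = np.linspace(0.0, 1.0, n_grid, endpoint=False)  # stay below t = 1 to avoid the singularity
    answered = conf >= t
    utility = np.where(answered, 1.0 if correct else -t / (1.0 - t), 0.0)
    return float(utility.mean())  # Riemann-sum approximation of the integral over [0, 1)

def bas(confidences, corrects) -> float:
    """Mean integrated utility over a set of (confidence, correctness) pairs."""
    return float(np.mean([bas_per_item(c, y) for c, y in zip(confidences, corrects)]))

# Closed form under these assumptions: conf if correct, conf + ln(1 - conf) if wrong,
# so an overconfident error (conf near 1) is penalized far more than underconfidence.
print(bas([0.9, 0.6], [True, False]))  # ~ (0.9 + (0.6 + ln 0.4)) / 2 ~= 0.29
```

Under these assumptions the integral reduces to the U(s,Z) = s or s + ln(1−s) expression quoted in the Lean-theorem note below, which is where the asymmetric penalty on overconfident errors comes from.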
If this is right
- Larger and more accurate models tend to achieve higher BAS.
- Models with similar ECE or AURC can exhibit very different BAS because of highly overconfident errors.
- Even frontier models remain prone to severe overconfidence on some tasks.
- Top-k confidence elicitation and post-hoc calibration can meaningfully improve BAS.
- BAS reveals decision-useful differences in confidence that standard metrics miss.
Where Pith is reading between the lines
- BAS could be adapted to set automatic abstention thresholds in high-stakes domains where error costs are known in advance (a minimal threshold-setting sketch follows this list).
- The asymmetric penalty structure suggests that loss functions emphasizing overconfidence reduction may improve downstream decision performance more than symmetric calibration losses.
- Extending the utility model to sequential or multi-step tasks would allow BAS-style evaluation of confidence in agentic settings.
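On the first point above, a minimal sketch of how a known error cost could be mapped to an abstention threshold under the simple utility +1 for a correct answer, −c for a wrong one, and 0 for abstention; the rule is standard expected-utility reasoning within that family, and the 9:1 cost ratio and function names are invented for the example, not taken from the paper.

```python
def abstention_threshold(error_cost: float, correct_utility: float = 1.0) -> float:
    """Confidence level above which answering has non-negative expected utility.

    Under U(correct) = correct_utility, U(wrong) = -error_cost, U(abstain) = 0,
    answering is worthwhile iff p * correct_utility - (1 - p) * error_cost >= 0,
    i.e. iff p >= error_cost / (correct_utility + error_cost).
    """
    return error_cost / (correct_utility + error_cost)

def decide(confidence: float, error_cost: float) -> str:
    """Answer-or-abstain rule driven by a domain's known error cost (illustrative)."""
    return "answer" if confidence >= abstention_threshold(error_cost) else "abstain"

# Example: if a wrong answer costs 9x the value of a correct one (say, a triage setting),
# the system should abstain unless its (trusted) confidence is at least 0.9.
print(abstention_threshold(9.0))      # 0.9
print(decide(0.85, error_cost=9.0))   # abstain
print(decide(0.95, error_cost=9.0))   # answer
```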
Load-bearing premise
The chosen answer-or-abstain utility model accurately reflects the real costs and risk preferences that matter in downstream applications.
What would settle it
A model whose confidence estimates are systematically non-truthful yet produces strictly higher expected BAS than a truthful model under the same utility function would falsify the uniqueness claim.
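A small numeric probe of that test, assuming the per-item score implied by the utility quoted further down this page (conf for a correct answer, conf + ln(1−conf) for a wrong one): for each true correctness probability p, the expected score as a function of the reported confidence should peak only at the truthful report.

```python
import numpy as np

def expected_score(reported: np.ndarray, true_p: float) -> np.ndarray:
    """Expected per-item score when the true correctness probability is true_p and the
    model reports `reported`, assuming the score conf (correct) / conf + ln(1-conf) (wrong)."""
    return true_p * reported + (1.0 - true_p) * (reported + np.log(1.0 - reported))

s = np.linspace(0.001, 0.999, 999)
for p in (0.3, 0.7, 0.9):
    best = s[np.argmax(expected_score(s, p))]
    print(f"true p = {p:.1f}  ->  score-maximizing report ~= {best:.3f}")
# A systematically non-truthful reporter achieving a strictly higher expected score than
# the truthful one, under this same utility, is what would falsify the uniqueness claim.
```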
Original abstract
Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Behavioral Alignment Score (BAS), a decision-theoretic metric derived from an explicit answer-or-abstain utility model that aggregates realized utility across a continuum of risk thresholds. It claims to prove that truthful confidence estimates uniquely maximize expected BAS utility, relates BAS to proper scoring rules while highlighting its asymmetric penalty for overconfident errors, and reports a benchmark across LLMs and tasks showing that BAS distinguishes models with similar ECE/AURC values, that larger models tend to score higher, and that interventions like top-k elicitation improve reliability.
Significance. If the uniqueness result can be shown to hold beyond the specific parametric utility family, BAS would offer a principled link between calibration and downstream decision utility that standard symmetric metrics lack. The benchmark approach could help identify practically relevant differences in confidence reliability for abstention-aware applications.
major comments (3)
- [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.
- [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.
- [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.
minor comments (2)
- [§2] Notation for the risk-threshold continuum and the integration limits should be defined explicitly in the main text rather than deferred to an appendix.
- [Abstract, §4] The abstract mentions 'frontier models' without naming them; the benchmark section should list the exact models and tasks evaluated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential value of BAS as a decision-theoretic metric linking calibration to abstention decisions. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.
Point-by-point responses
Referee: [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.
Authors: We agree that explicit derivation steps and parameter details will improve accessibility. The uniqueness follows because any deviation from the truthful confidence p* creates a positive-measure set of thresholds at which the model answers when it should abstain (or vice versa), incurring an expected-utility loss. In the revision we will add a full step-by-step derivation in §3, defining the per-threshold utility as U(answer|correct) = +1, U(answer|incorrect) = −t/(1−t) at threshold t, and U(abstain) = 0, then showing that the integral of expected utility over thresholds in [0,1] is strictly maximized only at p*. We will also include a short sensitivity analysis showing how the maximizer shifts when the error penalty is rescaled by a constant c > 0 or the utility values receive small additive perturbations. revision: yes
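For reference, a sketch of that derivation under the threshold-indexed penalty −t/(1−t) (the form quoted in the Lean-theorem note further down this page), assuming the rule "answer iff the reported confidence s exceeds the threshold t"; under this exact cost the first-order condition lands at the truthful report, whereas a rescaled penalty moves the maximizer, so the parameterization is load-bearing.

```latex
% Sketch, assuming the per-threshold utility quoted on this page:
%   +1 (answer, correct),  -t/(1-t) (answer, wrong),  0 (abstain),
% and the decision rule "answer iff reported confidence s >= t".
\[
G(s, p) \;=\; \int_0^1 \mathbf{1}\{s \ge t\}\Bigl(p - (1-p)\,\tfrac{t}{1-t}\Bigr)\,dt
        \;=\; s + (1-p)\ln(1-s),
\]
\[
\frac{\partial G}{\partial s} \;=\; 1 - \frac{1-p}{1-s} \;=\; 0
\;\Longleftrightarrow\; s = p,
\qquad
\frac{\partial^2 G}{\partial s^2} \;=\; -\frac{1-p}{(1-s)^2} \;<\; 0,
\]
% so G is strictly concave in s and uniquely maximized at the truthful report s = p
% for every p in (0, 1); rescaling the penalty to -c*t/(1-t) keeps a unique maximizer
% but moves it to s = p / (p + c(1-p)), which equals p only when c = 1.
```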
Referee: [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.
Authors: We acknowledge the need for quantitative support. The revised §4 will include a main results table reporting exact BAS values (with 95% bootstrap confidence intervals) for every model-task pair, alongside ECE and AURC. We will add two new analyses: (i) stratification by task difficulty (binned by model accuracy) and (ii) regression controls for model size (log parameters) to isolate the contribution of confidence reliability. These additions will make the claim that BAS distinguishes models with similar ECE/AURC directly verifiable from the reported numbers. revision: yes
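A minimal sketch of the kind of percentile bootstrap interval the revision promises, assuming the closed-form per-item score quoted on this page (conf for a correct answer, conf + ln(1−conf) for a wrong one); the resampling scheme, toy data, and all names are illustrative rather than the authors' protocol.

```python
import numpy as np

def per_item_score(conf: np.ndarray, correct: np.ndarray) -> np.ndarray:
    """Per-item score under the utility quoted on this page:
    conf if correct, conf + ln(1 - conf) if wrong."""
    return np.where(correct, conf, conf + np.log(1.0 - conf))

def bas_bootstrap_ci(conf, correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item score (one model, one task)."""
    rng = np.random.default_rng(seed)
    scores = per_item_score(np.asarray(conf, float), np.asarray(correct, bool))
    n = scores.size
    means = np.array([scores[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return scores.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Toy usage: 200 items with synthetic confidences and roughly calibrated correctness.
rng = np.random.default_rng(1)
conf = rng.uniform(0.05, 0.95, size=200)
correct = rng.random(200) < conf
print(bas_bootstrap_ci(conf, correct))  # (point estimate, lower, upper)
```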
Referee: [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.
Authors: We will expand §3 with the requested explicit comparison. We will first write the BAS integrand as a threshold-dependent 0-1 loss with asymmetric cost, contrasting it with the symmetric -log(p) and -(1-p) terms of log loss. We will then supply a short proof that BAS and log-loss rankings can differ: for any confidence distribution containing overconfident errors above a critical threshold, the integral penalizes those errors more heavily than log loss, producing a strict ranking inversion. A small synthetic counter-example (two-point confidence distribution) will be added to illustrate the divergence numerically. revision: yes
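A synthetic two-item illustration of the promised counter-example (numbers invented for this page, not taken from the paper): hypothetical model A is near-certain on its correct answer but overconfident on its single error, while hypothetical model B is underconfident on its correct answer and hedged on its error. Log loss prefers A; the BAS-style per-item score quoted on this page prefers B.

```python
import numpy as np

def bas_score(conf, correct):
    """Mean per-item score under the utility quoted on this page (higher is better)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    return np.where(correct, conf, conf + np.log(1.0 - conf)).mean()

def log_loss(conf, correct):
    """Mean negative log-likelihood (lower is better)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    return -np.where(correct, np.log(conf), np.log(1.0 - conf)).mean()

# Hypothetical model A: near-certain on its correct answer, overconfident on its error.
# Hypothetical model B: very underconfident on its correct answer, hedged on its error.
A_conf, A_correct = [0.99, 0.90], [True, False]
B_conf, B_correct = [0.05, 0.50], [True, False]

print("log loss  A =", round(log_loss(A_conf, A_correct), 3), " B =", round(log_loss(B_conf, B_correct), 3))
print("BAS-style A =", round(bas_score(A_conf, A_correct), 3), " B =", round(bas_score(B_conf, B_correct), 3))
# A wins on log loss (1.156 < 1.844) but loses on the BAS-style score (-0.206 < -0.072):
# A's overconfident error dominates under BAS, while B's underconfidence is penalized only mildly.
```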
Circularity Check
The BAS uniqueness result reduces to a definitional property of the chosen utility model
specific steps
- self-definitional [Abstract]: "BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds... We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior."
BAS is constructed by integrating utility under the fixed answer-or-abstain model; the statement that truthful confidence (confidence matching the true probabilities) uniquely maximizes expected BAS is therefore true by definition of the score, not an independent theoretical finding. Any other confidence estimator would, by construction, yield lower expected utility within this exact utility family.
full rationale
The paper defines BAS directly from an explicit answer-or-abstain utility model (correct answer positive utility, incorrect negative, abstain zero) and then claims a theoretical result that truthful confidence uniquely maximizes expected BAS. This uniqueness holds by construction inside the chosen parametric family, as BAS is the integral of realized utility under that exact model. The abstract provides no independent derivation, external benchmark, or robustness check that would make the result non-tautological. No self-citations, fitted predictions, or imported uniqueness theorems are load-bearing in the given text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: an explicit answer-or-abstain utility model exists that captures the relevant decision risks and preferences.
invented entities (1)
- Behavioral Alignment Score (BAS): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · contradicts: "We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility... S_t(Z,a) = 1 (correct), −t/(1−t) (incorrect), 0 (abstain); U(s,Z) = s or s + ln(1−s)"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear: "BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds"
Forward citations
Cited by 2 Pith papers
- Task-Aware Calibration: Provably Optimal Decoding in LLMs. Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
- Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report. Validity indices adapted from clinical assessment classify four frontier LLMs as construct-level invalid on metacognitive probes, with valid models showing positive item-sensitive confidence (r=.18) while invalid ones...