pith. machine review for the scientific record.

arxiv: 2604.03216 · v1 · submitted 2026-04-03 · 💻 cs.CL

Recognition: 2 theorem links


BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Anshul Thakur, Boyan Gao, David A. Clifton, Edward Phillips, Fredrik K. Gustafsson, Sean Wu

Pith reviewed 2026-05-13 19:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords Behavioral Alignment Score · LLM confidence · abstention · decision theory · calibration · overconfidence · proper scoring rules · evaluation metrics

The pith

Truthful confidence estimates uniquely maximize expected utility for LLMs deciding when to answer or abstain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often give confident but wrong answers when abstaining would reduce risk. The paper introduces the Behavioral Alignment Score to evaluate how well confidence guides answer-or-abstain choices under different risk preferences. BAS comes from an explicit utility model that rewards correct answers and penalizes mistakes asymmetrically based on reported confidence. The central theoretical result is that only truthful confidence achieves the highest possible expected BAS across risk thresholds. This matters because common metrics such as expected calibration error treat over- and under-confidence symmetrically and miss the practical cost of overconfident errors.

Core claim

The Behavioral Alignment Score aggregates realized utility from an answer-or-abstain model across a continuum of risk thresholds. Truthful confidence estimates uniquely maximize expected BAS utility. Unlike symmetric proper scoring rules such as log loss, BAS imposes a stronger penalty on overconfident errors than on underconfident ones. Empirical results show that models with similar ECE or AURC can differ markedly in BAS due to highly overconfident mistakes, and that interventions such as top-k elicitation and post-hoc calibration raise BAS values.

What carries the argument

The Behavioral Alignment Score (BAS), computed by integrating utility over risk thresholds from an answer-or-abstain decision model.
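The threshold-integrated construction can be sketched directly. The answer-iff-q-at-least-t rule and the (1 - t, -t) payoff pair below are assumptions for illustration, not the paper's exact utility parameters:

```python
def bas(confidences, correct, n_thresholds=1000):
    """Sketch of a Behavioral Alignment Score under an assumed utility:
    at each risk threshold t in [0, 1] the model answers iff its reported
    confidence q >= t, earning 1 - t for a correct answer, -t for a wrong
    one, and 0 for abstaining. Realized utility is averaged over items
    and integrated over t."""
    total = 0.0
    for i in range(n_thresholds):
        t = i / (n_thresholds - 1)           # uniform grid over [0, 1]
        u = 0.0
        for q, y in zip(confidences, correct):
            if q >= t:                       # answer; otherwise abstain (utility 0)
                u += (1.0 - t) if y else -t  # asymmetric, risk-indexed payoff
        total += u / len(confidences)
    return total / n_thresholds              # Riemann average over thresholds

# An overconfident error (q = 0.99, wrong) costs far more under this score
# than an underconfident correct answer (q = 0.2, right) forgoes.
print(bas([0.9, 0.2], [True, True]))
print(bas([0.9, 0.99], [True, False]))
```

Under this instantiation the per-item score has a closed form (q - q²/2 if correct, -q²/2 if wrong), so the wrong-answer penalty is bounded by -0.5 but grows quadratically with the reported confidence.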

If this is right

  • Larger and more accurate models tend to achieve higher BAS.
  • Models with similar ECE or AURC can exhibit very different BAS because of highly overconfident errors.
  • Even frontier models remain prone to severe overconfidence on some tasks.
  • Top-k confidence elicitation and post-hoc calibration can meaningfully improve BAS.
  • BAS reveals decision-useful differences in confidence that standard metrics miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • BAS could be adapted to set automatic abstention thresholds in high-stakes domains where error costs are known in advance.
  • The asymmetric penalty structure suggests that loss functions emphasizing overconfidence reduction may improve downstream decision performance more than symmetric calibration losses.
  • Extending the utility model to sequential or multi-step tasks would allow BAS-style evaluation of confidence in agentic settings.

Load-bearing premise

The chosen answer-or-abstain utility model accurately reflects the real costs and risk preferences that matter in downstream applications.

What would settle it

A model with systematically non-truthful confidence estimates that nonetheless achieved strictly higher expected BAS than a truthful model under the same utility function would falsify the uniqueness claim.
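That test can be run numerically under one assumed instantiation of the answer-or-abstain utility (answer iff the reported confidence q is at least the risk threshold t; payoff 1 - t if correct, -t if wrong, 0 on abstention; t uniform on [0, 1]); the payoffs are illustrative assumptions, not the paper's stated parameters. Truthful reporting should come out strictly ahead of any distorted report:

```python
import random

def expected_bas(p_true, q_report, n=200_000, seed=0):
    """Monte Carlo estimate of expected per-item BAS under the assumed
    utility. Closed form in this instantiation: E[BAS] = p*q - q**2/2,
    uniquely maximized at q = p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        correct = rng.random() < p_true
        t = rng.random()                  # risk threshold ~ Uniform[0, 1]
        if q_report >= t:                 # answer; otherwise abstain (0)
            total += (1.0 - t) if correct else -t
    return total / n

p = 0.7
truthful = expected_bas(p, 0.70)
over     = expected_bas(p, 0.95)   # overconfident report
under    = expected_bas(p, 0.40)   # underconfident report
print(truthful, over, under)       # truthful should dominate both
```

A non-truthful reporter beating the truthful one in this simulation would be exactly the falsifying evidence described above.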

Figures

Figures reproduced from arXiv: 2604.03216 by Anshul Thakur, Boyan Gao, David A. Clifton, Edward Phillips, Fredrik K. Gustafsson, Sean Wu.

Figure 1. We introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric […]
Figure 2. Relationship between model scale, predictive performance, and confidence reliability […]
Original abstract

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Behavioral Alignment Score (BAS), a decision-theoretic metric derived from an explicit answer-or-abstain utility model that aggregates realized utility across a continuum of risk thresholds. It claims to prove that truthful confidence estimates uniquely maximize expected BAS utility, relates BAS to proper scoring rules while highlighting its asymmetric penalty for overconfident errors, and reports a benchmark across LLMs and tasks showing that BAS distinguishes models with similar ECE/AURC values, that larger models tend to score higher, and that interventions like top-k elicitation improve reliability.

Significance. If the uniqueness result can be shown to hold beyond the specific parametric utility family, BAS would offer a principled link between calibration and downstream decision utility that standard symmetric metrics lack. The benchmark approach could help identify practically relevant differences in confidence reliability for abstention-aware applications.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.
  2. [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.
  3. [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.
minor comments (2)
  1. [§2] Notation for the risk-threshold continuum and the integration limits should be defined explicitly in the main text rather than deferred to an appendix.
  2. [Abstract, §4] The abstract mentions 'frontier models' without naming them; the benchmark section should list the exact models and tasks evaluated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of BAS as a decision-theoretic metric linking calibration to abstention decisions. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (theoretical result): the claim that truthful confidence 'uniquely maximizes' expected BAS utility follows directly from the construction of BAS as the integral of utility under the fixed answer-or-abstain model (correct = positive, incorrect = negative, abstain = zero); the manuscript must supply the explicit derivation steps and the utility function parameters to allow assessment of whether uniqueness survives perturbations in the relative cost of false positives versus false negatives.

    Authors: We agree that explicit derivation steps and parameter details will improve accessibility. The uniqueness follows because any deviation from truthful confidence p* creates a positive-measure set of thresholds where the model answers when it should abstain (or vice versa), incurring negative expected utility. In the revision we will add a full step-by-step derivation in §3, defining the per-threshold utility as U(answer|correct)=+1, U(answer|incorrect)=-c (c>0), U(abstain)=0, then showing that the integral of expected utility over thresholds [0,1] is strictly maximized only at p*. We will also include a short sensitivity analysis demonstrating that uniqueness is preserved under any c>0 and under small additive perturbations to the utility values. revision: yes

  2. Referee: [§4] §4 (benchmark): the text states that models with similar ECE or AURC exhibit 'very different BAS' due to overconfident errors, yet provides no quantitative BAS values, confidence intervals, or controls for task difficulty and model size; without these numbers the claim that BAS reveals limitations of standard metrics cannot be evaluated.

    Authors: We acknowledge the need for quantitative support. The revised §4 will include a main results table reporting exact BAS values (with 95% bootstrap confidence intervals) for every model-task pair, alongside ECE and AURC. We will add two new analyses: (i) stratification by task difficulty (binned by model accuracy) and (ii) regression controls for model size (log parameters) to isolate the contribution of confidence reliability. These additions will make the claim that BAS distinguishes models with similar ECE/AURC directly verifiable from the reported numbers. revision: yes

  3. Referee: [§3] §3 (relation to proper scoring rules): the structural difference from log loss is asserted (asymmetric penalty for overconfidence), but no explicit comparison of the scoring functions or proof that BAS rankings differ from log-loss rankings on the same confidence distributions is given.

    Authors: We will expand §3 with the requested explicit comparison. We will first write the BAS integrand as a threshold-dependent 0-1 loss with asymmetric cost, contrasting it with the symmetric -log(p) and -(1-p) terms of log loss. We will then supply a short proof that BAS and log-loss rankings can differ: for any confidence distribution containing overconfident errors above a critical threshold, the integral penalizes those errors more heavily than log loss, producing a strict ranking inversion. A small synthetic counter-example (two-point confidence distribution) will be added to illustrate the divergence numerically. revision: yes
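The promised two-point illustration can be sketched under an assumed instantiation of the BAS utility (answer iff q >= t; payoff 1 - t if correct, -t if wrong, integrated over t in [0, 1]), which yields per-item scores q - q²/2 for a correct answer and -q²/2 for a wrong one. One provable direction of divergence in this instantiation: log loss punishes an underconfident correct answer heavily, while this BAS form leaves it unpenalized, so the two scores can rank a pair of reports in opposite orders:

```python
import math

def bas_item(q, correct):
    """Per-item BAS under the assumed threshold-integrated utility:
    correct -> q - q**2/2, wrong -> -q**2/2 (bounded below by -0.5)."""
    return q - q**2 / 2 if correct else -q**2 / 2

def log_loss_item(q, correct):
    """Symmetric proper score: -ln(q) if correct, -ln(1-q) if wrong."""
    return -math.log(q) if correct else -math.log(1.0 - q)

# Report A: underconfident but right. Report B: moderately confident and wrong.
a = (0.1, True)
b = (0.5, False)

# Log loss prefers B (~0.693 < ~2.303); BAS prefers A (0.095 > -0.125):
# the two scores rank the same pair of reports in opposite orders.
print(log_loss_item(*a), log_loss_item(*b))
print(bas_item(*a), bas_item(*b))
```

Whether this matches the paper's own synthetic counter-example cannot be checked from the given text; it only demonstrates that a ranking inversion is possible for some utility in the family.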

Circularity Check

1 step flagged

BAS uniqueness result reduces to definitional property of the chosen utility model

specific steps
  1. self-definitional [Abstract]
    "BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds... We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior."

    BAS is constructed by integrating utility under the fixed answer-or-abstain model; therefore the statement that the confidence matching true probabilities (truthful) uniquely maximizes expected BAS is true by definition of the score, not an independent theoretical finding. Any other confidence estimator would yield lower expected utility inside this exact utility family by construction.

full rationale

The paper defines BAS directly from an explicit answer-or-abstain utility model (correct answer positive utility, incorrect negative, abstain zero) and then claims a theoretical result that truthful confidence uniquely maximizes expected BAS. This uniqueness holds by construction inside the chosen parametric family, as BAS is the integral of realized utility under that exact model. The abstract provides no independent derivation, external benchmark, or robustness check that would make the result non-tautological. No self-citations, fitted predictions, or imported uniqueness theorems are load-bearing in the given text.
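One way to probe whether the uniqueness result is substantive rather than definitional is to write out a concrete instantiation. The per-threshold payoffs below (1 - t for answering correctly, -t for answering wrongly, 0 for abstaining, with the model answering iff its report q >= t) are assumptions, not the paper's stated parameters:

```latex
% Expected BAS for true correctness probability p* and report q:
\[
G(q) \;=\; \int_0^1 \mathbf{1}[q \ge t]\,
  \bigl(p^*(1-t) - (1-p^*)\,t\bigr)\,\mathrm{d}t
\;=\; \int_0^q (p^* - t)\,\mathrm{d}t
\;=\; p^* q - \tfrac{1}{2}q^2 .
\]
% G'(q) = p^* - q and G''(q) = -1 < 0, so G is strictly concave with a
% unique maximizer at q = p^*: truthful reporting is strictly optimal.
```

Note that if the wrong-answer cost were instead a fixed -c independent of t, the integral would reduce to q(p* - c(1 - p*)), which is linear in q and maximized at a corner rather than at q = p*; some threshold dependence in the payoffs (or an equivalent reweighting of thresholds) appears to be load-bearing for the uniqueness claim, which sharpens rather than dissolves the circularity question.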

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central construction rests on one domain assumption (the answer-or-abstain utility model) and introduces one new entity (BAS itself). No free parameters are mentioned.

axioms (1)
  • domain assumption An explicit answer-or-abstain utility model exists that captures the relevant decision risks and preferences.
    BAS is derived from this model; the uniqueness result depends on it.
invented entities (1)
  • Behavioral Alignment Score (BAS) no independent evidence
    purpose: Aggregates realized utility across risk thresholds to measure decision-level reliability of confidence.
    Newly defined metric introduced in the paper.

pith-pipeline@v0.9.0 · 5608 in / 1180 out tokens · 37542 ms · 2026-05-13T19:44:24.946873+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  2. Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

    cs.CL 2026-04 conditional novelty 7.0

    Validity indices adapted from clinical assessment classify four frontier LLMs as construct-level invalid on metacognitive probes, with valid models showing positive item-sensitive confidence (r=.18) while invalid ones...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  5. [5]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  6. [6]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  7. [7]

    Rethinking the uncertainty: A critical review and analysis in the era of large language models, 2024

    Mohammad Beigi, Sijia Wang, Ying Shen, Zihao Lin, Adithya Kulkarni, Jianfeng He, Feng Chen, Ming Jin, Jin-Hee Cho, Dawei Zhou, Chang-Tien Lu, and Lifu Huang. Rethinking the uncertainty: A critical review and analysis in the era of large language models, 2024. URL https://arxiv.org/abs/2410.20199

  8. [8]

    Meditron-70b: Scaling medical pretraining for large language models,

    Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023

  9. [9]

    On optimum recognition error and reject tradeoff

    C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970

  10. [10]

    Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior

    Hoang Anh Dang, Vu Tran, and Le-Minh Nguyen. Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence, 8:1622292, 2025

  11. [11]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pp.\ arXiv--2407, 2024

  12. [12]

    On the foundations of noise-free selective classification

    Ran El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010

  13. [13]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  14. [14]

    Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis

    Farieda Gaber, Maqsood Shaik, Fabio Allega, Agnes Julia Bilecz, Felix Busch, Kelsey Goon, Vedran Franke, and Altuna Akalin. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 8(1):263, 2025

  15. [15]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017

  16. [16]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

  17. [17]

    Assessment of large language models (llms) in decision-making support for gynecologic oncology

    Khanisyah Erza Gumilar, Birama R Indraprasta, Ach Salman Faridzi, Bagus M Wibowo, Aditya Herlambang, Eccita Rahestyningtyas, Budi Irawan, Zulkarnain Tambunan, Ahmad Fadhli Bustomi, Bagus Ngurah Brahmantara, et al. Assessment of large language models (llms) in decision-making support for gynecologic oncology. Computational and Structural Biotechnology Jour...

  18. [18]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  19. [19]

    Evaluation and mitigation of the limitations of large language models in clinical decision-making

    Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9):2613–2622, 2024

  20. [20]

    Investigating uncertainty calibration of aligned language models under the multiple-choice setting

    Guande He, Peng Cui, Jianfei Chen, Wenbo Hu, and Jun Zhu. Investigating uncertainty calibration of aligned language models under the multiple-choice setting. arXiv preprint arXiv:2310.11732, 2023

  21. [21]

    Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning

    Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yuxuan Gu, Yangfan Ye, Liang Zhao, Weihong Zhong, Baoxin Wang, et al. Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

  22. [22]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025b. ISSN 1558-2868. doi:10.1145...

  23. [23]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  24. [24]

    Refusal tokens: A simple way to calibrate refusals in large language models

    Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, and Tom Goldstein. Refusal tokens: A simple way to calibrate refusals in large language models. arXiv preprint arXiv:2412.06748, 2024

  25. [25]

    Ai hallucinations can't be stopped—but these techniques can limit their damage

    Nicola Jones. Ai hallucinations can't be stopped—but these techniques can limit their damage. Nature, 637(8047):778–780, 2025

  26. [26]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  27. [27]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  28. [28]

    Large language models must be taught to know what they don’t know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know. Advances in Neural Information Processing Systems, 37:85932–85972, 2024

  29. [29]

    Abstentionbench: Reasoning llms fail on unanswerable questions

    Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J Bell. Abstentionbench: Reasoning llms fail on unanswerable questions. arXiv preprint arXiv:2506.09038, 2025

  30. [30]

    Semantic volume: Quantifying and detecting both external and internal uncertainty in llms, 2025

    Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Narayanan Sadagopan, and Anurag Beniwal. Semantic volume: Quantifying and detecting both external and internal uncertainty in llms, 2025. URL https://arxiv.org/abs/2502.21239

  31. [31]

    Teaching models to express their uncertainty in words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022

  32. [32]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  33. [33]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026

  34. [34]

    Estimating llm uncertainty with logits

    Huan Ma, Jingdong Chen, Guangyu Wang, and Changqing Zhang. Estimating llm uncertainty with logits. arXiv e-prints, pp.\ arXiv--2502, 2025

  35. [35]

    Do llms know when to not answer? investigating abstention abilities of large language models

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to not answer? investigating abstention abilities of large language models. arXiv preprint arXiv:2407.16221, 2024

  36. [36]

    Reducing conversational agents’ overconfidence through linguistic calibration

    Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022

  37. [37]

    Proof or bluff? evaluating llms on 2025 usa math olympiad

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? evaluating llms on 2025 usa math olympiad. arXiv preprint arXiv:2503.21934, 2025

  38. [38]

    Geometric uncertainty for detecting and correcting hallucinations in llms

    Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, and David Clifton. Geometric uncertainty for detecting and correcting hallucinations in llms. arXiv preprint arXiv:2509.13813, 2025

  39. [39]

    Entropy alone is insufficient for safe selective prediction in LLMs

    Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, and David A. Clifton. Entropy alone is insufficient for safe selective prediction in llms, 2026 a . URL https://arxiv.org/abs/2603.21172

  40. [40]

    Semantic self-distillation for language model uncertainty

    Edward Phillips, Sean Wu, Boyan Gao, and David A Clifton. Semantic self-distillation for language model uncertainty. arXiv preprint arXiv:2602.04577, 2026 b

  41. [41]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  42. [42]

    Trust me, i'm wrong: High-certainty hallucinations in llms

    Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, i'm wrong: High-certainty hallucinations in llms. arXiv preprint arXiv:2502.12964, 2025

  43. [43]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025

  44. [44]

    Ai hallucination: towards a comprehensive classification of distorted information in artificial intelligence-generated content

    Yujie Sun, Dongfang Sheng, Zihan Zhou, and Yifei Wu. Ai hallucination: towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanities and Social Sciences Communications, 11(1):1–14, 2024

  45. [45]

    Confidence improves self-consistency in llms

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025

  46. [46]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

  47. [47]

    Towards generalist biomedical ai

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024

  48. [48]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

  49. [49]

    Truthrl: Incentivizing truthful llms via reinforcement learning

    Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, et al. Truthrl: Incentivizing truthful llms via reinforcement learning. arXiv preprint arXiv:2509.25760, 2025

  50. [50]

    Mitigating llm hallucination via behaviorally calibrated reinforcement learning

    Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, and Wenhao Huang. Mitigating llm hallucination via behaviorally calibrated reinforcement learning. arXiv preprint arXiv:2512.19920, 2025

  51. [51]

    Benchmarking open-source large language models, gpt-4 and claude 2 on multiple-choice questions in nephrology

    Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Zhe Fei, Fabien Scalzo, and Ira Kurtz. Benchmarking open-source large language models, gpt-4 and claude 2 on multiple-choice questions in nephrology. NEJM AI, 1(2):AIdbp2300092, 2024

  52. [52]

    Editing factual knowledge and explanatory ability of medical large language models

    Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Wanyu Wang, Yuyang Ye, Xiangyu Zhao, et al. Editing factual knowledge and explanatory ability of medical large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 2660–2670, 2024

  53. [53]

    Are reasoning models more prone to hallucination?

    Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. Are reasoning models more prone to hallucination? arXiv preprint arXiv:2505.23646, 2025

  54. [54]

    Do large language models know what they don’t know?

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuan-Jing Huang. Do large language models know what they don’t know? In Findings of the association for Computational Linguistics: ACL 2023, pp.\ 8653--8665, 2023

  55. [55]

    Cost-saving llm cascades with early abstention

    Michael J Zellinger, Rex Liu, and Matt Thomson. Cost-saving llm cascades with early abstention. arXiv preprint arXiv:2502.09054, 2025

  56. [56]

    A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need?

    Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need? arXiv preprint arXiv:2303.11717, 2023

  57. [57]

    Reasoning with reinforced functional token tuning

    Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, and Shunyu Liu. Reasoning with reinforced functional token tuning. arXiv preprint arXiv:2502.13389, 2025 a

  58. [58]

    TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

    Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, et al. Token-level uncertainty estimation for large language model reasoning. arXiv preprint arXiv:2505.11737, 2025 b