$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

Erik Ernst; Lautaro Estienne; Luciana Ferrer; Matías Vera; Pablo Piantanida

arxiv: 2605.20490 · v2 · pith:IP7H37WAnew · submitted 2026-05-19 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-22 08:51 UTC · model grok-4.3

keywords: uncertainty quantification · proper scoring rules · evaluation metrics · uncertainty-augmented systems · decision making under uncertainty · classification · question answering

The pith

ECUAS_n metrics evaluate uncertainty-augmented systems as proper scoring rules that balance prediction errors and uncertainty quality via one tunable parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluation of uncertainty-augmented systems often splits predictions and uncertainties into separate scores, fixes rejection costs arbitrarily, or integrates over coverage-risk curves. The paper argues these approaches fail to assess the system's overall value for downstream decisions where uncertainty guides accept-or-reject choices. It introduces the ECUAS_n family of metrics, each a proper scoring rule for the task at hand, with n setting the relative penalty for wrong predictions versus imperfect uncertainty estimates. A sympathetic reader would care because high-stakes applications need one number that directly reflects decision utility rather than a collection of proxy scores. The authors support the claim with theoretical properties of proper scoring rules and experiments on classification and generation datasets including a human-annotated TriviaQA subset.
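To make the coverage-risk baseline concrete, here is a minimal sketch of how such a curve and its area are commonly computed. The function names and the toy data are illustrative assumptions, and the simple mean over coverage levels is only one of several discretizations in use; none of this is taken from the paper.

```python
def risk_coverage_curve(confidence, correct):
    """Accept the k most confident items for k = 1..N and report the error rate among them."""
    # Sort item indices from most to least confident (ties keep input order).
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    coverage, risk, errors = [], [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        coverage.append(k / len(order))   # fraction of items accepted
        risk.append(errors / k)           # error rate among accepted items
    return coverage, risk


def aurc(confidence, correct):
    """Average risk over all coverage levels (one simple discrete area estimate)."""
    _, risk = risk_coverage_curve(confidence, correct)
    return sum(risk) / len(risk)


conf = [0.9, 0.8, 0.6, 0.3]   # hypothetical confidence scores
corr = [1, 1, 0, 1]           # hypothetical correctness labels
print(aurc(conf, corr))
# An order-preserving rescaling of the confidences leaves the curve unchanged.
print(aurc([c ** 3 for c in conf], corr) == aurc(conf, corr))
```

Because the curve depends only on the ordering of the confidences, a system with badly scaled but correctly ranked uncertainties scores the same as a well-calibrated one, which is one reason a curve-based summary can miss the quality of the uncertainty values themselves.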

Core claim

The ECUAS_n family of metrics, formulated as proper scoring rules for the task of interest, provides a more adequate assessment of the overall performance of uncertainty-augmented systems for decision making under uncertainty than current approaches using separate metrics, fixed rejection costs, or coverage-risk curves. The parameter n controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case.
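For reference, propriety can be stated compactly. The block below is a minimal sketch in cost orientation (lower is better), assuming a generic cost $C(y, q)$ for outcome $y$ and reported distribution $q$; the symbols $p$ and $\mathcal{P}$ are notation introduced here for illustration, not taken from the paper.

```latex
% C is proper if reporting the true distribution p minimizes the expected cost
\mathbb{E}_{y \sim p}\,[\,C(y, p)\,] \;\le\; \mathbb{E}_{y \sim p}\,[\,C(y, q)\,]
\qquad \text{for all } p, q \in \mathcal{P},
% and strictly proper if equality holds only when q = p.
```

Under this definition a system cannot lower its expected score by reporting uncertainties it does not believe, which is the property the core claim relies on.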

What carries the argument

The ECUAS_n metric, a parameterized proper scoring rule that combines prediction accuracy and uncertainty quality into one score, with n controlling the relative cost of errors versus bad uncertainty estimates.

If this is right

UA systems can be ranked and selected for a concrete use-case simply by picking the n that matches its cost structure.
Differences in system quality that are invisible to separate accuracy and uncertainty metrics become visible in the combined score.
Training or post-processing choices can be guided by direct optimization toward the metric that will be used at deployment.
Comparisons across papers become more reproducible when authors report ECUAS_n at the n values relevant to common decision settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If widely adopted, ECUAS_n could shift model development away from maximizing accuracy alone toward explicitly optimizing the uncertainty that supports downstream decisions.
The approach could be extended to regression or structured prediction tasks by redefining the base proper scoring rule while keeping the same n-controlled trade-off structure.
One could measure whether models trained to minimize ECUAS_n at a target n actually improve real-world utility on a held-out decision policy compared with models trained on standard losses.

Load-bearing premise

A single tunable parameter n can meaningfully capture application-specific cost trade-offs between incorrect predictions and imperfect uncertainties without requiring additional validation or introducing new selection biases in practice.

What would settle it

Run a decision task with known, application-specific rejection costs on held-out data; if the system ranked best by ECUAS_n for the matching n does not produce the highest expected utility when users reject according to its uncertainty scores, the claim of superior adequacy would be falsified.
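The check described above can be sketched as follows, assuming a simple cost structure that is not specified on this page: accepting a correct answer costs 0, accepting an incorrect one costs 1, and rejecting costs `reject_cost`. The function names, the two toy systems and the accept rule (accept when the expected error cost `1 - q` does not exceed the rejection cost) are hypothetical illustrations, not the paper's protocol.

```python
def realized_cost(confidence, correct, reject_cost):
    """Average cost when each answer is accepted iff its expected error cost is at most reject_cost."""
    total = 0.0
    for q, c in zip(confidence, correct):
        if 1.0 - q <= reject_cost:
            total += 0.0 if c else 1.0   # accepted: pay 1 only if the answer is wrong
        else:
            total += reject_cost         # rejected: pay the fixed rejection cost
    return total / len(confidence)


def best_system(systems, reject_cost):
    """Name of the system with the lowest realized cost at this rejection cost."""
    return min(systems, key=lambda name: realized_cost(*systems[name], reject_cost))


# Hypothetical systems: same answers, different confidence on the one wrong answer.
systems = {
    "A": ([0.95, 0.90, 0.60, 0.55], [1, 1, 0, 1]),
    "B": ([0.95, 0.90, 0.20, 0.55], [1, 1, 0, 1]),
}
print(best_system(systems, reject_cost=0.5))
```

With a rejection cost of 0.5, system B comes out ahead in this toy data because it rejects its one wrong answer. The test proposed above would compare this utility-based ranking against the ranking given by ECUAS_n at the matching n, which is not implemented here because the page does not give the metric's formula.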

Figures

Figures reproduced from arXiv: 2605.20490 by Erik Ernst, Lautaro Estienne, Luciana Ferrer, Matías Vera, Pablo Piantanida.

**Figure 1.** $C^*_n$ as a function of the confidence $q_e$, when candidate decisions are correct (solid lines) and incorrect (dashed lines), for different values of $n$, the parameter in $w$, and $K$, the number of classes.

3 Application of ECUAS to generative systems. An important family of UA systems is that based on generative models [30, 87, 58]. To use the ECUAS metrics in this scenario, we need to adapt the definition of $\tilde{C}$.…

**Figure 2.** $ECUAS_n$ values when temperature scaling is applied to the calibrated version of $q$ and the candidate answer is obtained by sampling from the resulting distribution.

Our evaluation spans multiple state-of-the-art small LLMs, Qwen 3.5 (4B and 9B) [75], GLM-4.6VFlash [76], Ministral-3-8B-Instruct-2512 [56], as well as larger models from the Gemini 2.5 family (Flash Lite, Flash and Pro) [12]. We evaluate these …

Original abstract

In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -- to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems -- i.e., systems that output both predictions and uncertainty scores -- are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, $ECUAS_n$, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the $ECUAS_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ECUAS_n gives a single proper scoring rule for uncertainty-augmented systems with one tunable n, but choosing that n may still need validation data and could reintroduce the tuning issues the paper wants to avoid.

The main thing to know is that this paper introduces ECUAS_n, a family of proper scoring rules that score an uncertainty-augmented system on both its predictions and its uncertainty estimates in one number, with the parameter n setting the relative cost between wrong answers and bad uncertainty reports. The authors argue this is better than running separate metrics, fixing a rejection cost, or integrating over coverage-risk curves, and they show the idea on classification and generation tasks including a hand-labeled TriviaQA subset.

What the work does well is name a real practical gap: when a system must support accept/reject decisions, splitting the evaluation misses how prediction quality and uncertainty quality interact for the actual downstream cost. Framing the whole thing as a proper scoring rule is a clean move because those rules have known calibration properties, and the experiments across several datasets give at least initial evidence that the metric behaves sensibly.

The soft spot is exactly the one the stress-test note flags. If setting n for a new use case still requires held-out data, expert tuning, or knowledge of the application costs, then the selection step itself can introduce bias or extra validation work, which undercuts the claimed advantage over fixed-cost or coverage-based methods. The abstract does not show a default or data-free way to pick n, so the practical edge is not yet obvious. I would also want to see the explicit derivations that establish the proper scoring property for this specific UA task.

This paper is for people who evaluate or deploy systems that output both a prediction and a score in high-stakes settings. A reader who already works with proper scoring rules or selective prediction will get the most out of the framework. It deserves a serious referee because the core proposal is coherent and the authors engage the existing literature rather than ignoring it. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes the ECUAS_n family of metrics for evaluating uncertainty-augmented (UA) systems that output both predictions and uncertainty scores. It argues that current practices—separate metrics for predictions and uncertainties, fixed rejection costs, or coverage-risk curves—are inadequate for assessing overall performance in decision-making under uncertainty. ECUAS_n is formulated as proper scoring rules with a single tunable parameter n that controls the trade-off between the cost of incorrect predictions and imperfect uncertainties according to use-case needs. Theoretical advantages and empirical results are presented on classification and generation tasks, including a manually annotated subset of TriviaQA.

Significance. If the proper-scoring-rule formulation holds and the empirical comparisons demonstrate clear, bias-free advantages, the work could establish a more unified and application-adaptable standard for evaluating UA systems in high-stakes settings, reducing reliance on fragmented or arbitrarily parameterized evaluation protocols.

major comments (2)

[§3] §3 (theoretical formulation): the claim that ECUAS_n constitutes a proper scoring rule for the joint prediction-uncertainty task requires an explicit derivation showing that the expected score is minimized precisely when both the prediction is correct and the uncertainty is well-calibrated; without this, the asserted superiority over separate metrics or fixed-cost approaches remains unsubstantiated.
[§5] §5 (empirical evaluation, TriviaQA experiments): the procedure for selecting or tuning n is not shown to avoid the very selection bias the paper criticizes in fixed-rejection-cost methods; if n is chosen on held-out data or expert knowledge of downstream costs, the metric loses its claimed advantage of being a single, principled scalar.
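The first major comment asks for a derivation; short of that, propriety of a candidate score can at least be probed numerically. The sketch below is an illustration under stated assumptions: it treats the confidence as the probability that the candidate answer is correct, uses the Brier score and absolute error as stand-ins (the ECUAS_n formula is not reproduced on this page), and all function names are hypothetical.

```python
def expected_cost(score, q, p):
    """Expected cost of reporting confidence q when the answer is correct with probability p."""
    return p * score(q, 1) + (1.0 - p) * score(q, 0)


def best_report(score, p, grid_size=1001):
    """Confidence on a uniform grid over [0, 1] that minimizes the expected cost."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_cost(score, q, p))


def brier(q, correct):
    return (correct - q) ** 2    # proper: the honest report q = p is optimal


def absolute_error(q, correct):
    return abs(correct - q)      # improper: an extreme report beats the honest one


true_p = 0.7                     # hypothetical probability that the answer is correct
print(best_report(brier, true_p), best_report(absolute_error, true_p))
```

For a true correctness probability of 0.7, the Brier score is minimized by reporting 0.7, while absolute error is minimized by reporting 1.0, so only the former rewards honest confidence. A grid check like this covers the uncertainty half of the referee's condition only and does not replace a proof.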

minor comments (2)

[Abstract] The abstract and introduction could more explicitly list all datasets used beyond the TriviaQA subset to allow immediate assessment of diversity.
[§3] Notation for the scoring rule should be introduced with a single running example before the general n-parameterized form to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we will make to improve the clarity and rigor of the paper.

Referee: [§3] §3 (theoretical formulation): the claim that ECUAS_n constitutes a proper scoring rule for the joint prediction-uncertainty task requires an explicit derivation showing that the expected score is minimized precisely when both the prediction is correct and the uncertainty is well-calibrated; without this, the asserted superiority over separate metrics or fixed-cost approaches remains unsubstantiated.

Authors: We acknowledge that the current manuscript would benefit from a more explicit derivation of the proper scoring rule property. In the revised version, we will expand Section 3 to include a step-by-step derivation proving that the expected value of ECUAS_n is minimized if and only if the prediction is correct and the uncertainty is perfectly calibrated to the true posterior. This will directly address the concern and provide a stronger theoretical basis for the metric's advantages.

revision: yes

Referee: [§5] §5 (empirical evaluation, TriviaQA experiments): the procedure for selecting or tuning n is not shown to avoid the very selection bias the paper criticizes in fixed-rejection-cost methods; if n is chosen on held-out data or expert knowledge of downstream costs, the metric loses its claimed advantage of being a single, principled scalar.

Authors: We appreciate this point and agree that the selection of n must be handled carefully to maintain the principled nature of the metric. Unlike fixed-rejection-cost methods where the cost parameter is often chosen arbitrarily or tuned on data, n in ECUAS_n is meant to reflect the relative costs in the specific use-case, which can be determined from domain expertise or cost-benefit analysis without reference to the evaluation data. To clarify this, we will add a subsection in the revised Section 5 discussing guidelines for choosing n based on application requirements, along with empirical sensitivity analyses showing results for a range of n values on the TriviaQA experiments. This preserves the advantage of a single scalar while making the choice transparent and use-case driven.

revision: yes

Circularity Check

0 steps flagged

ECUAS_n defined as proper scoring rules with independent theoretical and empirical support

The paper formulates ECUAS_n directly as a family of proper scoring rules for uncertainty-augmented decision making, with n as an explicit tunable parameter for cost trade-offs. It contrasts this with separate metrics or fixed-cost approaches and validates via theoretical properties plus experiments on external datasets (e.g., TriviaQA). No derivation step reduces a claimed prediction or uniqueness result to a fitted input, self-citation chain, or ansatz imported from prior work by the same authors. The central claim rests on the proper-scoring-rule construction and external empirical checks rather than self-referential definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central addition is the new metric family with the tunable n; the work rests on the domain assumption that proper scoring rules are suitable for this joint evaluation task and on standard ML evaluation practices.

free parameters (1)

n
Controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the use-case.

axioms (1)

domain assumption: Proper scoring rules are the appropriate framework for assessing overall performance of uncertainty-augmented systems in decision making under uncertainty.
Invoked to justify why ECUAS_n is principled compared to prior separate or fixed-cost methods.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a novel family of metrics, ECUAS_n, formulated as proper scoring rules... The parameter n controls the trade-off between the cost of incorrect predictions and imperfect uncertainties

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $C^*_n(y,q)$ obtained by integrating $w_n(\gamma)\,C^*_\gamma$ over $\gamma$
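Written out, the quoted passage has the shape of a weighted integral over rejection costs. The sketch below is an assumption-laden illustration rather than the paper's definition: the integration range $[0,1]$, the binary cost structure (0 for an accepted correct candidate, 1 for an accepted incorrect one, $\gamma$ for a rejection), the correctness indicator $c$ and the scalar confidence $q$ are all introduced here, and the uniform weight stands in for the paper's $w_n$, whose form is not given on this page.

```latex
% General shape suggested by the passage (the limits are an assumption)
C^*_n(y, q) = \int_0^1 w_n(\gamma)\, C^*_\gamma(y, q)\, d\gamma

% Illustrative binary case: the candidate is accepted iff 1 - q \le \gamma
C^*_\gamma(c, q) =
\begin{cases}
  1 - c  & \text{if } \gamma \ge 1 - q \quad \text{(accept)} \\
  \gamma & \text{if } \gamma < 1 - q \quad \text{(reject)}
\end{cases}

% With the uniform weight w(\gamma) = 1 the integral has a closed form
\int_0^1 C^*_\gamma(c, q)\, d\gamma
  = \int_0^{1-q} \gamma \, d\gamma + (1 - c) \int_{1-q}^{1} d\gamma
  = \frac{(1 - q)^2}{2} + (1 - c)\, q
```

In this toy case the expected cost $p\,\frac{(1-q)^2}{2} + (1-p)\,\frac{1+q^2}{2}$ has derivative $q - p$ with respect to $q$, so it is minimized at $q = p$, where $p$ is the true probability that the candidate is correct.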

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 5 internal anchors

[1] A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxI5gHKDr

[3] P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. J. Mach. Learn. Res., 9:1823–1840, 2008. URL https://api.semanticscholar.org/CorpusID:16963069

[4] N. Brummer. Measuring, refining and calibrating speaker and language information extracted from speech. PhD thesis, University of Stellenbosch, 2010. URL https://scholar.sun.ac.za/items/1b46805b-2b1e-46aa-83ce-75ede92f0159

[5] N. Brümmer. Measuring, Refining and Calibrating Speaker and Language Information Extracted from Speech. PhD thesis, Stellenbosch University, 2010.

[6] T. J. Bungert, L. Kobelke, and P. F. Jaeger. Understanding silent failures in medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 400–410. Springer, 2023.

[7] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower Provost, S. Kim, J. Chang, S. Lee, and S. Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 12 2008. doi: 10.1007/s10579-008-9076-6

[8] L. F. P. Cattelan and D. Silva. How to fix a broken confidence estimator: Evaluating post-hoc methods for selective classification with deep neural networks. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024. URL https://openreview.net/forum?id=IJBWLRCvYX

[9] J. Cen, D. Luan, S. Zhang, Y. Pei, Y. Zhang, D. Zhao, S. Shen, and Q. Chen. The devil is in the wrongly-classified samples: Towards unified open-set recognition. arXiv preprint arXiv:2302.04002, 2023.

[10] N. Charoenphakdee, Z. Cui, Y. Zhang, and M. Sugiyama. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:225041187

[11] Z. Cheng, X.-Y. Zhang, and C.-L. Liu. Unified classification and rejection: A one-versus-all framework. arXiv preprint arXiv:2311.13355, 2023.

[12] C. K. Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6(4):247–254, Dec. 1957. ISSN 0367-9950. doi: 10.1109/TEC.1957.5222035. URL https://ieeexplore.ieee.org/document/5222035

[13] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[14] A. P. Dawid and M. Musio. Theory and applications of proper scoring rules. METRON, 72(2):169–183, Apr 2014. ISSN 2281-695X.

[15] Y. Ding, J. Liu, J. Xiong, and Y. Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4–5, 2020.

[16] J. Duan, H. Cheng, S. Wang, A. Zavalny, C. Wang, R. Xu, B. Kailkhura, and K. Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, Aug. 2024. URL https://aclanthology.org/2024....

[18] K. Dyrland, A. S. Lundervold, and P. G. L. P. Mana. Does the evaluation stand up to evaluation? a first-principle approach to the evaluation of classifiers, 2023. URL https://arxiv.org/abs/2302.12006

[19] R. El-Yaniv and Y. Wiener. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research, 11(53):1605–1641, 2010. ISSN 1533-7928. URL http://jmlr.org/papers/v11/el-yaniv10a.html

[20] E. Fadeeva, A. Rubashevskii, A. Shelmanov, S. Petrakov, H. Li, H. Mubarak, E. Tsymbalov, G. Kuzmin, A. Panchenko, T. Baldwin, P. Nakov, and M. Panov. Fact-checking the output of large language models via token-level uncertainty quantification. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics...

[21] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 06 2024. doi: 10.1038/s41586-024-07421-0

[22] L. Ferrer. No need for ad-hoc substitutes: The expected cost is a principled all-purpose classification metric. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=5PPbvCExZs

[23] L. Ferrer and D. Ramos. Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=qbrE0LR7fF

[24] V. Franc, D. Prusa, and V. Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023.

[25] V. Franc, D. Prusa, and V. Voracek. Optimal Strategies for Reject Option Classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023. ISSN 1533-7928. URL http://jmlr.org/papers/v24/21-0048.html

[26] I. Galil and R. El-Yaniv. Disrupting deep uncertainty estimation without harming accuracy. Advances in Neural Information Processing Systems, 34:21285–21296, 2021.

[27] X. Gao, J. Zhang, L. Mouatadid, and K. Das. SPUQ: Perturbation-based uncertainty quantification for large language models. In Y. Graham and M. Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2336–2346, St. Julian’s, Malta, Mar. 2024. Association f...

[28] Y. Geifman and R. El-Yaniv. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://papers.nips.cc/paper_files/paper/2017/hash/4a8423d5e91fda00bb7e46540e2b0cf1-Abstract.html

[30] Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJfb5jCqKm

[31] J. Geng, F. Cai, Y. Wang, H. Koeppl, P. Nakov, and I. Gurevych. A survey of confidence estimation and calibration in large language models. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

[32] T. Gneiting and A. E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477):359–378, Mar. 2007. ISSN 0162-1459, 1537-274X. doi: 10.1198/016214506000001437. URL http://www.tandfonline.com/doi/abs/10.1198/016214506000001437

[33] I. J. Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, 01 1952. ISSN 0035-9246. doi: 10.1111/j.2517-6161.1952.tb00104.x. URL https://doi.org/10.1111/j.2517-6161.1952.tb00104.x

[34] A. Gulli. The anatomy of a news search engine. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 880–881, New York, 2005.

[35] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proc. of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.

[36] W. He, Z. Jiang, T. Xiao, Z. Xu, and Y. Li. A survey on uncertainty quantification methods for deep learning. ACM Comput. Surv., 58(7), Feb. 2026. ISSN 0360-0300. doi: 10.1145/3786319. URL https://doi.org/10.1145/3786319

[37] A. D. Hendrickson and R. J. Buehler. Proper scores for probability forecasters. The Annals of Mathematical Statistics, pages 1916–1921, 1971.

[38] D. Hendrycks and K. Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. Feb. 2017. URL https://openreview.net/forum?id=Hkg4TI9xl

[39] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

[40] J. Heo, H. B. Lee, S. Kim, J. Lee, K. J. Kim, E. Yang, and S. J. Hwang. Uncertainty-aware attention for reliable interpretation and prediction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.n...

[41] B. Hou, Y. Liu, K. Qian, J. Andreas, S. Chang, and Y. Zhang. Decomposing uncertainty for large language models through input clarification ensembling. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.

[42] M. G. M. Hunink, M. C. Weinstein, E. Wittenberg, M. F. Drummond, J. S. Pliskin, J. B. Wong, and P. P. Glasziou. Decision Making in Health and Medicine: Integrating Evidence and Values. Cambridge University Press, 2 edition, 2014.

[43] P. F. Jäger, C. Lüth, L. Klein, and T. Bungert. A call to reflect on evaluation practices for failure detection in image classification. In ICLR 2023, 2023.

[44] Z. Jiang, J. Araki, H. Ding, and G. Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021. doi: 10.1162/tacl_a_00407. URL https://aclanthology.org/2021.tacl-1.57/

[45] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Comp...

[46] Language Models (Mostly) Know What They Know. S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. ...

[47] D. Kahneman. Thinking, fast and slow. 1st ed. New York: Farrar, Straus and Giroux, 2011. URL https://search.library.wisc.edu/catalog/9910114919702121

[48] S. Kapoor, N. Gruver, M. Roberts, A. Pal, S. Dooley, M. Goldblum, and A. Wilson. Calibration-tuning: Teaching large language models to know what they don’t know. In R. Vázquez, H. Celikkanat, D. Ulmer, J. Tiedemann, S. Swayamdipta, W. Aziz, B. Plank, J. Baan, and M.-C. de Marneffe, editors, Proceedings of the 1st Workshop on Uncertainty-Aware NLP (Uncerta...

[49] J. Kim, J. Koo, and S. Hwang. A unified benchmark for the unknown detection capability of deep neural networks. Expert Systems with Applications, 229:120461, 2023.

[50] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[51] L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve

[52] A. Kumar and S. Sarawagi. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802, 2019.

[53] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc...

[54] S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=8s8K2UZGTZ

[55] Z. Lin, S. Trivedi, and J. Sun. Generating with confidence: Uncertainty quantification for black-box large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=DWkJCSxKU5

[56]

Z. Lin, S. Trivedi, and J. Sun. Contextualized sequence likelihood: Enhanced confidence scores for natural language generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024. URL https: //aclanthology.org/2024.emnlp-main.578/

[57]

A. Liu, K. Khandelwal, S. Subramanian, V. Jouault, A. Rastogi, A. Sadé, A. Jeffares, A. Q. Jiang, A. Cahill, A. Gavaudan, A. Sablayrolles, A. Héliou, A. You, A. Ehrenberg, A. D. Lo, A. Eliseev, A. Calvi, A. Sooriyarachchi, B. Bout, B. Rozière, B. D. Monicault, C. Lanfranchi, C. Barreau, C. Courtot, D. Grattarola, D. Dabert, D. de Las Casas, E. Chane-Sa...

[58]

X. Liu, M. Khalifa, and L. Wang. Litcab: Lightweight language model calibration over short- and long-form responses. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jH67LHVOIO

[60]

X. Liu, T. Chen, L. Da, C. Chen, Z. Lin, and H. Wei. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD ’25, page 6107–6117, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400714542. doi: 10.11...

[61]

D. Macêdo, T. I. Ren, C. Zanchettin, A. L. I. Oliveira, and T. Ludermir. Entropic out-of-distribution detection: Seamless detection of unknown examples. IEEE Transactions on Neural Networks and Learning Systems, 33(6):2350–2364, 2022. doi: 10.1109/TNNLS.2021.3112897

[62]

A. Malinin and M. Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jN5y-zb5Q7m

[63]

M. McLaren, L. Ferrer, D. Castan, and A. Lawson. The speakers in the wild (SITW) speaker recognition database. In Proc. Interspeech, San Francisco, Sept. 2016

[64]

S. J. Mielke, A. Szlam, E. Dinan, and Y.-L. Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi: 10.1162/tacl_a_00494. URL https://aclanthology.org/2022.tacl-1.50/

[65]

G. Morrison, C. Zhang, E. Enzinger, et al. Forensic database of voice recordings of 500+ Australian English speakers. http://databases.forensic-voice-comparison.net, 2015

[66]

G. S. Morrison, P. Rose, and C. Zhang. Protocol for the collection of databases of recordings for forensic-voice-comparison research and practice. Australian Journal of Forensic Sciences, 44(2):155–167, June 2012

[67]

M. S. A. Nadeem, J.-D. Zucker, and B. Hanczar. Accuracy-rejection curves (ARCs) for comparing classification methods with a reject option. In S. Džeroski, P. Geurts, and J. Rousu, editors, Proceedings of the third International Workshop on Machine Learning in Systems Biology, volume 8 of Proceedings of Machine Learning Research, pages 65–81, Ljubljana, Slo...

[68]

J. Naushad and I. Voiculescu. Super-trustscore: Reliable failure detection for automated skin lesion diagnosis. In 2024 IEEE International Symposium on Biomedical Imaging (ISBI), pages 1–4, 2024. doi: 10.1109/ISBI56570.2024.10635815

[69]

M. Peterson. An Introduction to Decision Theory. Cambridge Introductions to Philosophy. Cambridge University Press, 2nd edition, 2017

[70]

M. M. H. Raiffa. Decision analysis. Introductory lectures on choices under uncertainty. Recherches économiques de Louvain, 36(5):527–528, 1970

[71]

S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2010

[72]

S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Always learning. Pearson, 2016. ISBN 9781292153964. URL https://books.google.com.ar/books?id=XS9CjwEACAAJ

[73]

L. J. Savage. The foundations of statistics reconsidered. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 575–587. University of California Press, 1961

[74]

L. J. Savage. The foundations of statistics. Courier Corporation, 1972

[75]

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, Oct. 2013

[76]

E. Stengel-Eskin and B. Van Durme. Calibrated interpretation: Confidence estimation in semantic parsing. Transactions of the Association for Computational Linguistics, 11:1213–1231, 2023. doi: 10.1162/tacl_a_00598. URL https://aclanthology.org/2023.tacl-1.69/

[78]

Q. Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[79]

V. Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y. Wang, Y. Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, J. Xu, J. Zhu, J. Chen, J. Chen, J. Chen, J. Lin, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zhang...

[80]

K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pag...

[81]

D. Tran, J. Z. Liu, M. W. Dusenberry, D. Phan, M. Collier, J. Ren, K. Han, Z. Wang, Z. E. Mariet, H. Hu, N. Band, T. G. J. Rudner, Z. Nado, J. van Amersfoort, A. Kirsch, R. Jenatton, N. Thain, E. K. Buchanan, K. P. Murphy, D. Sculley, Y. Gal, Z. Ghahramani, J. Snoek, and B. Lakshminarayanan. Plex: Towards reliability using pretrained large model extensio...

[82]

J. Traub, T. J. Bungert, C. T. Lüth, M. Baumgartner, K. Maier-Hein, L. Maier-Hein, and P. F. Jaeger. Overcoming Common Flaws in the Evaluation of Selective Classification Systems. Nov
