Decision-Aligned Evaluation of Uncertainty Quantification

Annika Schneider; Joshua Stiller; Tommy Rochussen; Vincent Fortuin

arxiv: 2606.26990 · v1 · pith:OPT7NIN3new · submitted 2026-06-25 · 💻 cs.LG · cs.AI· stat.ML

Decision-Aligned Evaluation of Uncertainty Quantification

Annika Schneider , Tommy Rochussen , Joshua Stiller , Vincent Fortuin This is my paper

Pith reviewed 2026-06-26 05:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords uncertainty quantificationdecision alignmentproper scoring rulesevaluation metricsmachine learningutility functionscalibration error

0 comments

The pith

Many standard uncertainty metrics misalign with downstream decision utilities, while prior-weighted utility metrics align consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces decision-alignment as a criterion that checks whether an uncertainty metric's scores correspond to the actual utility gained from decisions made with those uncertainties. It finds that widely used metrics such as negative log-likelihood and expected calibration error either fail to match common decision problems or rest on unrealistic assumptions about the downstream task. The authors then define prior-weighted utility metrics, a subclass of proper scoring rules that incorporate priors over decision utilities to achieve alignment. Benchmark experiments and real-world case studies demonstrate that these new metrics track realized decision outcomes, whereas conventional ones do not. The work therefore identifies a gap in current uncertainty quantification evaluation practices and supplies a concrete way to close it.

Core claim

Decision-alignment is the criterion that reveals which uncertainty evaluation metrics meaningfully align with downstream utilities. Applying it shows that many widely used metrics are either misaligned with common decision problems or encode pathological prior beliefs about the downstream task. Prior-weighted utility metrics, defined as a special class of proper scoring rules, provide decision-aligned uncertainty evaluation. Across benchmark experiments and real-world case studies these metrics consistently align with realized decision utility while conventional metrics do not.

What carries the argument

The decision-alignment criterion, which tests whether a metric's performance on uncertainty estimates corresponds to the expected utility of decisions that use those estimates.

If this is right

Uncertainty evaluation protocols should incorporate explicit utility functions that reflect the intended downstream decisions.
Benchmarks that rely solely on calibration or likelihood metrics may systematically favor metrics that do not improve decisions.
Proper scoring rules can be extended to decision relevance by weighting them according to priors over possible utility functions.
Real-world deployment of uncertainty estimates should be validated against observed decision outcomes rather than generic scores alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to other evaluation settings in machine learning where generic metrics are known to diverge from task-specific value.
Teams adopting this approach would need to articulate their decision utilities explicitly before selecting an evaluation metric.
The results suggest that metric choice itself encodes implicit assumptions about how uncertainty will be used, which can be made visible and adjustable.

Load-bearing premise

The downstream decision problems can be faithfully represented by the utility functions and priors used to define the proposed metrics and the decision-alignment criterion.

What would settle it

An experiment on a new decision task in which prior-weighted utility metrics exhibit lower correlation with realized decision utility than expected calibration error or negative log-likelihood.

Figures

Figures reproduced from arXiv: 2606.26990 by Annika Schneider, Joshua Stiller, Tommy Rochussen, Vincent Fortuin.

**Figure 1.** Figure 1: Prior choices for our classification (left) and regression (right) prior-weighted utility metrics. where ϕ = (k, γ) ∈ Φ := [n − 1] × R>0, and γ controls the degree of risk aversion. The Bayes act is to select the top-k instances according to the score µi − γσ2 i . This yields a selection policy that trades off between predicted benefit and uncertainty. As in classification, we can show that none of the met… view at source ↗

**Figure 2.** Figure 2: Metric–utility alignment in classification [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high utility in downstream decisions. We introduce decision-alignment, a criterion that reveals which evaluation metrics meaningfully align with downstream utilities. Applying this framework, we show that many widely used uncertainty metrics are either misaligned with common decision problems or encode pathological prior beliefs about the downstream task. We then propose prior-weighted utility metrics, a special class of proper scoring rules that provides decision-aligned uncertainty evaluation. Across benchmark experiments and real-world case studies, our metrics consistently align with realized decision utility, while conventional metrics do not. Our results surface flaws in the current UQ evaluation protocol and offer a principled extension of existing metrics toward decision-relevant UQ evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces decision-alignment and prior-weighted utility metrics to tie UQ evaluation to downstream decisions, but the experimental claims rest on unverified choices of utilities and priors.

read the letter

The main thing to know is that this work defines a decision-alignment criterion and a class of prior-weighted utility metrics as an extension of proper scoring rules, then uses them to argue that standard UQ metrics like NLL and ECE often fail to match realized decision utility.

What the paper does well is lay out how generic metrics can implicitly assume priors that do not match common decision problems. Framing the issue in terms of utility functions and priors is a direct way to connect evaluation to practice, and the formal link to proper scoring rules keeps the proposal grounded rather than ad hoc.

The soft spot is the empirical support. The abstract and claims rest on benchmark experiments and case studies where the new metrics align with decision outcomes while others do not, yet there is no detail on how the utilities were elicited, whether they were varied, or how sensitive the misalignment results are to reasonable alternative utilities. The stress-test note is on target here: if the chosen utilities do not represent the actual downstream problems, the finding that conventional metrics are broadly misaligned does not follow.

This is for researchers focused on uncertainty quantification in applied settings who already think about decision theory. A reader looking for a principled way to critique or extend current evaluation protocols would get value from the framework, even if the experiments need tightening.

I would send it to peer review. The core idea is worth referee scrutiny because it is formally connected to existing scoring-rule literature and could affect how people design UQ benchmarks if the utility choices can be made more transparent and robust.

Referee Report

2 major / 1 minor

Summary. The paper introduces a decision-alignment criterion for assessing uncertainty quantification metrics according to their alignment with downstream decision utilities. It argues that standard metrics such as negative log-likelihood and expected calibration error are frequently misaligned with common decision problems or encode pathological priors about the task. The authors propose prior-weighted utility metrics as a class of proper scoring rules that achieve decision alignment. They report that these metrics consistently match realized decision utility across benchmark experiments and real-world case studies, whereas conventional metrics do not, and conclude that current UQ evaluation protocols require principled extension toward decision-relevant evaluation.

Significance. If the chosen utilities and priors are representative of actual deployment problems and the empirical results are robust, the work would provide a concrete way to make UQ evaluation decision-relevant rather than generic, addressing a recognized disconnect between calibration metrics and practical utility. The framing via external decision-theoretic concepts and the identification of a special class of proper scoring rules constitute a clear conceptual contribution; reproducible code or machine-checked elements are not mentioned.

major comments (2)

[Abstract] Abstract: the central empirical claim states that the proposed metrics 'consistently align with realized decision utility, while conventional metrics do not' across benchmarks and case studies, yet the abstract supplies no information on experimental design, data exclusion rules, error bars, statistical tests, or the elicitation procedure for the decision utilities and priors. Without these details the claim cannot be evaluated and may rest on post-hoc selection.
[Framework definition and experimental setup] Framework definition and experimental setup: the decision-alignment criterion and prior-weighted utility metrics are defined relative to the authors' chosen utility functions and priors. The reported alignment therefore holds only inside those choices. No independent verification is described that the selected utilities match the true downstream problems encountered in deployment, nor are robustness results shown under plausible alternative utilities; this directly affects whether the generalization that 'many widely used uncertainty metrics are either misaligned ... or encode pathological prior beliefs' follows.

minor comments (1)

[Abstract] The abstract refers to 'pathological prior beliefs' encoded by conventional metrics without giving concrete examples of those priors or the decision problems they distort; adding one or two explicit illustrations would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we address each major point directly, clarifying the intended scope of our claims while agreeing to revisions that improve transparency without altering the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim states that the proposed metrics 'consistently align with realized decision utility, while conventional metrics do not' across benchmarks and case studies, yet the abstract supplies no information on experimental design, data exclusion rules, error bars, statistical tests, or the elicitation procedure for the decision utilities and priors. Without these details the claim cannot be evaluated and may rest on post-hoc selection.

Authors: The abstract is intentionally concise to highlight the conceptual contribution. Full experimental details—including benchmark designs, case-study protocols, utility elicitation from domain experts, and statistical reporting—are provided in Sections 4–5 of the manuscript. We agree the abstract could better signal this scope and will revise it to note the use of standard benchmarks, real-world case studies with elicited utilities, and that results include variability measures across runs. The utilities were selected a priori from common decision tasks in the literature rather than post-hoc; the manuscript does not claim universality but shows consistent patterns across the chosen settings. revision: partial
Referee: [Framework definition and experimental setup] Framework definition and experimental setup: the decision-alignment criterion and prior-weighted utility metrics are defined relative to the authors' chosen utility functions and priors. The reported alignment therefore holds only inside those choices. No independent verification is described that the selected utilities match the true downstream problems encountered in deployment, nor are robustness results shown under plausible alternative utilities; this directly affects whether the generalization that 'many widely used uncertainty metrics are either misaligned ... or encode pathological prior beliefs' follows.

Authors: Decision alignment is defined relative to a utility and prior by construction; this is the central methodological point, as generic metrics cannot be expected to align with every possible downstream task. The utilities used are drawn from standard decision problems in classification/regression and from practitioner-elicited preferences in the case studies. While the paper does not claim these exhaust all deployment scenarios, it demonstrates misalignment of conventional metrics for these representative choices. We agree that explicit robustness checks under alternative utilities would strengthen the generalization claim and will add sensitivity analyses in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: framework derived from external decision theory

full rationale

The paper introduces a decision-alignment criterion and prior-weighted utility metrics as a class of proper scoring rules grounded in decision-theoretic concepts. No equations or definitions in the abstract reduce the proposed metrics to fitted parameters or experimental outcomes by construction. The claim that the new metrics align with realized decision utility while conventional ones do not is presented as an empirical finding from benchmarks and case studies, not a definitional tautology. The derivation chain relies on external decision theory rather than self-referential fitting or self-citation chains, making the result self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on standard decision theory and proper scoring rule properties drawn from prior literature; no free parameters or invented physical entities are introduced. The main new constructs are conceptual (decision-alignment test and prior-weighted metrics).

axioms (2)

domain assumption Downstream tasks admit well-defined utility functions that can be used to evaluate decisions
Invoked when defining decision-alignment and the proposed metrics; central to the claim that conventional metrics are misaligned.
domain assumption Proper scoring rules can be adapted via prior weighting to produce decision-aligned evaluation
Basis for proposing the new metric class.

invented entities (2)

decision-alignment criterion no independent evidence
purpose: Test whether an uncertainty metric aligns with downstream decision utility
New conceptual tool introduced to diagnose misalignment in existing metrics.
prior-weighted utility metrics no independent evidence
purpose: Class of proper scoring rules for decision-aligned UQ evaluation
Proposed as the solution class that satisfies the alignment criterion.

pith-pipeline@v0.9.1-grok · 5666 in / 1428 out tokens · 18295 ms · 2026-06-26T05:40:21.164873+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

101 extracted references · 27 canonical work pages

[1]

Sculley, Sebastian Nowozin, Joshua V

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V . Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. InProceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32. Curran Associa...

2019
[2]

URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods

Meet P Vadera, Jinyang Li, Adam D Cobb, Brian Jalaian, Tarek Abdelzaher, and Benjamin M Marlin. URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods. InProceedings of Machine Learning and Systems, volume 4, pages 217–237, 2022. URL https://proceedings.mlsys.org/paper_files/paper/2022/file/ 4a420...

2022
[3]

A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025

Luis Basora, Arthur Viens, Manuel Arias Chao, and Xavier Olive. A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025. URL https://linkinghub.elsevier.com/retrieve/pii/S0951832024005854

2025
[4]

Vickers and Elena B

Andrew J. Vickers and Elena B. Elkin. Decision curve analysis: A novel method for evaluating prediction models.Medical Decision Making, 26(6):565–574, 2006. URL https://journals.sagepub.com/ doi/10.1177/0272989X06295361

work page doi:10.1177/0272989x06295361 2006
[5]

Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M

Michael W. Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M. Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213. ACM, 2020. URLhttps://dl.acm.org/doi/10.1145/3368555.3384457

work page doi:10.1145/3368555.3384457 2020
[6]

Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025

Luciana Ferrer and Daniel Ramos. Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025. URL https://openreview. net/forum?id=qbrE0LR7fF

2025
[7]

Patterns, predictions, and actions: A story about machine learning,

Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: A story about machine learning,
[8]

URLhttp://arxiv.org/abs/2102.05242

arXiv
[9]

Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Qixuan Feng, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Faris Sbahi, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jaspe...

arXiv 2022
[10]

Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007. URL http://www.tandfonline. com/doi/abs/10.1198/016214506000001437

work page doi:10.1198/016214506000001437 2007
[11]

Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010

Niko Brümmer. Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010. URL https://scispace.com/pdf/ measuring-refining-and-calibrating-speaker-and-language-53glvzuiuj.pdf . PhD Thesis

2010
[12]

Romain Pic, Clément Dombry, Philippe Naveau, and Maxime Taillardat. Proper scoring rules for multi- variate probabilistic forecasts based on aggregation and transformation.Advances in Statistical Clima- tology, Meteorology and Oceanography, 11(1):23–58, 2025. URL https://ascmo.copernicus.org/ articles/11/23/2025/

2025
[13]

The neu- ral testbed: Evaluating joint predictions

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Brendan O’Donoghue, Botao Hao, Dieterich Lawson, Morteza Ibrahimi, Xiuyuan Lu, Vikranth Dwaracherla, and Benjamin Van Roy. The neu- ral testbed: Evaluating joint predictions. InProceedings of the 36th International Conference on Neural Information Processing Systems, volume 35 ofNIPS ’22, pages 12554–12565. C...

2022
[14]

Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014

Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014. URLhttp://link.springer.com/10.1007/s40300-014-0039-y

work page doi:10.1007/s40300-014-0039-y 2014
[15]

Leonard J. Savage. Elicitation of personal probabilities and expectations.Journal of the American Statistical Association, 66(336):783–801, 1971. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.1971. 10482346. URLhttp://www.tandfonline.com/doi/abs/10.1080/01621459.1971.10482346

work page doi:10.1080/01621459.1971 1971
[16]

Schervish

Mark J. Schervish. A general method for comparing probability assessors.The Annals of Statistics, 17(4): 1856–1879, 1989. URLhttps://www.jstor.org/stable/2241668

arXiv 1989
[17]

Loss functions for binary class probability estimation and classification: Structure and applications, 2005

Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications, 2005. URL http://stat.wharton.upenn.edu/~buja/ PAPERS/paper-proper-scoring.pdf

2005
[18]

Reid and Robert C

Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments.Journal of Machine Learning Research, 12:731–817, 2011. URL https://jmlr.csail.mit.edu/papers/ v12/reid11a.html

2011
[19]

Werner Ehm, Tilmann Gneiting, Alexander Jordan, and Fabian Krüger. Of quantiles and expectiles: Consistent scoring functions, choquet representations and forecast rankings.Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3):505–562, 2016. URL https://academic.oup.com/ jrsssb/article/78/3/505/7040984

2016
[20]

Well-calibrated regression uncertainty in medical imaging with deep learning

Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Luder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. InProceedings of the Third Conference on Medical Imaging with Deep Learning, volume 121, pages 393–412. PMLR, 2020. URL https: //proceedings.mlr.press/v121/laves20a.html

2020
[21]

Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023

Ruotao Zhang, Constantine Gatsonis, and Jon Arni Steingrimsson. Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023. URL https://journals.sagepub.com/doi/10.1177/09622802231158811

work page doi:10.1177/09622802231158811 2023
[22]

Gruber and Florian Buettner

Sebastian G. Gruber and Florian Buettner. Better uncertainty calibration via proper scores for classification and beyond. In36th Conference on Neural Information Processing Systems, 2022. URL http://arxiv. org/abs/2203.07835

arXiv 2022
[23]

U-calibration: Forecasting for an unknown agent

Robert Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. InProceedings of Machine Learning Research, volume 195, pages 1–3, 2023. URL https://proceedings.mlr.press/v195/kleinberg23a/kleinberg23a.pdf

2023
[24]

Calibration error for decision making, 2024

Lunjia Hu and Yifan Wu. Calibration error for decision making, 2024. URL http://arxiv.org/abs/ 2404.13503

arXiv 2024
[25]

Truthfulness of decision-theoretic calibration measures, 2025

Mingda Qiao and Eric Zhao. Truthfulness of decision-theoretic calibration measures, 2025. URL http://arxiv.org/abs/2503.02384

arXiv 2025
[26]

Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025

Mochen Yang and Xuan Bi. Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025. URLhttps://pubsonline.informs.org/doi/10.1287/ijds.2024.0038

work page doi:10.1287/ijds.2024.0038 2025
[27]

Jordan, and Aymeric Dieuleveut

Mahmoud Hegazy, Michael I. Jordan, and Aymeric Dieuleveut. Scalable utility-aware multiclass calibration,
[28]

URLhttp://arxiv.org/abs/2510.25458

arXiv
[29]

Rank-based decomposable losses in machine learning: A survey

Shu Hu, Xin Wang, and Siwei Lyu. Rank-based decomposable losses in machine learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2023. URL https: //ieeexplore.ieee.org/document/10184478/

arXiv 2023
[30]

The foundations of cost-sensitive learning

Charles Elkan. The foundations of cost-sensitive learning. InProceedings of the 17th International Joint Conference on Artificial Intelligence, volume 2 ofIJCAI’01, pages 973–978. Morgan Kaufmann Publishers Inc., 2001. URLhttps://dl.acm.org/doi/abs/10.5555/1642194.1642224

work page doi:10.5555/1642194.1642224 2001
[31]

Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021

Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021. URLhttp://arxiv.org/abs/2206.12858

arXiv 2021
[32]

A no-regret generalization of hierarchical softmax to extreme multi-label classification

Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, and Krzysztof Dembczy´nski. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In32nd Conference on Neural Information Processing Systems, 2018. URLhttp://arxiv.org/abs/1810.11671. 11

Pith/arXiv arXiv 2018
[33]

Model cards for model reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 220–229, New York, NY , USA,
[34]

D., & Gebru, T

Association for Computing Machinery. URLhttps://doi.org/10.1145/3287560.3287596

work page doi:10.1145/3287560.3287596
[35]

Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme

Max Bruninx, Timothy Verstraeten, Jalal Kazempour, and Jan Helsen. Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme. InProceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, pages 719–726. ACM, 2025. URL https: //dl.acm.org/doi/10.1145/3679240.3734639

work page doi:10.1145/3679240.3734639 2025
[36]

Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation

Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation. InAdvances in Neural Information Processing Systems, volume 35, pages 29205–29216. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/bc1d640f841f752c689aae20b31198c1-Paper-Conference.pdf

2022
[37]

On second-order scoring rules for epistemic uncertainty quantification

Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. On second-order scoring rules for epistemic uncertainty quantification. InProceedings of the 40th International Conference on Machine Learning, volume 202, pages 2078–2091. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/ v202/bengs23a.html

2078
[38]

Random-set neural networks

Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Keivan K1 Shariatmadar, and Fabio Cuzzolin. Random-set neural networks. InInternational Conference on Learning Representations,
[39]

URLhttps://openreview.net/forum?id=pdjkikvCch
[40]

Shireen Kudukkil Manchingal, Andrew Bradley, Julian F. P. Kooij, Keivan K1 Shariatmadar, Neil Yorke- Smith, and Fabio Cuzzolin. Epistemic artificial intelligence is essential for machine learning models to truly’know when they do not know’, 2025. URLhttps://arxiv.org/abs/2505.04950

arXiv 2025
[41]

Allen Bradley and Stuart S

A. Allen Bradley and Stuart S. Schwartz. Summary verification measures and their interpretation for ensemble forecasts.Monthly Weather Review, 139(9):3075–3089, 2011. URL https://journals. ametsoc.org/doi/10.1175/2010MWR3305.1

work page doi:10.1175/2010mwr3305.1 2011
[42]

Aligning the evaluation of probabilistic predictions with downstream value, 2025

Novin Shahroudi, Viacheslav Komisarenko, and Meelis Kull. Aligning the evaluation of probabilistic predictions with downstream value, 2025. URLhttp://arxiv.org/abs/2508.18251

arXiv 2025
[43]

Williamson

Rabanus Derr and Robert C. Williamson. Forecast evaluation and the relationship of regret and calibration,
[44]

URLhttp://arxiv.org/abs/2401.14483

Pith/arXiv arXiv
[45]

Aligning evaluation with clinical priorities: Calibration, label shift, and error costs

Gerardo A Flores, Alyssa H Smith, Julia A Fukuyama, and Ashia C Wilson. Aligning evaluation with clinical priorities: Calibration, label shift, and error costs. InThe 39th Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=b7Ka31HvXU

2025
[46]

Smith, Julia A

Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A. Fukuyama, and Ashia C. Wilson. A con- sequentialist critique of binary classification evaluation: Theory, practice, and tools, 2026. URL http://arxiv.org/abs/2504.04528

arXiv 2026
[47]

Calibrated reliable regression using maximum mean discrepancy

Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33 of NIPS ’20, pages 17164–17175. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/c74c4bf0dad9cbae3d80faa054b7d8...

2020
[48]

Andrey Malinin, Neil Band, Ganshin, Alexander, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Roginskiy, Denis, Mariya Shmatova, Panos Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In35th Confere...

2021
[49]

Towards clear expecta- tions for uncertainty estimation, 2022

Victor Bouvier, Simona Maggio, Alexandre Abraham, and Léo Dreyfus-Schmidt. Towards clear expecta- tions for uncertainty estimation, 2022. URLhttp://arxiv.org/abs/2207.13341

arXiv 2022
[50]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330. PMLR, 2017. URLhttps://proceedings.mlr.press/v70/guo17a.html

2017
[51]

Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami

Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami. Prior knowledge elicitation: The past, present, and future.Bayesian Analysis, 19(4):1129–1161,
[52]

URLhttps://doi.org/10.1214/23-BA1381. 12

work page doi:10.1214/23-ba1381
[53]

Scalable variational gaussian process classification

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. InProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 351–360. PMLR, 2015. URL https://proceedings.mlr.press/v38/ hensman15.html

2015
[54]

TabPFN: A transformer that solves small tabular classification problems in a second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InThe 11th International Conference on Learning Representations, 2023. URLhttps://openreview.net/pdf?id=cp5PvcI6w8_

2023
[55]

Revisiting deep learning models for tabular data

Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. InAdvances in Neural Information Processing Systems, volume 34, pages 18932– 18943. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/file/9d86d83f925f2149e9edb0ac3b49229c-Paper.pdf

2021
[56]

Bayan Bruss, and Tom Goldstein

Gowthami Somepalli, Avi Schwarzschild, Micah Goldblum, C. Bayan Bruss, and Tom Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. In NeurIPS 2022 First Table Representation Workshop, 2022. URL https://openreview.net/forum? id=FiyUTAy4sB8

2022
[57]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/ paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf

2017
[58]

NGBoost: Natural gradient boosting for probabilistic prediction

Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K Thai, Sanjay Basu, Andrew Ng, and Alejandro Schuler. NGBoost: Natural gradient boosting for probabilistic prediction. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research. PMLR,
[59]

URLhttps://proceedings.mlr.press/v119/duan20a.html
[60]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011. URLhttps://jmlr.org/papers/v12/pedregosa11a.html

2011
[61]

Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration

Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. InAdvances in Neural In- formation Processing Systems, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/ 27e8e17134dd7083b050476733207ea1-Abstract.html

2018
[62]

The UCI machine learning repository, 2025

Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository, 2025. URLhttps://archive.ics.uci.edu

2025
[63]

Bank marketing, 2014

S Moro, P Rita, and P Cortez. Bank marketing, 2014. URLhttps://doi.org/10.24432/C5K306

work page doi:10.24432/c5k306 2014
[64]

Heart disease, 1989

Andras Janosi, William Steinbrunn, Matthias Pfisterer, and Robert Detrano. Heart disease, 1989. URL https://doi.org/10.24432/C52P4X

work page doi:10.24432/c52p4x 1989
[65]

Ionosphere, 1989

V Sigillito, S Wing, L Hutton, and K Baker. Ionosphere, 1989. URL https://doi.org/10.24432/ C5W01B

1989
[66]

Mushroom, 1981

Jeff Schlimmer. Mushroom, 1981. URLhttps://doi.org/10.24432/C5959T

work page doi:10.24432/c5959t 1981
[67]

Connectionist bench (sonar, mines vs

T Sejnowski and R Gorman. Connectionist bench (sonar, mines vs. rocks), 1988. URL https://doi. org/10.24432/C5T01Q

work page doi:10.24432/c5t01q 1988
[68]

Air quality, 2008

S Vito. Air quality, 2008. URLhttps://doi.org/10.24432/C59K5F

work page doi:10.24432/c59k5f 2008
[69]

Auto MPG, 1993

R Quinlan. Auto MPG, 1993. URLhttps://doi.org/10.24432/C5859H

work page doi:10.24432/c5859h 1993
[70]

Energy efficiency, 2012

Athanasios Tsanas and Angeliki Xifara. Energy efficiency, 2012. URL https://doi.org/10.24432/ C51307

2012
[71]

Combined cycle power plant, 2014

P Tfekci and H Kaya. Combined cycle power plant, 2014. URL https://doi.org/10.24432/C5002N

work page doi:10.24432/c5002n 2014
[72]

Wine quality, 2009

P Cortez, A Cerdeira, F Almeida, T Matos, and J Reis. Wine quality, 2009. URL https://doi.org/10. 24432/C56S3T

2009
[73]

Elia open data platform, 2025

Elia. Elia open data platform, 2025. URLhttps://opendata.elia.be. 13

2025
[74]

ENTSO-e transparency platform,

European Network of Transmission System Operators for Electricity. ENTSO-e transparency platform,
[75]

URLhttps://transparency.entsoe.eu
[76]

Example-dependent cost-sensitive logistic regression for credit scoring

Alejandro Correa Bahnsen, Djamia Aouada, and Bjorn Ottersten. Example-dependent cost-sensitive logistic regression for credit scoring. In13th International Conference on Machine Learning and Applications, pages 263–269. IEEE, 2014. URLhttp://ieeexplore.ieee.org/document/7033125/

arXiv 2014
[77]

Credit risk assessment on a private label credit card application, 2009

PAKDD. Credit risk assessment on a private label credit card application, 2009. URL https://pakdd. org/archive/pakdd2009/front/show/competition.html. Data Mining Competition

2009
[78]

GiveMeSomeCredit, 2011

Will Cukierski. GiveMeSomeCredit, 2011. URL https://kaggle.com/competitions/ GiveMeSomeCredit. Credit Fusion

2011
[79]

costcla, 2016

Alejandro Correa Bahnsen. costcla, 2016. URL https://github.com/albahnsen/ CostSensitiveClassification

2016
[80]

Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021

Ajay Byanjankar, József Mezei, and Markku Heikkilä. Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021. URL https://onlinelibrary.wiley.com/doi/10.1002/isaf. 1490

work page doi:10.1002/isaf 2021

Showing first 80 references.

[1] [1]

Sculley, Sebastian Nowozin, Joshua V

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V . Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. InProceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32. Curran Associa...

2019

[2] [2]

URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods

Meet P Vadera, Jinyang Li, Adam D Cobb, Brian Jalaian, Tarek Abdelzaher, and Benjamin M Marlin. URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods. InProceedings of Machine Learning and Systems, volume 4, pages 217–237, 2022. URL https://proceedings.mlsys.org/paper_files/paper/2022/file/ 4a420...

2022

[3] [3]

A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025

Luis Basora, Arthur Viens, Manuel Arias Chao, and Xavier Olive. A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025. URL https://linkinghub.elsevier.com/retrieve/pii/S0951832024005854

2025

[4] [4]

Vickers and Elena B

Andrew J. Vickers and Elena B. Elkin. Decision curve analysis: A novel method for evaluating prediction models.Medical Decision Making, 26(6):565–574, 2006. URL https://journals.sagepub.com/ doi/10.1177/0272989X06295361

work page doi:10.1177/0272989x06295361 2006

[5] [5]

Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M

Michael W. Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M. Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213. ACM, 2020. URLhttps://dl.acm.org/doi/10.1145/3368555.3384457

work page doi:10.1145/3368555.3384457 2020

[6] [6]

Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025

Luciana Ferrer and Daniel Ramos. Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025. URL https://openreview. net/forum?id=qbrE0LR7fF

2025

[7] [7]

Patterns, predictions, and actions: A story about machine learning,

Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: A story about machine learning,

[8] [8]

URLhttp://arxiv.org/abs/2102.05242

arXiv

[9] [9]

Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Qixuan Feng, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Faris Sbahi, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jaspe...

arXiv 2022

[10] [10]

Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007. URL http://www.tandfonline. com/doi/abs/10.1198/016214506000001437

work page doi:10.1198/016214506000001437 2007

[11] [11]

Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010

Niko Brümmer. Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010. URL https://scispace.com/pdf/ measuring-refining-and-calibrating-speaker-and-language-53glvzuiuj.pdf . PhD Thesis

2010

[12] [12]

Romain Pic, Clément Dombry, Philippe Naveau, and Maxime Taillardat. Proper scoring rules for multi- variate probabilistic forecasts based on aggregation and transformation.Advances in Statistical Clima- tology, Meteorology and Oceanography, 11(1):23–58, 2025. URL https://ascmo.copernicus.org/ articles/11/23/2025/

2025

[13] [13]

The neu- ral testbed: Evaluating joint predictions

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Brendan O’Donoghue, Botao Hao, Dieterich Lawson, Morteza Ibrahimi, Xiuyuan Lu, Vikranth Dwaracherla, and Benjamin Van Roy. The neu- ral testbed: Evaluating joint predictions. InProceedings of the 36th International Conference on Neural Information Processing Systems, volume 35 ofNIPS ’22, pages 12554–12565. C...

2022

[14] [14]

Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014

Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014. URLhttp://link.springer.com/10.1007/s40300-014-0039-y

work page doi:10.1007/s40300-014-0039-y 2014

[15] [15]

Leonard J. Savage. Elicitation of personal probabilities and expectations.Journal of the American Statistical Association, 66(336):783–801, 1971. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.1971. 10482346. URLhttp://www.tandfonline.com/doi/abs/10.1080/01621459.1971.10482346

work page doi:10.1080/01621459.1971 1971

[16] [16]

Schervish

Mark J. Schervish. A general method for comparing probability assessors.The Annals of Statistics, 17(4): 1856–1879, 1989. URLhttps://www.jstor.org/stable/2241668

arXiv 1989

[17] [17]

Loss functions for binary class probability estimation and classification: Structure and applications, 2005

Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications, 2005. URL http://stat.wharton.upenn.edu/~buja/ PAPERS/paper-proper-scoring.pdf

2005

[18] [18]

Reid and Robert C

Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments.Journal of Machine Learning Research, 12:731–817, 2011. URL https://jmlr.csail.mit.edu/papers/ v12/reid11a.html

2011

[19] [19]

Werner Ehm, Tilmann Gneiting, Alexander Jordan, and Fabian Krüger. Of quantiles and expectiles: Consistent scoring functions, choquet representations and forecast rankings.Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3):505–562, 2016. URL https://academic.oup.com/ jrsssb/article/78/3/505/7040984

2016

[20] [20]

Well-calibrated regression uncertainty in medical imaging with deep learning

Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Luder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. InProceedings of the Third Conference on Medical Imaging with Deep Learning, volume 121, pages 393–412. PMLR, 2020. URL https: //proceedings.mlr.press/v121/laves20a.html

2020

[21] [21]

Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023

Ruotao Zhang, Constantine Gatsonis, and Jon Arni Steingrimsson. Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023. URL https://journals.sagepub.com/doi/10.1177/09622802231158811

work page doi:10.1177/09622802231158811 2023

[22] [22]

Gruber and Florian Buettner

Sebastian G. Gruber and Florian Buettner. Better uncertainty calibration via proper scores for classification and beyond. In36th Conference on Neural Information Processing Systems, 2022. URL http://arxiv. org/abs/2203.07835

arXiv 2022

[23] [23]

U-calibration: Forecasting for an unknown agent

Robert Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. InProceedings of Machine Learning Research, volume 195, pages 1–3, 2023. URL https://proceedings.mlr.press/v195/kleinberg23a/kleinberg23a.pdf

2023

[24] [24]

Calibration error for decision making, 2024

Lunjia Hu and Yifan Wu. Calibration error for decision making, 2024. URL http://arxiv.org/abs/ 2404.13503

arXiv 2024

[25] [25]

Truthfulness of decision-theoretic calibration measures, 2025

Mingda Qiao and Eric Zhao. Truthfulness of decision-theoretic calibration measures, 2025. URL http://arxiv.org/abs/2503.02384

arXiv 2025

[26] [26]

Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025

Mochen Yang and Xuan Bi. Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025. URLhttps://pubsonline.informs.org/doi/10.1287/ijds.2024.0038

work page doi:10.1287/ijds.2024.0038 2025

[27] [27]

Jordan, and Aymeric Dieuleveut

Mahmoud Hegazy, Michael I. Jordan, and Aymeric Dieuleveut. Scalable utility-aware multiclass calibration,

[28] [28]

URLhttp://arxiv.org/abs/2510.25458

arXiv

[29] [29]

Rank-based decomposable losses in machine learning: A survey

Shu Hu, Xin Wang, and Siwei Lyu. Rank-based decomposable losses in machine learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2023. URL https: //ieeexplore.ieee.org/document/10184478/

arXiv 2023

[30] [30]

The foundations of cost-sensitive learning

Charles Elkan. The foundations of cost-sensitive learning. InProceedings of the 17th International Joint Conference on Artificial Intelligence, volume 2 ofIJCAI’01, pages 973–978. Morgan Kaufmann Publishers Inc., 2001. URLhttps://dl.acm.org/doi/abs/10.5555/1642194.1642224

work page doi:10.5555/1642194.1642224 2001

[31] [31]

Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021

Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021. URLhttp://arxiv.org/abs/2206.12858

arXiv 2021

[32] [32]

A no-regret generalization of hierarchical softmax to extreme multi-label classification

Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, and Krzysztof Dembczy´nski. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In32nd Conference on Neural Information Processing Systems, 2018. URLhttp://arxiv.org/abs/1810.11671. 11

Pith/arXiv arXiv 2018

[33] [33]

Model cards for model reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 220–229, New York, NY , USA,

[34] [34]

D., & Gebru, T

Association for Computing Machinery. URLhttps://doi.org/10.1145/3287560.3287596

work page doi:10.1145/3287560.3287596

[35] [35]

Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme

Max Bruninx, Timothy Verstraeten, Jalal Kazempour, and Jan Helsen. Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme. InProceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, pages 719–726. ACM, 2025. URL https: //dl.acm.org/doi/10.1145/3679240.3734639

work page doi:10.1145/3679240.3734639 2025

[36] [36]

Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation

Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation. InAdvances in Neural Information Processing Systems, volume 35, pages 29205–29216. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/bc1d640f841f752c689aae20b31198c1-Paper-Conference.pdf

2022

[37] [37]

On second-order scoring rules for epistemic uncertainty quantification

Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. On second-order scoring rules for epistemic uncertainty quantification. InProceedings of the 40th International Conference on Machine Learning, volume 202, pages 2078–2091. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/ v202/bengs23a.html

2078

[38] [38]

Random-set neural networks

Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Keivan K1 Shariatmadar, and Fabio Cuzzolin. Random-set neural networks. InInternational Conference on Learning Representations,

[39] [39]

URLhttps://openreview.net/forum?id=pdjkikvCch

[40] [40]

Shireen Kudukkil Manchingal, Andrew Bradley, Julian F. P. Kooij, Keivan K1 Shariatmadar, Neil Yorke- Smith, and Fabio Cuzzolin. Epistemic artificial intelligence is essential for machine learning models to truly’know when they do not know’, 2025. URLhttps://arxiv.org/abs/2505.04950

arXiv 2025

[41] [41]

Allen Bradley and Stuart S

A. Allen Bradley and Stuart S. Schwartz. Summary verification measures and their interpretation for ensemble forecasts.Monthly Weather Review, 139(9):3075–3089, 2011. URL https://journals. ametsoc.org/doi/10.1175/2010MWR3305.1

work page doi:10.1175/2010mwr3305.1 2011

[42] [42]

Aligning the evaluation of probabilistic predictions with downstream value, 2025

Novin Shahroudi, Viacheslav Komisarenko, and Meelis Kull. Aligning the evaluation of probabilistic predictions with downstream value, 2025. URLhttp://arxiv.org/abs/2508.18251

arXiv 2025

[43] [43]

Williamson

Rabanus Derr and Robert C. Williamson. Forecast evaluation and the relationship of regret and calibration,

[44] [44]

URLhttp://arxiv.org/abs/2401.14483

Pith/arXiv arXiv

[45] [45]

Aligning evaluation with clinical priorities: Calibration, label shift, and error costs

Gerardo A Flores, Alyssa H Smith, Julia A Fukuyama, and Ashia C Wilson. Aligning evaluation with clinical priorities: Calibration, label shift, and error costs. InThe 39th Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=b7Ka31HvXU

2025

[46] [46]

Smith, Julia A

Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A. Fukuyama, and Ashia C. Wilson. A con- sequentialist critique of binary classification evaluation: Theory, practice, and tools, 2026. URL http://arxiv.org/abs/2504.04528

arXiv 2026

[47] [47]

Calibrated reliable regression using maximum mean discrepancy

Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33 of NIPS ’20, pages 17164–17175. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/c74c4bf0dad9cbae3d80faa054b7d8...

2020

[48] [48]

Andrey Malinin, Neil Band, Ganshin, Alexander, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Roginskiy, Denis, Mariya Shmatova, Panos Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In35th Confere...

2021

[49] [49]

Towards clear expecta- tions for uncertainty estimation, 2022

Victor Bouvier, Simona Maggio, Alexandre Abraham, and Léo Dreyfus-Schmidt. Towards clear expecta- tions for uncertainty estimation, 2022. URLhttp://arxiv.org/abs/2207.13341

arXiv 2022

[50] [50]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330. PMLR, 2017. URLhttps://proceedings.mlr.press/v70/guo17a.html

2017

[51] [51]

Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami

Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami. Prior knowledge elicitation: The past, present, and future.Bayesian Analysis, 19(4):1129–1161,

[52] [52]

URLhttps://doi.org/10.1214/23-BA1381. 12

work page doi:10.1214/23-ba1381

[53] [53]

Scalable variational gaussian process classification

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. InProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 351–360. PMLR, 2015. URL https://proceedings.mlr.press/v38/ hensman15.html

2015

[54] [54]

TabPFN: A transformer that solves small tabular classification problems in a second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InThe 11th International Conference on Learning Representations, 2023. URLhttps://openreview.net/pdf?id=cp5PvcI6w8_

2023

[55] [55]

Revisiting deep learning models for tabular data

Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. InAdvances in Neural Information Processing Systems, volume 34, pages 18932– 18943. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/file/9d86d83f925f2149e9edb0ac3b49229c-Paper.pdf

2021

[56] [56]

Bayan Bruss, and Tom Goldstein

Gowthami Somepalli, Avi Schwarzschild, Micah Goldblum, C. Bayan Bruss, and Tom Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. In NeurIPS 2022 First Table Representation Workshop, 2022. URL https://openreview.net/forum? id=FiyUTAy4sB8

2022

[57] [57]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/ paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf

2017

[58] [58]

NGBoost: Natural gradient boosting for probabilistic prediction

Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K Thai, Sanjay Basu, Andrew Ng, and Alejandro Schuler. NGBoost: Natural gradient boosting for probabilistic prediction. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research. PMLR,

[59] [59]

URLhttps://proceedings.mlr.press/v119/duan20a.html

[60] [60]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011. URLhttps://jmlr.org/papers/v12/pedregosa11a.html

2011

[61] [61]

Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration

Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. InAdvances in Neural In- formation Processing Systems, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/ 27e8e17134dd7083b050476733207ea1-Abstract.html

2018

[62] [62]

The UCI machine learning repository, 2025

Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository, 2025. URLhttps://archive.ics.uci.edu

2025

[63] [63]

Bank marketing, 2014

S Moro, P Rita, and P Cortez. Bank marketing, 2014. URLhttps://doi.org/10.24432/C5K306

work page doi:10.24432/c5k306 2014

[64] [64]

Heart disease, 1989

Andras Janosi, William Steinbrunn, Matthias Pfisterer, and Robert Detrano. Heart disease, 1989. URL https://doi.org/10.24432/C52P4X

work page doi:10.24432/c52p4x 1989

[65] [65]

Ionosphere, 1989

V Sigillito, S Wing, L Hutton, and K Baker. Ionosphere, 1989. URL https://doi.org/10.24432/ C5W01B

1989

[66] [66]

Mushroom, 1981

Jeff Schlimmer. Mushroom, 1981. URLhttps://doi.org/10.24432/C5959T

work page doi:10.24432/c5959t 1981

[67] [67]

Connectionist bench (sonar, mines vs

T Sejnowski and R Gorman. Connectionist bench (sonar, mines vs. rocks), 1988. URL https://doi. org/10.24432/C5T01Q

work page doi:10.24432/c5t01q 1988

[68] [68]

Air quality, 2008

S Vito. Air quality, 2008. URLhttps://doi.org/10.24432/C59K5F

work page doi:10.24432/c59k5f 2008

[69] [69]

Auto MPG, 1993

R Quinlan. Auto MPG, 1993. URLhttps://doi.org/10.24432/C5859H

work page doi:10.24432/c5859h 1993

[70] [70]

Energy efficiency, 2012

Athanasios Tsanas and Angeliki Xifara. Energy efficiency, 2012. URL https://doi.org/10.24432/ C51307

2012

[71] [71]

Combined cycle power plant, 2014

P Tfekci and H Kaya. Combined cycle power plant, 2014. URL https://doi.org/10.24432/C5002N

work page doi:10.24432/c5002n 2014

[72] [72]

Wine quality, 2009

P Cortez, A Cerdeira, F Almeida, T Matos, and J Reis. Wine quality, 2009. URL https://doi.org/10. 24432/C56S3T

2009

[73] [73]

Elia open data platform, 2025

Elia. Elia open data platform, 2025. URLhttps://opendata.elia.be. 13

2025

[74] [74]

ENTSO-e transparency platform,

European Network of Transmission System Operators for Electricity. ENTSO-e transparency platform,

[75] [75]

URLhttps://transparency.entsoe.eu

[76] [76]

Example-dependent cost-sensitive logistic regression for credit scoring

Alejandro Correa Bahnsen, Djamia Aouada, and Bjorn Ottersten. Example-dependent cost-sensitive logistic regression for credit scoring. In13th International Conference on Machine Learning and Applications, pages 263–269. IEEE, 2014. URLhttp://ieeexplore.ieee.org/document/7033125/

arXiv 2014

[77] [77]

Credit risk assessment on a private label credit card application, 2009

PAKDD. Credit risk assessment on a private label credit card application, 2009. URL https://pakdd. org/archive/pakdd2009/front/show/competition.html. Data Mining Competition

2009

[78] [78]

GiveMeSomeCredit, 2011

Will Cukierski. GiveMeSomeCredit, 2011. URL https://kaggle.com/competitions/ GiveMeSomeCredit. Credit Fusion

2011

[79] [79]

costcla, 2016

Alejandro Correa Bahnsen. costcla, 2016. URL https://github.com/albahnsen/ CostSensitiveClassification

2016

[80] [80]

Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021

Ajay Byanjankar, József Mezei, and Markku Heikkilä. Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021. URL https://onlinelibrary.wiley.com/doi/10.1002/isaf. 1490

work page doi:10.1002/isaf 2021