pith. sign in

arxiv: 2606.26990 · v1 · pith:OPT7NIN3new · submitted 2026-06-25 · 💻 cs.LG · cs.AI· stat.ML

Decision-Aligned Evaluation of Uncertainty Quantification

Pith reviewed 2026-06-26 05:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords uncertainty quantificationdecision alignmentproper scoring rulesevaluation metricsmachine learningutility functionscalibration error
0
0 comments X

The pith

Many standard uncertainty metrics misalign with downstream decision utilities, while prior-weighted utility metrics align consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces decision-alignment as a criterion that checks whether an uncertainty metric's scores correspond to the actual utility gained from decisions made with those uncertainties. It finds that widely used metrics such as negative log-likelihood and expected calibration error either fail to match common decision problems or rest on unrealistic assumptions about the downstream task. The authors then define prior-weighted utility metrics, a subclass of proper scoring rules that incorporate priors over decision utilities to achieve alignment. Benchmark experiments and real-world case studies demonstrate that these new metrics track realized decision outcomes, whereas conventional ones do not. The work therefore identifies a gap in current uncertainty quantification evaluation practices and supplies a concrete way to close it.

Core claim

Decision-alignment is the criterion that reveals which uncertainty evaluation metrics meaningfully align with downstream utilities. Applying it shows that many widely used metrics are either misaligned with common decision problems or encode pathological prior beliefs about the downstream task. Prior-weighted utility metrics, defined as a special class of proper scoring rules, provide decision-aligned uncertainty evaluation. Across benchmark experiments and real-world case studies these metrics consistently align with realized decision utility while conventional metrics do not.

What carries the argument

The decision-alignment criterion, which tests whether a metric's performance on uncertainty estimates corresponds to the expected utility of decisions that use those estimates.

If this is right

  • Uncertainty evaluation protocols should incorporate explicit utility functions that reflect the intended downstream decisions.
  • Benchmarks that rely solely on calibration or likelihood metrics may systematically favor metrics that do not improve decisions.
  • Proper scoring rules can be extended to decision relevance by weighting them according to priors over possible utility functions.
  • Real-world deployment of uncertainty estimates should be validated against observed decision outcomes rather than generic scores alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to other evaluation settings in machine learning where generic metrics are known to diverge from task-specific value.
  • Teams adopting this approach would need to articulate their decision utilities explicitly before selecting an evaluation metric.
  • The results suggest that metric choice itself encodes implicit assumptions about how uncertainty will be used, which can be made visible and adjustable.

Load-bearing premise

The downstream decision problems can be faithfully represented by the utility functions and priors used to define the proposed metrics and the decision-alignment criterion.

What would settle it

An experiment on a new decision task in which prior-weighted utility metrics exhibit lower correlation with realized decision utility than expected calibration error or negative log-likelihood.

Figures

Figures reproduced from arXiv: 2606.26990 by Annika Schneider, Joshua Stiller, Tommy Rochussen, Vincent Fortuin.

Figure 1
Figure 1. Figure 1: Prior choices for our classification (left) and regression (right) prior-weighted utility metrics. where ϕ = (k, γ) ∈ Φ := [n − 1] × R>0, and γ controls the degree of risk aversion. The Bayes act is to select the top-k instances according to the score µi − γσ2 i . This yields a selection policy that trades off between predicted benefit and uncertainty. As in classification, we can show that none of the met… view at source ↗
Figure 2
Figure 2. Figure 2: Metric–utility alignment in classification [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high utility in downstream decisions. We introduce decision-alignment, a criterion that reveals which evaluation metrics meaningfully align with downstream utilities. Applying this framework, we show that many widely used uncertainty metrics are either misaligned with common decision problems or encode pathological prior beliefs about the downstream task. We then propose prior-weighted utility metrics, a special class of proper scoring rules that provides decision-aligned uncertainty evaluation. Across benchmark experiments and real-world case studies, our metrics consistently align with realized decision utility, while conventional metrics do not. Our results surface flaws in the current UQ evaluation protocol and offer a principled extension of existing metrics toward decision-relevant UQ evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a decision-alignment criterion for assessing uncertainty quantification metrics according to their alignment with downstream decision utilities. It argues that standard metrics such as negative log-likelihood and expected calibration error are frequently misaligned with common decision problems or encode pathological priors about the task. The authors propose prior-weighted utility metrics as a class of proper scoring rules that achieve decision alignment. They report that these metrics consistently match realized decision utility across benchmark experiments and real-world case studies, whereas conventional metrics do not, and conclude that current UQ evaluation protocols require principled extension toward decision-relevant evaluation.

Significance. If the chosen utilities and priors are representative of actual deployment problems and the empirical results are robust, the work would provide a concrete way to make UQ evaluation decision-relevant rather than generic, addressing a recognized disconnect between calibration metrics and practical utility. The framing via external decision-theoretic concepts and the identification of a special class of proper scoring rules constitute a clear conceptual contribution; reproducible code or machine-checked elements are not mentioned.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim states that the proposed metrics 'consistently align with realized decision utility, while conventional metrics do not' across benchmarks and case studies, yet the abstract supplies no information on experimental design, data exclusion rules, error bars, statistical tests, or the elicitation procedure for the decision utilities and priors. Without these details the claim cannot be evaluated and may rest on post-hoc selection.
  2. [Framework definition and experimental setup] Framework definition and experimental setup: the decision-alignment criterion and prior-weighted utility metrics are defined relative to the authors' chosen utility functions and priors. The reported alignment therefore holds only inside those choices. No independent verification is described that the selected utilities match the true downstream problems encountered in deployment, nor are robustness results shown under plausible alternative utilities; this directly affects whether the generalization that 'many widely used uncertainty metrics are either misaligned ... or encode pathological prior beliefs' follows.
minor comments (1)
  1. [Abstract] The abstract refers to 'pathological prior beliefs' encoded by conventional metrics without giving concrete examples of those priors or the decision problems they distort; adding one or two explicit illustrations would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we address each major point directly, clarifying the intended scope of our claims while agreeing to revisions that improve transparency without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim states that the proposed metrics 'consistently align with realized decision utility, while conventional metrics do not' across benchmarks and case studies, yet the abstract supplies no information on experimental design, data exclusion rules, error bars, statistical tests, or the elicitation procedure for the decision utilities and priors. Without these details the claim cannot be evaluated and may rest on post-hoc selection.

    Authors: The abstract is intentionally concise to highlight the conceptual contribution. Full experimental details—including benchmark designs, case-study protocols, utility elicitation from domain experts, and statistical reporting—are provided in Sections 4–5 of the manuscript. We agree the abstract could better signal this scope and will revise it to note the use of standard benchmarks, real-world case studies with elicited utilities, and that results include variability measures across runs. The utilities were selected a priori from common decision tasks in the literature rather than post-hoc; the manuscript does not claim universality but shows consistent patterns across the chosen settings. revision: partial

  2. Referee: [Framework definition and experimental setup] Framework definition and experimental setup: the decision-alignment criterion and prior-weighted utility metrics are defined relative to the authors' chosen utility functions and priors. The reported alignment therefore holds only inside those choices. No independent verification is described that the selected utilities match the true downstream problems encountered in deployment, nor are robustness results shown under plausible alternative utilities; this directly affects whether the generalization that 'many widely used uncertainty metrics are either misaligned ... or encode pathological prior beliefs' follows.

    Authors: Decision alignment is defined relative to a utility and prior by construction; this is the central methodological point, as generic metrics cannot be expected to align with every possible downstream task. The utilities used are drawn from standard decision problems in classification/regression and from practitioner-elicited preferences in the case studies. While the paper does not claim these exhaust all deployment scenarios, it demonstrates misalignment of conventional metrics for these representative choices. We agree that explicit robustness checks under alternative utilities would strengthen the generalization claim and will add sensitivity analyses in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: framework derived from external decision theory

full rationale

The paper introduces a decision-alignment criterion and prior-weighted utility metrics as a class of proper scoring rules grounded in decision-theoretic concepts. No equations or definitions in the abstract reduce the proposed metrics to fitted parameters or experimental outcomes by construction. The claim that the new metrics align with realized decision utility while conventional ones do not is presented as an empirical finding from benchmarks and case studies, not a definitional tautology. The derivation chain relies on external decision theory rather than self-referential fitting or self-citation chains, making the result self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on standard decision theory and proper scoring rule properties drawn from prior literature; no free parameters or invented physical entities are introduced. The main new constructs are conceptual (decision-alignment test and prior-weighted metrics).

axioms (2)
  • domain assumption Downstream tasks admit well-defined utility functions that can be used to evaluate decisions
    Invoked when defining decision-alignment and the proposed metrics; central to the claim that conventional metrics are misaligned.
  • domain assumption Proper scoring rules can be adapted via prior weighting to produce decision-aligned evaluation
    Basis for proposing the new metric class.
invented entities (2)
  • decision-alignment criterion no independent evidence
    purpose: Test whether an uncertainty metric aligns with downstream decision utility
    New conceptual tool introduced to diagnose misalignment in existing metrics.
  • prior-weighted utility metrics no independent evidence
    purpose: Class of proper scoring rules for decision-aligned UQ evaluation
    Proposed as the solution class that satisfies the alignment criterion.

pith-pipeline@v0.9.1-grok · 5666 in / 1428 out tokens · 18295 ms · 2026-06-26T05:40:21.164873+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

101 extracted references · 27 canonical work pages

  1. [1]

    Sculley, Sebastian Nowozin, Joshua V

    Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V . Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. InProceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32. Curran Associa...

  2. [2]

    URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods

    Meet P Vadera, Jinyang Li, Adam D Cobb, Brian Jalaian, Tarek Abdelzaher, and Benjamin M Marlin. URSABench: A system for comprehensive benchmarking of bayesian deep neural net- work models and inference methods. InProceedings of Machine Learning and Systems, volume 4, pages 217–237, 2022. URL https://proceedings.mlsys.org/paper_files/paper/2022/file/ 4a420...

  3. [3]

    A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025

    Luis Basora, Arthur Viens, Manuel Arias Chao, and Xavier Olive. A benchmark on uncertainty quantifi- cation for deep learning prognostics.Reliability Engineering & System Safety, 253:110513, 2025. URL https://linkinghub.elsevier.com/retrieve/pii/S0951832024005854

  4. [4]

    Vickers and Elena B

    Andrew J. Vickers and Elena B. Elkin. Decision curve analysis: A novel method for evaluating prediction models.Medical Decision Making, 26(6):565–574, 2006. URL https://journals.sagepub.com/ doi/10.1177/0272989X06295361

  5. [5]

    Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M

    Michael W. Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M. Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213. ACM, 2020. URLhttps://dl.acm.org/doi/10.1145/3368555.3384457

  6. [6]

    Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025

    Luciana Ferrer and Daniel Ramos. Evaluating posterior probabilities: Decision theory, proper scoring rules, and calibration.Transactions on Machine Learning Research, 2025. URL https://openreview. net/forum?id=qbrE0LR7fF

  7. [7]

    Patterns, predictions, and actions: A story about machine learning,

    Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: A story about machine learning,

  8. [8]

    URLhttp://arxiv.org/abs/2102.05242

  9. [9]

    Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Qixuan Feng, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Faris Sbahi, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jaspe...

  10. [10]

    Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007. URL http://www.tandfonline. com/doi/abs/10.1198/016214506000001437

  11. [11]

    Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010

    Niko Brümmer. Measuring, refining and calibrating speaker and language in- formation extracted from speech, 2010. URL https://scispace.com/pdf/ measuring-refining-and-calibrating-speaker-and-language-53glvzuiuj.pdf . PhD Thesis

  12. [12]

    Romain Pic, Clément Dombry, Philippe Naveau, and Maxime Taillardat. Proper scoring rules for multi- variate probabilistic forecasts based on aggregation and transformation.Advances in Statistical Clima- tology, Meteorology and Oceanography, 11(1):23–58, 2025. URL https://ascmo.copernicus.org/ articles/11/23/2025/

  13. [13]

    The neu- ral testbed: Evaluating joint predictions

    Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Brendan O’Donoghue, Botao Hao, Dieterich Lawson, Morteza Ibrahimi, Xiuyuan Lu, Vikranth Dwaracherla, and Benjamin Van Roy. The neu- ral testbed: Evaluating joint predictions. InProceedings of the 36th International Conference on Neural Information Processing Systems, volume 35 ofNIPS ’22, pages 12554–12565. C...

  14. [14]

    Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014

    Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules.METRON, 72(2):169–183, 2014. URLhttp://link.springer.com/10.1007/s40300-014-0039-y

  15. [15]

    Leonard J. Savage. Elicitation of personal probabilities and expectations.Journal of the American Statistical Association, 66(336):783–801, 1971. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.1971. 10482346. URLhttp://www.tandfonline.com/doi/abs/10.1080/01621459.1971.10482346

  16. [16]

    Schervish

    Mark J. Schervish. A general method for comparing probability assessors.The Annals of Statistics, 17(4): 1856–1879, 1989. URLhttps://www.jstor.org/stable/2241668

  17. [17]

    Loss functions for binary class probability estimation and classification: Structure and applications, 2005

    Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications, 2005. URL http://stat.wharton.upenn.edu/~buja/ PAPERS/paper-proper-scoring.pdf

  18. [18]

    Reid and Robert C

    Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments.Journal of Machine Learning Research, 12:731–817, 2011. URL https://jmlr.csail.mit.edu/papers/ v12/reid11a.html

  19. [19]

    Werner Ehm, Tilmann Gneiting, Alexander Jordan, and Fabian Krüger. Of quantiles and expectiles: Consistent scoring functions, choquet representations and forecast rankings.Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3):505–562, 2016. URL https://academic.oup.com/ jrsssb/article/78/3/505/7040984

  20. [20]

    Well-calibrated regression uncertainty in medical imaging with deep learning

    Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Luder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. InProceedings of the Third Conference on Medical Imaging with Deep Learning, volume 121, pages 393–412. PMLR, 2020. URL https: //proceedings.mlr.press/v121/laves20a.html

  21. [21]

    Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023

    Ruotao Zhang, Constantine Gatsonis, and Jon Arni Steingrimsson. Role of calibration in uncertainty- based referral for deep learning.Statistical Methods in Medical Research, 32(5):927–943, 2023. URL https://journals.sagepub.com/doi/10.1177/09622802231158811

  22. [22]

    Gruber and Florian Buettner

    Sebastian G. Gruber and Florian Buettner. Better uncertainty calibration via proper scores for classification and beyond. In36th Conference on Neural Information Processing Systems, 2022. URL http://arxiv. org/abs/2203.07835

  23. [23]

    U-calibration: Forecasting for an unknown agent

    Robert Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. InProceedings of Machine Learning Research, volume 195, pages 1–3, 2023. URL https://proceedings.mlr.press/v195/kleinberg23a/kleinberg23a.pdf

  24. [24]

    Calibration error for decision making, 2024

    Lunjia Hu and Yifan Wu. Calibration error for decision making, 2024. URL http://arxiv.org/abs/ 2404.13503

  25. [25]

    Truthfulness of decision-theoretic calibration measures, 2025

    Mingda Qiao and Eric Zhao. Truthfulness of decision-theoretic calibration measures, 2025. URL http://arxiv.org/abs/2503.02384

  26. [26]

    Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025

    Mochen Yang and Xuan Bi. Cost-aware calibration of classifiers.INFORMS Journal on Data Science, 4 (2):101–113, 2025. URLhttps://pubsonline.informs.org/doi/10.1287/ijds.2024.0038

  27. [27]

    Jordan, and Aymeric Dieuleveut

    Mahmoud Hegazy, Michael I. Jordan, and Aymeric Dieuleveut. Scalable utility-aware multiclass calibration,

  28. [28]

    URLhttp://arxiv.org/abs/2510.25458

  29. [29]

    Rank-based decomposable losses in machine learning: A survey

    Shu Hu, Xin Wang, and Siwei Lyu. Rank-based decomposable losses in machine learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2023. URL https: //ieeexplore.ieee.org/document/10184478/

  30. [30]

    The foundations of cost-sensitive learning

    Charles Elkan. The foundations of cost-sensitive learning. InProceedings of the 17th International Joint Conference on Artificial Intelligence, volume 2 ofIJCAI’01, pages 973–978. Morgan Kaufmann Publishers Inc., 2001. URLhttps://dl.acm.org/doi/abs/10.5555/1642194.1642224

  31. [31]

    Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021

    Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. Quality metrics in recommender systems: Do we calculate metrics consistently? InFifteenth ACM Conference on Recommender Systems, pages 708–713, 2021. URLhttp://arxiv.org/abs/2206.12858

  32. [32]

    A no-regret generalization of hierarchical softmax to extreme multi-label classification

    Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, and Krzysztof Dembczy´nski. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In32nd Conference on Neural Information Processing Systems, 2018. URLhttp://arxiv.org/abs/1810.11671. 11

  33. [33]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 220–229, New York, NY , USA,

  34. [34]

    D., & Gebru, T

    Association for Computing Machinery. URLhttps://doi.org/10.1145/3287560.3287596

  35. [35]

    Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme

    Max Bruninx, Timothy Verstraeten, Jalal Kazempour, and Jan Helsen. Day-ahead bidding strategies for wind farm operators under a one-price balancing scheme. InProceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, pages 719–726. ACM, 2025. URL https: //dl.acm.org/doi/10.1145/3679240.3734639

  36. [36]

    Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation

    Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Pitfalls of epistemic uncertainty quantifica- tion through loss minimisation. InAdvances in Neural Information Processing Systems, volume 35, pages 29205–29216. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/bc1d640f841f752c689aae20b31198c1-Paper-Conference.pdf

  37. [37]

    On second-order scoring rules for epistemic uncertainty quantification

    Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. On second-order scoring rules for epistemic uncertainty quantification. InProceedings of the 40th International Conference on Machine Learning, volume 202, pages 2078–2091. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/ v202/bengs23a.html

  38. [38]

    Random-set neural networks

    Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Keivan K1 Shariatmadar, and Fabio Cuzzolin. Random-set neural networks. InInternational Conference on Learning Representations,

  39. [39]

    URLhttps://openreview.net/forum?id=pdjkikvCch

  40. [40]

    Shireen Kudukkil Manchingal, Andrew Bradley, Julian F. P. Kooij, Keivan K1 Shariatmadar, Neil Yorke- Smith, and Fabio Cuzzolin. Epistemic artificial intelligence is essential for machine learning models to truly’know when they do not know’, 2025. URLhttps://arxiv.org/abs/2505.04950

  41. [41]

    Allen Bradley and Stuart S

    A. Allen Bradley and Stuart S. Schwartz. Summary verification measures and their interpretation for ensemble forecasts.Monthly Weather Review, 139(9):3075–3089, 2011. URL https://journals. ametsoc.org/doi/10.1175/2010MWR3305.1

  42. [42]

    Aligning the evaluation of probabilistic predictions with downstream value, 2025

    Novin Shahroudi, Viacheslav Komisarenko, and Meelis Kull. Aligning the evaluation of probabilistic predictions with downstream value, 2025. URLhttp://arxiv.org/abs/2508.18251

  43. [43]

    Williamson

    Rabanus Derr and Robert C. Williamson. Forecast evaluation and the relationship of regret and calibration,

  44. [44]

    URLhttp://arxiv.org/abs/2401.14483

  45. [45]

    Aligning evaluation with clinical priorities: Calibration, label shift, and error costs

    Gerardo A Flores, Alyssa H Smith, Julia A Fukuyama, and Ashia C Wilson. Aligning evaluation with clinical priorities: Calibration, label shift, and error costs. InThe 39th Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=b7Ka31HvXU

  46. [46]

    Smith, Julia A

    Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A. Fukuyama, and Ashia C. Wilson. A con- sequentialist critique of binary classification evaluation: Theory, practice, and tools, 2026. URL http://arxiv.org/abs/2504.04528

  47. [47]

    Calibrated reliable regression using maximum mean discrepancy

    Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33 of NIPS ’20, pages 17164–17175. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/c74c4bf0dad9cbae3d80faa054b7d8...

  48. [48]

    Andrey Malinin, Neil Band, Ganshin, Alexander, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Roginskiy, Denis, Mariya Shmatova, Panos Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In35th Confere...

  49. [49]

    Towards clear expecta- tions for uncertainty estimation, 2022

    Victor Bouvier, Simona Maggio, Alexandre Abraham, and Léo Dreyfus-Schmidt. Towards clear expecta- tions for uncertainty estimation, 2022. URLhttp://arxiv.org/abs/2207.13341

  50. [50]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330. PMLR, 2017. URLhttps://proceedings.mlr.press/v70/guo17a.html

  51. [51]

    Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami

    Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami. Prior knowledge elicitation: The past, present, and future.Bayesian Analysis, 19(4):1129–1161,

  52. [52]

    URLhttps://doi.org/10.1214/23-BA1381. 12

  53. [53]

    Scalable variational gaussian process classification

    James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. InProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 351–360. PMLR, 2015. URL https://proceedings.mlr.press/v38/ hensman15.html

  54. [54]

    TabPFN: A transformer that solves small tabular classification problems in a second

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InThe 11th International Conference on Learning Representations, 2023. URLhttps://openreview.net/pdf?id=cp5PvcI6w8_

  55. [55]

    Revisiting deep learning models for tabular data

    Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. InAdvances in Neural Information Processing Systems, volume 34, pages 18932– 18943. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/file/9d86d83f925f2149e9edb0ac3b49229c-Paper.pdf

  56. [56]

    Bayan Bruss, and Tom Goldstein

    Gowthami Somepalli, Avi Schwarzschild, Micah Goldblum, C. Bayan Bruss, and Tom Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. In NeurIPS 2022 First Table Representation Workshop, 2022. URL https://openreview.net/forum? id=FiyUTAy4sB8

  57. [57]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/ paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf

  58. [58]

    NGBoost: Natural gradient boosting for probabilistic prediction

    Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K Thai, Sanjay Basu, Andrew Ng, and Alejandro Schuler. NGBoost: Natural gradient boosting for probabilistic prediction. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research. PMLR,

  59. [59]

    URLhttps://proceedings.mlr.press/v119/duan20a.html

  60. [60]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011. URLhttps://jmlr.org/papers/v12/pedregosa11a.html

  61. [61]

    Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration

    Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. InAdvances in Neural In- formation Processing Systems, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/ 27e8e17134dd7083b050476733207ea1-Abstract.html

  62. [62]

    The UCI machine learning repository, 2025

    Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository, 2025. URLhttps://archive.ics.uci.edu

  63. [63]

    Bank marketing, 2014

    S Moro, P Rita, and P Cortez. Bank marketing, 2014. URLhttps://doi.org/10.24432/C5K306

  64. [64]

    Heart disease, 1989

    Andras Janosi, William Steinbrunn, Matthias Pfisterer, and Robert Detrano. Heart disease, 1989. URL https://doi.org/10.24432/C52P4X

  65. [65]

    Ionosphere, 1989

    V Sigillito, S Wing, L Hutton, and K Baker. Ionosphere, 1989. URL https://doi.org/10.24432/ C5W01B

  66. [66]

    Mushroom, 1981

    Jeff Schlimmer. Mushroom, 1981. URLhttps://doi.org/10.24432/C5959T

  67. [67]

    Connectionist bench (sonar, mines vs

    T Sejnowski and R Gorman. Connectionist bench (sonar, mines vs. rocks), 1988. URL https://doi. org/10.24432/C5T01Q

  68. [68]

    Air quality, 2008

    S Vito. Air quality, 2008. URLhttps://doi.org/10.24432/C59K5F

  69. [69]

    Auto MPG, 1993

    R Quinlan. Auto MPG, 1993. URLhttps://doi.org/10.24432/C5859H

  70. [70]

    Energy efficiency, 2012

    Athanasios Tsanas and Angeliki Xifara. Energy efficiency, 2012. URL https://doi.org/10.24432/ C51307

  71. [71]

    Combined cycle power plant, 2014

    P Tfekci and H Kaya. Combined cycle power plant, 2014. URL https://doi.org/10.24432/C5002N

  72. [72]

    Wine quality, 2009

    P Cortez, A Cerdeira, F Almeida, T Matos, and J Reis. Wine quality, 2009. URL https://doi.org/10. 24432/C56S3T

  73. [73]

    Elia open data platform, 2025

    Elia. Elia open data platform, 2025. URLhttps://opendata.elia.be. 13

  74. [74]

    ENTSO-e transparency platform,

    European Network of Transmission System Operators for Electricity. ENTSO-e transparency platform,

  75. [75]

    URLhttps://transparency.entsoe.eu

  76. [76]

    Example-dependent cost-sensitive logistic regression for credit scoring

    Alejandro Correa Bahnsen, Djamia Aouada, and Bjorn Ottersten. Example-dependent cost-sensitive logistic regression for credit scoring. In13th International Conference on Machine Learning and Applications, pages 263–269. IEEE, 2014. URLhttp://ieeexplore.ieee.org/document/7033125/

  77. [77]

    Credit risk assessment on a private label credit card application, 2009

    PAKDD. Credit risk assessment on a private label credit card application, 2009. URL https://pakdd. org/archive/pakdd2009/front/show/competition.html. Data Mining Competition

  78. [78]

    GiveMeSomeCredit, 2011

    Will Cukierski. GiveMeSomeCredit, 2011. URL https://kaggle.com/competitions/ GiveMeSomeCredit. Credit Fusion

  79. [79]

    costcla, 2016

    Alejandro Correa Bahnsen. costcla, 2016. URL https://github.com/albahnsen/ CostSensitiveClassification

  80. [80]

    Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021

    Ajay Byanjankar, József Mezei, and Markku Heikkilä. Data-driven optimization of peer-to-peer lending portfolios based on the expected value framework.Intelligent Systems in Accounting, Finance and Management, 28(2):119–129, 2021. URL https://onlinelibrary.wiley.com/doi/10.1002/isaf. 1490

Showing first 80 references.