
arxiv: 2607.01902 · v1 · pith:7TZGZLFInew · submitted 2026-07-02 · 💻 cs.CV · cs.LG

Rethinking Post-Hoc Calibration in Semantic Segmentation

Pith reviewed 2026-07-03 16:03 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords semantic segmentation · post-hoc calibration · confidence calibration · translation invariance · decision preservation · covariate shift · medical image segmentation

The pith

Post-hoc calibration for semantic segmentation improves when made invariant to logit shifts and required to preserve original decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two structural problems that arise when standard post-hoc calibration is applied to dense prediction models. First, adding any constant to all logits leaves the softmax probabilities unchanged yet can alter the output of many common calibrators, so two representations of the same distribution produce different calibrated values. Second, calibrators are fitted with a likelihood objective, whereas segmentation models are trained with task-specific metrics such as Dice; this mismatch can reorder classes and thereby change the argmax segmentation map. The authors therefore define translation-invariant calibrators, construct invariant counterparts of existing methods, and introduce class-conditional affine calibrators that can be constrained to keep the argmax or the ordering intact. Matched experiments on natural-image and medical benchmarks, including under covariate shift, show that the invariant versions generally improve calibration metrics while the decision-preserving versions avoid segmentation degradation.
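The shift dependence described above can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: `vector_scaling` stands in for a shift-sensitive calibrator, and `centered_vector_scaling` is one assumed way to build an invariant counterpart (subtracting the per-pixel mean logit). All parameter values are made up for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vector_scaling(z, w, b):
    # Per-class scale and bias applied to the raw logits.
    return softmax(w * z + b)

def centered_vector_scaling(z, w, b):
    # Assumed invariant variant (not necessarily the paper's construction):
    # remove the per-pixel mean logit before applying the same parameters.
    zc = z - z.mean(axis=-1, keepdims=True)
    return softmax(w * zc + b)

z = np.array([[2.0, 0.5, -1.0]])   # logits for one pixel, three classes
w = np.array([1.0, 0.5, 2.0])      # hypothetical fitted scales
b = np.array([0.0, 0.1, -0.2])     # hypothetical fitted biases
shift = 3.0                        # arbitrary constant added to every logit

# Softmax itself ignores the shift.
same_input = np.allclose(softmax(z), softmax(z + shift))
# Largest change in calibrated probability caused by the shift.
vs_gap = np.abs(vector_scaling(z, w, b) - vector_scaling(z + shift, w, b)).max()
ti_gap = np.abs(centered_vector_scaling(z, w, b)
                - centered_vector_scaling(z + shift, w, b)).max()
```

Under these assumptions, `same_input` is true, `vs_gap` is clearly nonzero, and `ti_gap` is zero up to floating-point error.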

Core claim

Post-hoc calibration in semantic segmentation suffers from logit-translation dependence, in which an arbitrary additive offset changes the calibrated output even though the predictive distribution is identical, and from a likelihood-versus-metric mismatch that can alter class orderings. Translation-invariant calibrators eliminate the first dependence; class-conditional affine calibrators retain more expressivity than temperature scaling while permitting explicit argmax- or order-preservation constraints. Across benchmarks, the translation-invariant variants generally improve calibration metrics and the decision-preserving variants prevent degradation of the segmentation map while retaining strong calibration performance.

What carries the argument

Translation-invariant (TI) calibrators whose outputs are unchanged under logit shifts, together with class-conditional affine calibrators that can be made argmax- or order-preserving.
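In symbols, the invariance property can be written as follows. The notation is assumed here rather than taken from the paper: $z \in \mathbb{R}^K$ is a pixel's logit vector, $\sigma$ is softmax, $\mathbf{1}$ is the all-ones vector, and $g$ is a calibrator. The two worked cases are standard temperature scaling and vector scaling.

```latex
\begin{align*}
&\text{TI:} && g(z + c\,\mathbf{1}) = g(z) \quad \forall\, c \in \mathbb{R},\ z \in \mathbb{R}^K \\
&\text{Temperature scaling:} && \sigma\!\left(\tfrac{z + c\,\mathbf{1}}{T}\right)
   = \sigma\!\left(\tfrac{z}{T} + \tfrac{c}{T}\,\mathbf{1}\right)
   = \sigma\!\left(\tfrac{z}{T}\right) \\
&\text{Vector scaling:} && \sigma\big(w \odot (z + c\,\mathbf{1}) + b\big)
   = \sigma\big(w \odot z + b + c\,w\big)
\end{align*}
```

The last expression equals $\sigma(w \odot z + b)$ for every $c$ only when all entries of $w$ are equal, since softmax is unchanged only by adding a multiple of $\mathbf{1}$.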

If this is right

  • TI calibrators yield identical calibrated probabilities for any two logit vectors that encode the same distribution.
  • Decision-preserving calibration leaves the original segmentation map unchanged.
  • Class-conditional affine calibrators allow a tunable trade-off between calibration quality and preservation of decisions.
  • The same patterns hold when models are evaluated under corruption-induced covariate shift.
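Whether a calibrator preserves decisions can be checked directly by counting label flips. A minimal sketch on synthetic logits follows; `flip_rate` is a hypothetical helper, and the temperature, scales, and biases are invented values rather than fitted ones. Because softmax is monotone, comparing the argmax of the transformed logits is equivalent to comparing the argmax of the calibrated probabilities.

```python
import numpy as np

def flip_rate(logits, calibrated_scores):
    # Fraction of pixels whose argmax class changes after calibration.
    before = logits.argmax(axis=-1)
    after = calibrated_scores.argmax(axis=-1)
    return float((before != after).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 64, 4))   # synthetic H x W x K logit map

# Temperature scaling: dividing by T > 0 keeps the class order at every pixel.
temperature = 1.7                        # hypothetical fitted value
ts_flips = flip_rate(logits, logits / temperature)

# Per-class scale and bias (vector scaling) can reorder classes.
w = np.array([1.0, 0.4, 1.6, 0.9])       # hypothetical fitted scales
b = np.array([0.0, 0.8, -0.5, 0.3])      # hypothetical fitted biases
vs_flips = flip_rate(logits, w * logits + b)
```

In this toy setting `ts_flips` is zero while `vs_flips` is positive, which is the kind of segmentation-map change that a decision-preservation constraint is meant to rule out.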

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same invariance and preservation requirements may be relevant for any dense prediction task whose output is an argmax over per-pixel distributions.
  • Standard practice for selecting calibrators in segmentation pipelines should include explicit checks for translation sensitivity.
  • The design principles could be used to construct hybrid methods that combine post-hoc adjustment with light fine-tuning while still respecting decision constraints.

Load-bearing premise

The observed gains arise primarily from removing translation dependence and enforcing decision preservation rather than from other experimental choices.

What would settle it

A controlled test in which a non-TI calibrator produces identical or better calibration metrics than its TI counterpart on the exact same logit outputs would undermine the claimed benefit of invariance.

Figures

Figures reproduced from arXiv: 2607.01902 by Balint Kovacs (DKFZ), Kim-Celine Kahl (DKFZ), Klaus Maier-Hein (DKFZ), Maximilian R. Rokuss (DKFZ), Philippe Meyer (ICube), Sylvain Faisan (ICube), Tristan Kirscher (ICube), Xavier Coubez.

Figure 1
Figure 1: Challenges in Segmentation Calibration: Evidence from Real-World Data. (1) Translation invariance. As illustrated in panel (a), logit embeddings that differ only by a constant offset induce identical class probabilities under softmax, yet a calibrator may produce different outputs for these equivalent representations. Panel (b) shows, for two segmenters, spatial maps of the logit free energy, highlightin…
Figure 2
Figure 2: Sample inputs with ground-truth segmentation overlays from Massachusetts Roads, Cityscapes, …
Figure 3
Figure 3: Spatial maps of the pooled-logit free energy …
Figure 4
Figure 4: Spatial patterns of label flips under MS(z̄) on BraTS and Cityscapes. Top row: BraTS example with (a) MRI slice, (b) ground-truth foreground mask (green), and (c) flipped voxels after calibration (red). Bottom row: Cityscapes example with (d) input image, (e) ground-truth semantic segmentation overlay, and (f) flipped pixels after calibration (red). The Cityscapes view is cropped for visualization. In both…
Figure 5
Figure 5: Calibration set size sensitivity. Effect of calibration set size n on NLL, ECE, BA-ECE, and ACE for Cityscapes and BraTS, reported as ∆ relative to the n = 50 reference configuration used in the main experiments.
Figure 6
Figure 6: Sensitivity of ECE and ACE to binning. We report, for each dataset, ECE with B ∈ {20, 30, 40, 50} and ACE for B ∈ {15, 20, 30, 40}. Error bars show 95% bootstrap confidence intervals for ECE and [min, max] over repeats for ACE. The ranking of methods is stable across bin counts, indicating that our conclusions are not driven by a particular choice of ECE/ACE discretization.
Figure 7
Figure 7: Cityscapes corruption types. Example from the test set under strong severity corruption: (a) in-distribution (clean) image; (b) Gaussian noise; (c) Gaussian blur; (d) brightness shift; (e) JPEG compression; (f) fog/haze. All variants preserve pixel-level correspondence with the original labels.
Figure 8
Figure 8 reports test-split reliability curves under the same convention as …
Original abstract

Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. 
Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
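The abstract does not define its calibration metrics; the figure captions on this page name NLL, ECE, BA-ECE, and ACE. As a point of reference, here is a minimal sketch of a standard top-label ECE with equal-width bins. It is an assumption that the paper's ECE follows this common form, and `expected_calibration_error` is a hypothetical helper name. The default of 20 bins matches one of the bin counts listed in the Figure 6 caption.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=20):
    # probs: (N, K) predicted probabilities; labels: (N,) integer class ids.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to a right-closed bin with index in [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            # Bin weight times the gap between accuracy and mean confidence.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

# Four pixels predicted as class 0 with confidence 0.8; half are correct.
probs = np.array([[0.8, 0.2]] * 4)
labels = np.array([0, 0, 1, 1])
ece = expected_calibration_error(probs, labels)
```

In the toy example the accuracy is 0.5 and the confidence is 0.8, so the ECE is about 0.3. For segmentation, the per-pixel probabilities would be flattened to shape `(N, K)` before calling the function.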

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper identifies two structural issues in post-hoc calibration for semantic segmentation: logit translation dependence (where standard calibrators can yield different outputs for equivalent predictive distributions under arbitrary logit offsets) and likelihood-vs-metric mismatch (where likelihood-based fitting can alter argmax decisions and degrade segmentation maps). It defines translation-invariant (TI) calibrators, characterizes which common ones satisfy the property, constructs TI counterparts, and introduces class-conditional affine calibrators that can be constrained to be argmax- or order-preserving (noting that standard affine softmax calibrators collapse to temperature scaling under these constraints). Matched comparisons on natural-image and medical benchmarks, including under corruption-based covariate shift, show that TI variants generally improve calibration metrics while decision-preserving variants avoid segmentation degradation and retain strong calibration performance, yielding practical design principles for calibration pipelines.

Significance. If the empirical results hold, the work supplies actionable guidance for reliable confidence estimation in dense prediction, particularly in safety-critical domains. The emphasis on representation invariance and decision preservation, combined with the use of matched comparisons and corruption shifts to isolate effects, strengthens the practical contribution; the paper also ships concrete constructions of TI and constrained calibrators that can be directly implemented.

major comments (1)
  1. [Abstract and experimental results] The premise that the two identified structural issues are the primary overlooked problems (and that resolving them yields the reported gains) is load-bearing for the central claim in the abstract; a concrete test would be an ablation in the experiments section that compares the proposed TI and decision-preserving methods against a wider range of existing calibrators that do not explicitly target these issues, to quantify incremental benefit beyond standard baselines.
minor comments (3)
  1. [Methods] Provide explicit mathematical definitions and pseudocode for the TI counterparts and the class-conditional affine calibrators (including how the argmax-/order-preservation constraints are enforced) to support reproducibility.
  2. [Experiments] Report exact dataset names, corruption types, and quantitative effect sizes (with confidence intervals or statistical tests) for the calibration and segmentation metrics in all tables/figures, rather than qualitative statements such as 'generally improve'.
  3. [Results] Clarify whether the reported improvements are consistent across all classes or driven by particular classes in the class-conditional setting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. The suggestion to strengthen the experimental section with broader ablations is well-taken and directly supports the central claims.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The premise that the two identified structural issues are the primary overlooked problems (and that resolving them yields the reported gains) is load-bearing for the central claim in the abstract; a concrete test would be an ablation in the experiments section that compares the proposed TI and decision-preserving methods against a wider range of existing calibrators that do not explicitly target these issues, to quantify incremental benefit beyond standard baselines.

    Authors: We agree that additional comparisons against a wider set of calibrators would further substantiate the incremental value of addressing translation invariance and decision preservation. Our existing matched experiments already isolate these effects against common baselines (temperature scaling, vector scaling, and their TI/decision-preserving variants) under covariate shift, showing consistent gains in calibration without segmentation degradation. To address the referee's point, we will expand the experiments section in the revision to include ablations against additional standard methods (e.g., isotonic regression, histogram binning, and Dirichlet calibration) that do not target the identified structural issues. This will quantify the benefit more comprehensively while preserving the paper's focus on the two structural problems. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on matched comparisons

full rationale

The paper identifies two structural issues in post-hoc calibration for segmentation (logit translation dependence and likelihood-vs-metric mismatch), defines TI and decision-preserving variants, and reports empirical gains from matched comparisons on benchmarks. No derivation reduces a claimed result to its inputs by construction, no fitted parameter is renamed as a prediction, and no load-bearing premise collapses to a self-citation chain. The central results are externally falsifiable via the reported experiments rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5880 in / 1018 out tokens · 30480 ms · 2026-07-03T16:03:16.688943+00:00 · methodology

