Rethinking Post-Hoc Calibration in Semantic Segmentation

Balint Kovacs (DKFZ); Kim-Celine Kahl (DKFZ); Klaus Maier-Hein (DKFZ); Maximilian R. Rokuss (DKFZ); Philippe Meyer (ICube); Sylvain Faisan (ICube); Tristan Kirscher (ICube); Xavier Coubez

arxiv: 2607.01902 · v1 · pith:7TZGZLFInew · submitted 2026-07-02 · 💻 cs.CV · cs.LG

Pith reviewed 2026-07-03 16:03 UTC · model grok-4.3

keywords: semantic segmentation · post-hoc calibration · confidence calibration · translation invariance · decision preservation · covariate shift · medical image segmentation

The pith

Post-hoc calibration for semantic segmentation improves when made invariant to logit shifts and required to preserve original decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two structural problems that arise when standard post-hoc calibration is applied to dense prediction models. Adding any constant to all logits leaves the softmax probabilities unchanged yet can alter the output of many common calibrators, so two representations of the same distribution produce different calibrated values. Fitting by a likelihood objective can also reorder classes and thereby change the argmax segmentation map even when the model was trained on task-specific metrics such as Dice. The authors therefore define translation-invariant calibrators, construct invariant counterparts of existing methods, and introduce class-conditional affine calibrators that can be constrained to keep the argmax or the ordering intact. Matched experiments on natural-image and medical benchmarks, including under covariate shift, show that the invariant versions generally improve calibration metrics while the decision-preserving versions avoid segmentation degradation.
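The translation dependence described above can be reproduced on a single logit vector. The sketch below is illustrative only: `vector_scaling`, the parameter values, and the mean-centering used in `centered_vector_scaling` are assumptions chosen for this example, and centering is just one simple way to remove the offset, not necessarily the construction used in the paper.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def vector_scaling(z, w, b):
    """Affine calibrator softmax(w * z + b) with per-class scale and bias."""
    return softmax([wk * zk + bk for wk, zk, bk in zip(w, z, b)])

def centered_vector_scaling(z, w, b):
    """Same calibrator on mean-centered logits (an assumed TI variant)."""
    mean = sum(z) / len(z)
    return vector_scaling([zk - mean for zk in z], w, b)

def max_abs_diff(p, q):
    return max(abs(a - c) for a, c in zip(p, q))

z = [2.0, 0.5, -1.0]
shifted = [zk + 10.0 for zk in z]           # same distribution, different offset
w, b = [1.5, 1.0, 0.7], [0.1, 0.0, -0.1]    # hypothetical fitted parameters

gap_softmax = max_abs_diff(softmax(z), softmax(shifted))
gap_vs = max_abs_diff(vector_scaling(z, w, b), vector_scaling(shifted, w, b))
gap_ti = max_abs_diff(centered_vector_scaling(z, w, b),
                      centered_vector_scaling(shifted, w, b))
print(gap_softmax, gap_vs, gap_ti)
```

With unequal per-class scales, the offset is multiplied by a different factor for each class and no longer cancels inside the softmax, which is why the second gap is nonzero while the first and third vanish.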

Core claim

Post-hoc calibration in semantic segmentation suffers from logit-translation dependence, in which an arbitrary additive offset changes the calibrated output even though the predictive distribution is identical, and from a likelihood-versus-metric mismatch that can alter class orderings. Translation-invariant calibrators eliminate the first dependence; class-conditional affine calibrators retain more expressivity than temperature scaling while permitting explicit argmax- or order-preservation constraints. Across benchmarks the translation-invariant variants improve calibration metrics and the decision-preserving variants prevent degradation of the segmentation map while retaining strong calibration performance.

What carries the argument

Translation-invariant (TI) calibrators whose outputs are unchanged under logit shifts, together with class-conditional affine calibrators that can be made argmax- or order-preserving.
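The page does not spell out how a class-conditional affine calibrator is parameterized, so the following is only a hypothetical sketch of what such a family could look like. The function `class_conditional_affine`, the parameters `scale` and `boost`, and the constraints `scale > 0`, `boost >= 0` are all assumptions introduced here: parameters are selected by the predicted class, logits are rescaled, and the predicted class receives a nonnegative bonus.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def class_conditional_affine(z, scale, boost):
    """Hypothetical calibrator whose parameters depend on the predicted class c.

    All logits are multiplied by scale[c] > 0 and the predicted class gets an
    extra boost[c] >= 0, so the top class stays on top and the relative order
    of the remaining classes is unchanged.
    """
    c = max(range(len(z)), key=lambda k: z[k])
    if scale[c] <= 0 or boost[c] < 0:
        raise ValueError("need scale > 0 and boost >= 0 to preserve the order")
    out = [scale[c] * zk for zk in z]
    out[c] += boost[c]
    return softmax(out)

def ranking(v):
    """Class indices sorted from lowest to highest score."""
    return sorted(range(len(v)), key=lambda k: v[k])

z = [0.2, 1.4, -0.3]
scale = [0.8, 0.5, 1.2]    # illustrative values, one per predicted class
boost = [0.0, 0.3, 0.1]

p = class_conditional_affine(z, scale, boost)
p_shift = class_conditional_affine([v + 7.0 for v in z], scale, boost)
print(ranking(z) == ranking(p))
```

Under these assumed constraints the map is order-preserving, and it is also unaffected by a uniform logit shift, because the shift changes neither the predicted class nor the differences between scaled logits.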

If this is right

TI calibrators yield identical calibrated probabilities for any two logit vectors that encode the same distribution.
Decision-preserving calibration leaves the original segmentation map unchanged.
Class-conditional affine calibrators allow a tunable trade-off between calibration quality and preservation of decisions.
The same patterns hold when models are evaluated under corruption-induced covariate shift.
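Whether a calibrator leaves the segmentation map unchanged can be checked directly by counting label flips. The sketch below uses toy per-pixel logits and hypothetical parameter values; `flip_rate`, `temperature_scaling`, and `vector_scaling` are names chosen for this example. Because softmax is monotone, the argmax is read from the calibrated logits without computing probabilities.

```python
def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])

def flip_rate(logits, calibrated):
    """Fraction of pixels whose predicted label changes after calibration."""
    flips = sum(argmax(z) != argmax(q) for z, q in zip(logits, calibrated))
    return flips / len(logits)

def temperature_scaling(z, T):
    """Calibrated logits z / T with T > 0; the class order cannot change."""
    return [zk / T for zk in z]

def vector_scaling(z, w, b):
    """Calibrated logits w * z + b; unequal scales can reorder classes."""
    return [wk * zk + bk for wk, zk, bk in zip(w, z, b)]

# Three toy pixels with three classes each.
pixels = [[2.0, 1.8, -1.0], [0.1, 0.3, 0.2], [-0.5, 1.0, 0.9]]

ts = [temperature_scaling(z, 2.0) for z in pixels]
vs = [vector_scaling(z, [0.5, 1.5, 1.0], [0.0, 0.0, 0.2]) for z in pixels]

ts_flips = flip_rate(pixels, ts)
vs_flips = flip_rate(pixels, vs)
print(ts_flips, vs_flips)
```

In this toy case temperature scaling flips nothing, while the unconstrained vector scaling flips the first pixel, where two classes are close and are scaled differently.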

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invariance and preservation requirements may be relevant for any dense prediction task whose output is an argmax over per-pixel distributions.
Standard practice for selecting calibrators in segmentation pipelines should include explicit checks for translation sensitivity.
The design principles could be used to construct hybrid methods that combine post-hoc adjustment with light fine-tuning while still respecting decision constraints.

Load-bearing premise

The observed gains arise primarily from removing translation dependence and enforcing decision preservation rather than from other experimental choices.

What would settle it

A controlled test in which a non-TI calibrator produces identical or better calibration metrics than its TI counterpart on the exact same logit outputs would undermine the claimed benefit of invariance.
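A matched comparison of this kind amounts to evaluating every calibrator on exactly the same logits and labels. The sketch below shows the shape of such a test with a standard top-label expected calibration error; the helper names, the equal-width binning, the bin count, and the toy data are assumptions for illustration and say nothing about the paper's actual numbers or its exact ECE variant.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ece(probs, labels, n_bins=15):
    """Top-label expected calibration error with equal-width confidence bins."""
    # Per bin: count, summed confidence, summed correctness.
    stats = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        i = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        stats[i][0] += 1
        stats[i][1] += conf
        stats[i][2] += float(pred == y)
    # sum_b (n_b / n) * |acc_b - conf_b| equals sum_b |hits_b - conf_sum_b| / n
    return sum(abs(hit - conf) for cnt, conf, hit in stats if cnt) / len(labels)

def matched_ece(logits, labels, calibrators, n_bins=15):
    """Score each calibrator on the same logits so differences are attributable to it."""
    return {name: ece([f(z) for z in logits], labels, n_bins)
            for name, f in calibrators.items()}

# Toy data only; real use would pass held-out per-pixel logits and labels.
logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.5, 0.0]]
labels = [0, 1, 1, 0]
calibrators = {
    "uncalibrated": softmax,
    "temperature_2": lambda z: softmax([v / 2.0 for v in z]),
}
scores = matched_ece(logits, labels, calibrators)
print(scores)
```

To test the invariance claim, the dictionary would hold a shift-sensitive calibrator and its TI counterpart, both fitted on the same calibration split.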

Figures

Figures reproduced from arXiv: 2607.01902 by Balint Kovacs (DKFZ), Kim-Celine Kahl (DKFZ), Klaus Maier-Hein (DKFZ), Maximilian R. Rokuss (DKFZ), Philippe Meyer (ICube), Sylvain Faisan (ICube), Tristan Kirscher (ICube), Xavier Coubez.

**Figure 1.** Challenges in Segmentation Calibration: Evidence from Real-World Data. (1) Translation invariance. As illustrated in panel (a), logit embeddings that differ only by a constant offset induce identical class probabilities under softmax, yet a calibrator may produce different outputs for these equivalent representations. Panel (b) shows, for two segmenters, spatial maps of the logit free energy, highlightin…

**Figure 2.** Sample inputs with ground-truth segmentation overlays from Massachusetts Roads, Cityscapes, …

**Figure 3.** Spatial maps of the pooled-logit free energy

**Figure 4.** Spatial patterns of label flips under MS(z̄) on BraTS and Cityscapes. Top row: BraTS example with (a) MRI slice, (b) ground-truth foreground mask (green), and (c) flipped voxels after calibration (red). Bottom row: Cityscapes example with (d) input image, (e) ground-truth semantic segmentation overlay, and (f) flipped pixels after calibration (red). The Cityscapes view is cropped for visualization. In both…

**Figure 5.** Calibration set size sensitivity. Effect of calibration set size n on NLL, ECE, BA-ECE, and ACE for Cityscapes and BraTS, reported as ∆ relative to the reference configuration used in the main experiments (n = 50 samples for both datasets). For computational efficiency, we consider a representative subset of ca…

**Figure 6.** Sensitivity of ECE and ACE to binning. We report, for each dataset, ECE with B ∈ {20, 30, 40, 50} and ACE for B ∈ {15, 20, 30, 40}. Error bars show 95% bootstrap confidence intervals for ECE and [min, max] over repeats for ACE. The ranking of methods is stable across bin counts, indicating that our conclusions are not driven by a particular choice of ECE/ACE discretization. ACE is more sensit…

**Figure 7.** Cityscapes corruption types. Example from the test set under strong severity corruption: (a) in-distribution (clean) image; (b) Gaussian noise; (c) Gaussian blur; (d) brightness shift; (e) JPEG compression; (f) fog/haze. All variants preserve pixel-level correspondence with the original labels.

**Figure 8.** Test-split reliability curves under the same convention as …
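Several captions above refer to a logit free energy. A common definition, assumed here with temperature 1, is the negative log-sum-exp of the logit vector; how the paper pools logits before computing it is not stated in the captions, so pooling is left out. The sketch shows why this quantity tracks the arbitrary offset: a uniform shift c moves the free energy by exactly -c while leaving the softmax untouched.

```python
import math

def free_energy(z):
    """Negative log-sum-exp of a logit vector, computed stably."""
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))

z = [2.0, 0.5, -1.0]
c = 3.0
shifted = [v + c for v in z]

# The shift changes the free energy by -c (up to rounding).
delta = free_energy(shifted) - free_energy(z)
print(delta)
```

A spatial map of this value over all pixels therefore visualizes the component of the logits that softmax discards but that a shift-sensitive calibrator can still react to.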

Abstract

Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. 
Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
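A short worked argument may help show why order preservation is so restrictive for affine softmax calibrators. It is a sketch under assumptions: the symbols w, b, β, and T are introduced here, only the diagonal (vector-scaling) case is treated, and the constraint is required to hold for every logit vector in R^K. The paper's own statement may cover a broader affine family.

```latex
% Assumed calibrator: g(z) = softmax(w \odot z + b), with w, b \in \mathbb{R}^K.
% Order preservation: z_i > z_j \implies w_i z_i + b_i > w_j z_j + b_j for all z.
\begin{align*}
\text{Set } z_i = t + \varepsilon,\ z_j = t,\ \varepsilon > 0:\quad
  & (w_i - w_j)\,t + w_i\,\varepsilon + (b_i - b_j) > 0
    \quad \forall\, t \in \mathbb{R},\ \forall\, \varepsilon > 0. \\
\text{Letting } t \to \pm\infty:\quad
  & w_i = w_j =: w. \\
\text{Letting } \varepsilon \to 0^{+}:\quad
  & b_i \ge b_j, \text{ and by symmetry } b_i = b_j =: \beta;
    \ \text{then } w\,\varepsilon > 0 \text{ gives } w > 0. \\
\text{Hence}\quad
  & g(z) = \operatorname{softmax}(w\,z + \beta\,\mathbf{1})
         = \operatorname{softmax}(z / T), \qquad T = 1/w .
\end{align*}
```

The last equality uses the shift invariance of softmax: the common bias β cancels, leaving a single positive temperature as the only free parameter.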

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper cleanly identifies logit translation dependence and decision-altering calibration as overlooked issues in segmentation, then builds TI and class-conditional affine fixes that improve metrics without hurting the output map.

The main things to know are that the output of common calibrators can change when you add a constant to all logits even though the softmax stays the same, and that fitting them on likelihood can reorder classes and degrade the segmentation. The authors define translation-invariant calibrators, characterize which standard ones already are, and build TI versions of the rest. Because argmax- and order-preservation constraints collapse affine softmax calibrators to temperature scaling, they also introduce class-conditional affine forms that can be constrained to keep argmax or ordering intact while staying more expressive than temperature scaling.

The matched experiments on natural-image and medical benchmarks, plus corruption shifts, show that TI versions generally improve calibration metrics and that the decision-preserving versions avoid segmentation drops while still calibrating well. That separation of concerns is the useful part.

The evidence for the two structural problems is direct and the constructions follow from the constraints they set, so the central claims hold up on the reported comparisons. A minor open question is how sensitive the gains are to the exact choice of base calibrator or to stronger distribution shifts, but the paper does not overclaim on that.

This is for people who deploy segmentation in medical or safety settings and need both reliable probabilities and unchanged decisions. It is worth sending to a serious referee because the fixes are concrete, the comparisons are matched, and the design principles are stated plainly.

Referee Report

1 major / 3 minor

Summary. The paper identifies two structural issues in post-hoc calibration for semantic segmentation: logit translation dependence (where standard calibrators can yield different outputs for equivalent predictive distributions under arbitrary logit offsets) and likelihood-vs-metric mismatch (where likelihood-based fitting can alter argmax decisions and degrade segmentation maps). It defines translation-invariant (TI) calibrators, characterizes which common ones satisfy the property, constructs TI counterparts, and introduces class-conditional affine calibrators that can be constrained to be argmax- or order-preserving (noting that standard affine softmax calibrators collapse to temperature scaling under these constraints). Matched comparisons on natural-image and medical benchmarks, and under corruption-based covariate shift, show that TI variants generally improve calibration metrics while decision-preserving variants avoid segmentation degradation and retain strong calibration performance, yielding practical design principles for calibration pipelines.

Significance. If the empirical results hold, the work supplies actionable guidance for reliable confidence estimation in dense prediction, particularly in safety-critical domains. The emphasis on representation invariance and decision preservation, combined with the use of matched comparisons and corruption shifts to isolate effects, strengthens the practical contribution; the paper also ships concrete constructions of TI and constrained calibrators that can be directly implemented.

major comments (1)

[Abstract and experimental results] The premise that the two identified structural issues are the primary overlooked problems (and that resolving them yields the reported gains) is load-bearing for the central claim in the abstract; a concrete test would be an ablation in the experiments section that compares the proposed TI and decision-preserving methods against a wider range of existing calibrators that do not explicitly target these issues, to quantify incremental benefit beyond standard baselines.

minor comments (3)

[Methods] Provide explicit mathematical definitions and pseudocode for the TI counterparts and the class-conditional affine calibrators (including how the argmax-/order-preservation constraints are enforced) to support reproducibility.
[Experiments] Report exact dataset names, corruption types, and quantitative effect sizes (with confidence intervals or statistical tests) for the calibration and segmentation metrics in all tables/figures, rather than qualitative statements such as 'generally improve'.
[Results] Clarify whether the reported improvements are consistent across all classes or driven by particular classes in the class-conditional setting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. The suggestion to strengthen the experimental section with broader ablations is well-taken and directly supports the central claims.

Referee: [Abstract and experimental results] The premise that the two identified structural issues are the primary overlooked problems (and that resolving them yields the reported gains) is load-bearing for the central claim in the abstract; a concrete test would be an ablation in the experiments section that compares the proposed TI and decision-preserving methods against a wider range of existing calibrators that do not explicitly target these issues, to quantify incremental benefit beyond standard baselines.

Authors: We agree that additional comparisons against a wider set of calibrators would further substantiate the incremental value of addressing translation invariance and decision preservation. Our existing matched experiments already isolate these effects against common baselines (temperature scaling, vector scaling, and their TI/decision-preserving variants) under covariate shift, showing consistent gains in calibration without segmentation degradation. To address the referee's point, we will expand the experiments section in the revision to include ablations against additional standard methods (e.g., isotonic regression, histogram binning, and Dirichlet calibration) that do not target the identified structural issues. This will quantify the benefit more comprehensively while preserving the paper's focus on the two structural problems.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on matched comparisons

The paper identifies two structural issues in post-hoc calibration for segmentation (logit translation dependence and likelihood-vs-metric mismatch), defines TI and decision-preserving variants, and reports empirical gains from matched comparisons on benchmarks. No derivation reduces a claimed result to its inputs by construction, no fitted parameter is renamed as a prediction, and no load-bearing premise collapses to a self-citation chain. The central results are externally falsifiable via the reported experiments rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities can be identified from the provided text.

[63]

Mnih, Volodymyr , title =

work page
[64]

2025 , eprint=

Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation , author=. 2025 , eprint=

work page 2025
[65]

2021 , pages =

Nature Methods , author =. 2021 , pages =

work page 2021
[66]

Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers , volume =

Zadrozny, Bianca and Elkan, Charles , year =. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers , volume =

work page
[67]

Verified Uncertainty Calibration , volume =

Kumar, Ananya and Liang, Percy S and Ma, Tengyu , booktitle =. Verified Uncertainty Calibration , volume =

work page
[68]

Cordts, Marius and Omran, Mohamed and Ramos, Sebastian and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and Schiele, Bernt , booktitle=. The

work page
[69]

Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset , author=. Eur. Conf. Comput. Vis. (ECCV) , year=

work page
[70]

2021 , eprint=

Should Ensemble Members Be Calibrated? , author=. 2021 , eprint=

work page 2021
[71]

Transactions on Machine Learning Research (TMLR) , issn=

On Joint Regularization and Calibration in Deep Ensembles , author=. Transactions on Machine Learning Research (TMLR) , issn=. 2025 , note=

work page 2025
[72]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Deep ensembles work, but are they necessary? , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page
[73]

, author=

The impact of averaging logits over probabilities on ensembles of neural networks. , author=. AISafety@ IJCAI , pages=

work page
[74]

Transactions on Machine Learning Research (TMLR) , year=

Where are we with calibration under dataset shift in image classification? , author=. Transactions on Machine Learning Research (TMLR) , year=

work page
[75]

Proceedings of the 25th

Optuna: A Next-generation Hyperparameter Optimization Framework , author=. Proceedings of the 25th

work page
[76]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[77]

2020 , organization=

Zhang, Jize and Kailkhura, Bhavya and Han, T Yong-Jin , booktitle=. 2020 , organization=

work page 2020
[78]

International Conference on Artificial Intelligence and Statistics , pages=

Non-parametric calibration for classification , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

work page 2020
[79]

Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

Zadrozny, Bianca and Elkan, Charles , title =. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2002 , isbn =

work page 2002
[80]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Obtaining Well Calibrated Probabilities Using. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2015 , month=

work page 2015
[81]

Advances in large margin classifiers , volume=

Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods , author=. Advances in large margin classifiers , volume=. 1999 , publisher=

work page 1999

Showing first 80 references.

[1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[2] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[3] Zhipeng Ding, Xu Han, Peirong Liu, and Marc Niethammer. Local Temperature Scaling for probability calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[4] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1994.

[5] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. In International Conference on Neural Information Processing Systems (NeurIPS), 2020.

[6] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1321–1330, 2017.

[7] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.

[8] Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, February 2021. ISSN 1548-7105.

[9] Kim-Celine Kahl, Carsten T Lüth, Maximilian Zenk, Klaus Maier-Hein, and Paul F Jaeger. ValUES: A framework for systematic validation of uncertainty estimation in semantic segmentation. In International Conference on Learning Representations (ICLR), 2024.

[10] Christoph Kamann and Carsten Rother. Benchmarking the robustness of semantic segmentation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8828–8838, 2020.

[11] Mijoo Kim and Junseok Kwon. Uncertainty calibration with energy based instance-wise scaling in the wild dataset. In Eur. Conf. Comput. Vis. (ECCV), 2024.

[12] Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[13] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2805–2814. PMLR, 2018.

[14] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[15] Wenhao Liang, Wei Zhang, Lin Yue, Miao Xu, Olaf Maennel, and Weitong Chen. We care each pixel: Calibrating medical segmentation models. In Proceedings of the 28th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025.

[16] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems (NeurIPS), 33:21464–21475, 2020.

[17] Alireza Mehrtash, William M. Wells, Clare M. Tempany, Purang Abolmaesumi, and Tina Kapur. Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging (T-MI), 39(12):3868–3878, 2020.

[18] Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, and Maurizio Filippone. Dirichlet-based Gaussian processes for large-scale calibrated classification. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.

[19] Volodymyr Mnih. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013.

[20] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H. S. Torr, and Puneet K. Dokania. Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems, volume 33, pp. 15288–15299, 2020.

[21] Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, volume 32, 2019.

[22] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[23] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), Feb. 2015.

[24] Adam Paszke, Sam Gross, and Francisco Massa. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[25] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.

[26] Rahul Rahaman and Alexandre H. Thiery. Uncertainty quantification and deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[27] Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Richard Hartley, and Byron Boots. Intra order-preserving functions for calibration of multi-class neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp. 13456–13467, 2020.

[28] Axel-Jan Rousseau, Thijs Becker, Simon Appeltans, Matthew Blaschko, and Dirk Valkenborg. Post hoc calibration of medical segmentation models. Discover Applied Sciences, 7(3):180, 2025.

[29] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In International Workshop on Deep Learning in Medical Image Analysis, pp. 240–248. Springer, 2017.

[30] Pauli Virtanen, Ralf Gommers, and Travis E. Oliphant. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.

[31] Dongdong Wang, Boqing Gong, and Liqiang Wang. On calibrating semantic segmentation models: Analyses and an algorithm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[32] Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pp. 178–190. PMLR, 2020.

[33] Bianca Zadrozny and Charles Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213, 2001.

[34] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pp. 694–699, New York, NY, USA, 2002. Association for Computing Machinery. ISBN 158113567X.

[35] Tal Zeevi, Eléonore V. Lieffrig, Lawrence H. Staib, and John A. Onofrey. Spatially-aware evaluation of segmentation uncertainty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025.

[36] Maximilian Zenk, David Zimmerer, Fabian Isensee, Jeremias Traub, Tobias Norajitra, Paul F. Jäger, and Klaus Maier-Hein. Comparative benchmarking of failure detection methods in medical image segmentation: Unveiling the role of confidence aggregation. Medical Image Analysis, 101:103392, 2025. ISSN 1361-8415.

[37] Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 11117–11128. PMLR, 2020.
