Interpretable Probabilistic Medical Image Segmentation via Gaussian Process with Explicit Modelling of Annotation Bias and Variability
Pith reviewed 2026-06-26 09:15 UTC · model grok-4.3
The pith
A stochastic variational Gaussian process decomposes image logits into a reference distribution plus explicit annotator bias and variance to improve uncertainty calibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicitly modelling annotator-specific perturbations as bias and variance parameters in logit space, via a stochastic variational Gaussian process, yields improved uncertainty calibration while preserving segmentation accuracy comparable to state-of-the-art implicit multi-rater methods, and that the learned parameters quantitatively reflect annotator-specific behaviour.
What carries the argument
Stochastic variational Gaussian process that decomposes logits into an image-dependent reference distribution plus additive annotator-specific bias and variance perturbations
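The additive structure can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes a per-pixel Gaussian reference logit and, for each annotator, a scalar bias and a scalar noise standard deviation. The function name `annotator_probs` and all parameter names are hypothetical.

```python
import numpy as np

def annotator_probs(ref_mean, ref_std, bias, noise_std, n_samples=2000, seed=0):
    """Monte Carlo foreground probability for one annotator.

    Assumed model (illustrative only): logit = f + bias + eps,
    with f ~ N(ref_mean, ref_std^2) and eps ~ N(0, noise_std^2).
    """
    rng = np.random.default_rng(seed)
    ref_mean = np.asarray(ref_mean, dtype=float)
    ref_std = np.asarray(ref_std, dtype=float)
    # Independent Gaussian terms: their variances add in logit space.
    total_std = np.sqrt(ref_std**2 + noise_std**2)
    noise = rng.standard_normal((n_samples,) + ref_mean.shape)
    logits = ref_mean + bias + total_std * noise
    # Average the sigmoid over samples to get the predictive probability.
    return (1.0 / (1.0 + np.exp(-logits))).mean(axis=0)
```

Calling `annotator_probs([0.0, 1.0], [0.5, 0.5], bias=1.0, noise_std=0.3)` returns one probability per pixel. Because bias and variance are separate arguments, each can be varied while the reference distribution is held fixed, which is the kind of inspection the explicit decomposition is meant to allow.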
If this is right
- Uncertainty calibration improves relative to implicit multi-rater probabilistic segmentation methods.
- Learned bias and variance parameters quantitatively track annotator-specific behaviour.
- Controlled changes to annotator parameters produce systematic, predictable shifts in predictive performance.
- Segmentation accuracy remains comparable to current state-of-the-art implicit approaches.
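One way to see why parameter changes should shift predictions predictably is the standard probit-style approximation for a sigmoid averaged over a Gaussian logit, E[sigmoid(z)] ≈ sigmoid(m / sqrt(1 + πv/8)) for z ~ N(m, v). The sketch below applies it under the assumption that annotator bias adds to the mean and annotator variance adds to the variance; whether the paper uses this approximation is not stated here, and `approx_foreground_prob` is a hypothetical name.

```python
import math

def approx_foreground_prob(ref_mean, ref_var, bias=0.0, noise_var=0.0):
    """Approximate E[sigmoid(z)] for z ~ N(ref_mean + bias, ref_var + noise_var)."""
    mean = ref_mean + bias        # assumed: bias shifts the logit mean
    var = ref_var + noise_var     # assumed: annotator variance adds to logit variance
    kappa = 1.0 / math.sqrt(1.0 + math.pi * var / 8.0)
    return 1.0 / (1.0 + math.exp(-kappa * mean))

# Sweep bias and variance for a single pixel with a fixed reference distribution.
for bias in (-1.0, 0.0, 1.0):
    for noise_var in (0.0, 4.0):
        p = approx_foreground_prob(1.0, 0.5, bias, noise_var)
        print(f"bias={bias:+.1f} noise_var={noise_var:.1f} -> p={p:.3f}")
```

Under this approximation, a larger bias moves the probability monotonically toward foreground, while a larger variance pulls it toward 0.5.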
Where Pith is reading between the lines
- The explicit parameters could support data curation strategies that weight or filter annotations according to measured bias and variance.
- The same decomposition might be tested on other multi-annotator tasks such as bounding-box detection or text labelling where rater effects are also present.
- Making the bias or variance terms themselves mildly image-dependent could be explored as a direct next step without abandoning the additive structure.
Load-bearing premise
Annotator-specific effects can be captured by additive bias and variance parameters in logit space that remain independent of image content and can be cleanly separated by the Gaussian process from the reference distribution.
What would settle it
If the method shows no improvement in uncertainty calibration metrics over implicit baselines or if the learned bias and variance values fail to correlate with measured annotator behaviour on the multi-annotator dataset, the central claim would be falsified.
Original abstract
Deep learning-based medical image segmentation models are trained using annotations that exhibit systematic bias and variability across raters. While probabilistic multi-rater approaches can emulate annotator-specific delineations, annotator characteristics are typically encoded implicitly in a deep latent feature space, making direct analysis of their influence on predictive distributions less straightforward. We propose a logit-space probabilistic segmentation framework based on a stochastic variational Gaussian process that explicitly decomposes predictions into an image-dependent reference logit distribution and annotator-specific perturbations parameterised by bias and variance. This formulation enables more explicit analysis of how intra- and inter-rater variability propagate to predictive distributions. We evaluate the method on a multi-annotator medical image dataset; the results show that explicitly modelling annotator-specific perturbations improves uncertainty calibration while maintaining comparable segmentation accuracy relative to a state-of-the-art multi-rater probabilistic segmentation method. The learned bias and variance parameters quantitatively reflect annotator-specific behaviour. Furthermore, controlled perturbation experiments over bias and variance demonstrate how changes in annotator parameters systematically influence predictive performance. The code used in this paper is publicly available at https://github.com/QiLi111/GPS-Var.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a logit-space probabilistic segmentation framework using a stochastic variational Gaussian process (SVGP) that explicitly decomposes predictions into an image-dependent reference logit distribution and annotator-specific perturbations parameterised by bias and variance. It claims this yields improved uncertainty calibration while maintaining comparable segmentation accuracy versus state-of-the-art multi-rater probabilistic methods, with the learned parameters quantitatively reflecting annotator behaviour. The claims are supported by evaluation on a multi-annotator medical image dataset and by controlled perturbation experiments. Public code is released.
Significance. If the claims hold, the explicit additive decomposition in logit space provides a more interpretable handle on how intra- and inter-rater variability propagates to predictive distributions than implicit encoding in deep latent features. The public code release is a clear strength for reproducibility and further analysis.
minor comments (3)
- [Abstract] Abstract and §4: the claim of 'improved uncertainty calibration' would be strengthened by explicit reporting of the calibration metric (e.g., ECE) and the precise SOTA baseline used, together with any statistical significance tests.
- [§3] §3: the notation for the reference distribution versus annotator perturbations should be introduced with a clear diagram or equation block to aid readers in following the SVGP decomposition.
- The controlled perturbation experiments are described only at a high level; a table or figure summarizing the systematic changes in performance metrics as bias/variance parameters are varied would improve clarity.
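For readers unfamiliar with the calibration metric mentioned in the first comment, expected calibration error (ECE) can be computed as below. This is a minimal sketch for binary per-pixel predictions with equal-width confidence bins; the metric and binning actually used in the paper are not stated here, so the function and its defaults are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-size-weighted mean of |accuracy - confidence|."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1].
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

A lower value means predicted confidences match observed accuracy more closely. Reporting this number alongside the baseline's would make the calibration claim directly checkable.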
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. The report contains no specific major comments to address.
Circularity Check
No significant circularity identified
The paper extends standard stochastic variational Gaussian Process (SVGP) machinery by adding explicit additive bias and variance parameters in logit space to decompose reference distribution from annotator effects. Central claims concern empirical improvements in uncertainty calibration on a multi-annotator dataset, with learned parameters reflecting annotator behavior; these are validated experimentally rather than derived by construction from fitted inputs. No load-bearing steps reduce to self-definition, fitted predictions renamed as outputs, or self-citation chains. The framework remains self-contained against external benchmarks.