Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models
Pith reviewed 2026-05-10 15:59 UTC · model grok-4.3
The pith
A gradient-variance method detects erroneous ground truth labels during training of echocardiography segmentation models and refurbishes them with pseudo-labels to improve accuracy when errors are high.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Variance of Gradients proved highly effective at identifying erroneous ground truth labels while training proceeded. A standard U-Net retained strong segmentation performance under random label errors and under systematic errors up to 50 percent. Applying the detection step followed by pseudo-label refurbishment produced further gains, especially in the high-error regimes.
What carries the argument
Variance of Gradients (VOG) detection of label errors paired with pseudo-labelling for refurbishment
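As a rough illustration of the detection idea, the sketch below computes a VOG-style score for each training sample. It assumes that input-gradient maps have already been saved at several training checkpoints. The array layout, the function names `vog_scores` and `flag_suspects`, and the rule of flagging the highest-scoring fraction are illustrative assumptions, not the paper's stated procedure; the referee report notes that the exact layers and aggregation are not specified.

```python
import numpy as np


def vog_scores(grads):
    """Return one VOG-style score per sample.

    grads: array of shape (K, N, H, W) holding, for each of K checkpoints,
    a gradient map for each of N samples (hypothetical layout).
    """
    grads = np.asarray(grads, dtype=np.float64)
    # Population variance of each pixel's gradient across the K checkpoints.
    per_pixel_var = grads.var(axis=0)  # shape (N, H, W)
    # Average the per-pixel variances to get a single score per sample.
    return per_pixel_var.mean(axis=(1, 2))  # shape (N,)


def flag_suspects(scores, fraction):
    """Return sorted indices of the top `fraction` of samples by score.

    Assumption: a higher score marks a more suspicious label.
    """
    scores = np.asarray(scores)
    n_flag = int(round(fraction * len(scores)))
    order = np.argsort(scores)[::-1]  # indices in descending score order
    return np.sort(order[:n_flag])
```

With this layout, a sample whose gradient map barely changes between checkpoints scores near zero, while one whose gradients keep shifting scores higher and is flagged first.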
If this is right
- VOG-based detection can locate bad labels without requiring a separate clean validation set.
- Refurbishing flagged labels raises segmentation accuracy most noticeably once error rates become high.
- A standard U-Net already tolerates random errors and moderate systematic errors without any correction step.
- The approach reduces the performance penalty that would otherwise arise from imperfect manual annotations.
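The refurbishment step in the list above can be pictured as swapping a flagged label for the model's own prediction. This is a minimal sketch, assuming softmax outputs are available for every training image and that a flagged label map is replaced wholesale by the argmax prediction. The function name `refurbish_labels`, the array shapes, and the absence of any confidence threshold are assumptions, since the paper's exact rule is not given here.

```python
import numpy as np


def refurbish_labels(labels, probs, suspect_idx):
    """Replace suspected label maps with pseudo-labels.

    labels:      (N, H, W) integer class maps (possibly erroneous).
    probs:       (N, C, H, W) per-class probabilities from the current model.
    suspect_idx: indices of samples flagged by the detector.
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs)
    refurbished = labels.copy()          # leave the original labels untouched
    pseudo = probs.argmax(axis=1)        # (N, H, W) most likely class per pixel
    idx = np.asarray(suspect_idx, dtype=int)
    refurbished[idx] = pseudo[idx].astype(labels.dtype)
    return refurbished
```

Unflagged samples keep their original annotation, so the quality of the result depends on how well the detector separates clean labels from erroneous ones.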
Where Pith is reading between the lines
- The same detection-plus-refurbishment loop could be tested on other medical segmentation tasks where annotation noise is common.
- Combining VOG detection with active learning might further reduce the total number of expert annotations needed.
- Clinical deployment would still require checking whether the method flags the same error patterns that appear in multi-center or multi-observer datasets.
Load-bearing premise
The three kinds of simulated label errors accurately represent the annotation mistakes that occur in real clinical echocardiography data.
What would settle it
Apply the VOG detection and pseudo-label refurbishment pipeline to a set of echocardiography images whose ground-truth labels have been independently re-annotated by multiple experts and check whether the measured improvement matches the gains seen with simulated errors.
Original abstract
Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the robustness of U-Net models for echocardiography segmentation to ground truth label errors on the CAMUS dataset. It simulates three types of errors (random and systematic), compares a loss-based detection approach to one using Variance of Gradients (VOG), and proposes a pseudo-labelling refurbishment strategy for suspected erroneous labels. The central claims are that VOG is highly effective at flagging errors during training, standard models remain robust to random errors and moderate systematic errors (up to 50%), and the detection-plus-refurbishment pipeline yields performance gains especially at high error rates.
Significance. If the results hold beyond the simulations, the work could help mitigate the impact of annotation noise in clinical deep learning pipelines, reducing reliance on perfectly curated ground truth and improving model reliability in echocardiography. The use of a public dataset supports reproducibility, and the during-training detection approach is practically appealing. However, the significance depends on whether the simulated errors capture real clinical annotation patterns.
major comments (2)
- [Error simulation and evaluation sections] The claims about VOG's high effectiveness in flagging erroneous GT labels and the benefits of pseudo-label refurbishment rest entirely on experiments with three simulated error types. No validation against actual clinician-annotated errors (e.g., boundary shifts due to acoustic shadowing or consistent under-segmentation) is provided, which is load-bearing for generalizability and the reported robustness thresholds (up to 50%).
- [Results section] The abstract states positive outcomes for detection and refurbishment but the provided description lacks specific quantitative metrics (Dice/IoU scores, detection precision/recall, ablation results, or statistical tests). This makes it impossible to verify the magnitude of improvements or the cross-error-level claims without the full tables and figures.
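For reference, the overlap metrics named in the comment above are conventionally computed as follows. This is a generic sketch, not the paper's evaluation code. The function name `dice_and_iou` and the convention of returning 1.0 when a class is absent from both masks are assumptions.

```python
import numpy as np


def dice_and_iou(pred, target, label):
    """Dice and IoU for one class between two integer label maps."""
    p = np.asarray(pred) == label
    t = np.asarray(target) == label
    inter = np.logical_and(p, t).sum()
    total = p.sum() + t.sum()      # |P| + |T|
    union = total - inter          # |P union T|
    # Assumed convention: a class absent from both masks counts as perfect.
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so reporting both mainly helps comparison with other studies.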
minor comments (2)
- [Abstract] Adding one or two key numerical results (e.g., detection accuracy or Dice improvement at high error rates) would make the summary more informative.
- [Methods] The exact computation of VOG (e.g., which layers, how variance is aggregated) and the pseudo-labelling threshold should be stated more explicitly for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.
- Referee: Error simulation and evaluation sections: The claims about VOG's high effectiveness in flagging erroneous GT labels and the benefits of pseudo-label refurbishment rest entirely on experiments with three simulated error types. No validation against actual clinician-annotated errors (e.g., boundary shifts due to acoustic shadowing or consistent under-segmentation) is provided, which is load-bearing for generalizability and the reported robustness thresholds (up to 50%).
  Authors: We acknowledge that our evaluation relies exclusively on three types of simulated label errors rather than real clinician annotations. This controlled simulation approach enables precise quantification of error impact and method performance across error rates, which would be difficult with uncontrolled real-world noise. We agree this limits direct claims about clinical annotation patterns. In the revised manuscript we will expand the Discussion and add a dedicated Limitations subsection that explicitly relates the simulated errors to common clinical issues (e.g., boundary shifts from shadowing) and states that validation on expert-verified erroneous labels remains future work. The current results still demonstrate the relative effectiveness of VOG versus loss-based detection under known noise conditions.
  revision: partial
- Referee: Results section: The abstract states positive outcomes for detection and refurbishment but the provided description lacks specific quantitative metrics (Dice/IoU scores, detection precision/recall, ablation results, or statistical tests). This makes it impossible to verify the magnitude of improvements or the cross-error-level claims without the full tables and figures.
  Authors: The full manuscript (Section 4 and associated tables/figures) already reports the requested quantitative results: Dice and IoU scores across all error types and rates, detection precision/recall for VOG versus the loss-based baseline, ablation experiments isolating the refurbishment step, and statistical tests (paired t-tests with p-values). The robustness threshold of 50% for systematic errors and the performance gains from refurbishment at high error rates are directly supported by these metrics. To improve accessibility we will revise the abstract to include a concise summary of the key numerical findings (e.g., typical Dice improvement after refurbishment at 70% error).
  revision: yes
- Empirical validation against a dataset containing actual clinician-annotated ground-truth errors (as opposed to simulated errors)
Circularity Check
No circularity: empirical evaluation on simulated errors is self-contained
full rationale
The paper reports experimental results from training U-Net models on the CAMUS dataset with three types of simulated label errors, comparing a loss-based detector to VOG and testing a pseudo-label refurbishment strategy. Performance metrics are measured directly against the simulated ground truth, with no mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central claims to the method's own inputs. The evaluation protocol (error simulation, training, and metric computation) is independent and externally verifiable on the public dataset.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] INTRODUCTION Deep learning (DL) models have been widely proposed for automating the segmentation of cardiac structures from echocardiography (echo) images [1, 2]. Subsequently, these segmentations are often used to calculate functional biomarkers such as ejection fraction and therefore their reliability is of crucial importance in patient management [3,...
- [2] MATERIALS AND METHODS 2.1. Dataset We used the CAMUS dataset [1], which contains 2D echo images with manual labellings of the left ventricle (LV), left ventricular myocardium (LVM) and left atrium (LA). Specifically, we selected the apical four-chamber (A4C) view images at end-diastole (ED), resulting in a total of 500 annotated samples. 2.2. GT Label...
- [3] EXPERIMENTS AND RESULTS The purpose of these experiments is to evaluate (i) the impact of GT label errors on segmentation performance and (ii) the effectiveness of the GT label error detection and refurbishment method. 3.1. Experiment 1: GT Label Error Detection To evaluate the effectiveness of our method for identifying GT label errors, we conducted an...
- [4] DISCUSSION AND CONCLUSION Across all experiments, we observed that the baseline segmentation model was surprisingly robust to a wide range of GT label errors. Random label errors had surprisingly little effect on final performance. For systematic errors, refurbishment provided some improvement, especially at moderate levels of corruption. This sugge...
- [5] COMPLIANCE WITH ETHICAL STANDARDS This research study was conducted retrospectively using human subject data made available in open access [1]. Ethical approval was not required as confirmed by the license attached with the open access data.
- [6] ACKNOWLEDGMENTS We would like to acknowledge funding from the EPSRC Centre for Doctoral Training in Medical Imaging (EP/L015226/1).
- [7] S. Leclerc, E. Smistad, J. Pedrosa, A. Ostvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P.-M. Jodoin, T. Grenier, C. Lartizien, J. Dhooge, L. Lovstakken, and O. Bernard, “Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography,” IEEE Transactions on Medical Imaging, vol. 38, no. 9, pp. 2198–2210, Sep. 2019.
- [8] J. Tromp, P. J. Seekings, C.-L. Hung, M. B. Iversen, M. J. Frost, W. Ouwerkerk, Z. Jiang, F. Eisenhaber, R. S. M. Goh, H. Zhao, W. Huang, L.-H. Ling, D. Sim, P. Cozzone, A. M. Richards, H. K. Lee, S. D. Solomon, C. S. P. Lam, and J. A. Ezekowitz, “Automated interpretation of systolic and diastolic function on the echocardiogram: a multicohort study,” The...
- [9] E. Puyol-Antón, B. Ruijsink, B. S. Sidhu, J. Gould, B. Porter, M. K. Elliott, V. Mehta, H. Gu, C. A. Rinaldi, M. Cowie et al., “AI-enabled assessment of cardiac systolic and diastolic function from echocardiography,” in International Workshop on Advances in Simplifying Medical Ultrasound. Springer, 2022, pp. 75–85.
- [10] A. Ghorbani, D. Ouyang, A. Abid, B. He, J. H. Chen, R. A. Harrington, D. H. Liang, E. A. Ashley, and J. Y. Zou, “Deep learning interpretation of echocardiograms,” npj Digital Medicine, vol. 3, no. 1, p. 10, Jan. 2020. [Online]. Available: https://www.nature.com/articles/s41746-019-0216-8
- [11] J. Mariscal-Harana, C. Asher, V. Vergani, M. Rizvi, L. Keehn, R. J. Kim, R. M. Judd, S. E. Petersen, R. Razavi, A. P. King et al., “An artificial intelligence tool for automated analysis of large-scale unstructured clinical cine cardiac magnetic resonance databases,” European Heart Journal-Digital Health, vol. 4, no. 5, pp. 370–383, 2023.
- [12] H. Song, M. Kim, and J.-G. Lee, “Selfie: Refurbishing unclean samples for robust deep learning,” in International conference on machine learning. PMLR, 2019, pp. 5907–5915.
- [13] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” arXiv preprint arXiv:2002.07394, 2020.
- [14] L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: beyond empirical risk minimization,” Advances in neural information processing systems, vol. 33, pp. 19365–19376, 2020.
- [15] B. Khanal, T. Dai, B. Bhattarai, and C. Linte, “Active label refinement for robust training of imbalanced medical image classification tasks in the presence of high label noise,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 37–47.
- [16] P. Chen, J. Ye, G. Chen, J. Zhao, and P.-A. Heng, “Beyond class-conditional assumption: A primary attempt to combat instance-dependent label noise,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11442–11450.
- [17] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, vol. 31, 2018.
- [18] E. Redekop and A. Chernyavskiy, “Uncertainty-based method for improving poorly labeled segmentation datasets,” in 2021 IEEE 18th international symposium on biomedical imaging (ISBI). IEEE, 2021, pp. 1831–1835.
- [19] C. Xue, Q. Deng, X. Li, Q. Dou, and P.-A. Heng, “Cascaded robust learning at imperfect labels for chest x-ray segmentation,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2020, pp. 579–588.
- [20] S. Liu, K. Liu, W. Zhu, Y. Shen, and C. Fernandez-Granda, “Adaptive early-learning correction for segmentation from noisy annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2606–2616.
- [21] J. Li, R. Li, R. Han, and S. Wang, “Self-relabeling for noise-tolerant retina vessel segmentation through label reliability estimation,” BMC Medical Imaging, vol. 22, no. 1, p. 8, 2022.
- [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.