Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

Andrew J. Reader; Andrew P. King; Bram Ruijsink; Iman Islam

arxiv: 2604.12832 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Iman Islam, Bram Ruijsink, Andrew J. Reader, Andrew P. King

Pith reviewed 2026-05-10 15:59 UTC · model grok-4.3

keywords: echocardiography segmentation · ground truth label errors · deep learning robustness · variance of gradients · pseudo-labelling · medical image segmentation · U-Net · CAMUS dataset

The pith

A gradient-variance method detects erroneous ground truth labels during training of echocardiography segmentation models and refurbishes them with pseudo-labels to improve accuracy when errors are high.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning models for segmenting echocardiography images depend on manually annotated ground truth labels that often contain random errors or systematic biases. This work tests how well a standard U-Net tolerates such label noise on the CAMUS dataset and introduces a detection step based on the variance of gradients computed during training. Detected bad labels are then replaced by pseudo-labels generated by the model itself. The combined detection-and-refurbishment pipeline preserves or raises segmentation performance, with the largest gains appearing when label error rates exceed moderate levels.
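As a rough illustration of the refurbishment step described above, the following Python sketch flags the most suspicious fraction of training samples and overwrites their labels with the model's per-pixel argmax prediction. The function name `refurbish_labels`, the `flag_fraction` parameter, and the top-fraction flagging rule are assumptions made for illustration; the paper's actual threshold and pseudo-labelling rule are not given in this summary.

```python
import numpy as np

def refurbish_labels(labels, probs, scores, flag_fraction=0.2):
    """Replace the labels of the most suspicious samples with pseudo-labels.

    labels: (N, H, W) integer GT masks, possibly erroneous
    probs:  (N, C, H, W) model class probabilities for the same images
    scores: (N,) per-sample suspicion scores (higher = more suspect)
    flag_fraction: hypothetical fraction of samples to flag (an assumption)
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs)
    scores = np.asarray(scores)

    n_flag = int(round(flag_fraction * len(scores)))
    # Indices of the n_flag samples with the highest suspicion scores.
    flagged = np.argsort(scores)[::-1][:n_flag]

    refurbished = labels.copy()
    # Pseudo-label: per-pixel argmax over the class axis of the prediction.
    refurbished[flagged] = probs[flagged].argmax(axis=1)
    return refurbished, np.sort(flagged)
```

The original label array is left untouched, so the flagged indices can be inspected before the refurbished labels are used for further training.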

Core claim

Variance of Gradients proved highly effective at identifying erroneous ground truth labels while training proceeded. A standard U-Net retained strong segmentation performance under random label errors and under systematic errors up to 50 percent. Applying the detection step followed by pseudo-label refurbishment produced further gains, especially in the high-error regimes.

What carries the argument

Variance of Gradients (VOG) detection of label errors paired with pseudo-labelling for refurbishment
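To make the VOG idea concrete, here is a simplified Python sketch: for each training sample, take the gradient of the model output with respect to the input pixels at several training checkpoints, compute the per-pixel variance across checkpoints, and average over pixels to get one score per sample. This is a sketch of the general VOG idea only; which layers the paper uses and how it aggregates the variance are not stated in this summary, and the function name `vog_scores` and the input layout are assumptions.

```python
import numpy as np

def vog_scores(grads):
    """Per-sample Variance of Gradients score (simplified sketch).

    grads: (K, N, H, W) array of input-gradient maps for N samples,
           saved at K training checkpoints (layout is an assumption).
    Returns an (N,) array; higher values suggest a more suspect sample.
    """
    grads = np.asarray(grads, dtype=float)
    # Variance of each pixel's gradient across the K checkpoints -> (N, H, W).
    pixel_var = grads.var(axis=0)
    # Average the per-pixel variances over the image -> (N,).
    return pixel_var.mean(axis=(1, 2))
```

Samples whose gradients keep changing across checkpoints receive high scores, which is why a score of this kind can be ranked or thresholded to flag candidate label errors.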

If this is right

VOG-based detection can locate bad labels without requiring a separate clean validation set.
Refurbishing flagged labels raises segmentation accuracy most noticeably once error rates become high.
A standard U-Net already tolerates random errors and moderate systematic errors without any correction step.
The approach reduces the performance penalty that would otherwise arise from imperfect manual annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detection-plus-refurbishment loop could be tested on other medical segmentation tasks where annotation noise is common.
Combining VOG detection with active learning might further reduce the total number of expert annotations needed.
Clinical deployment would still require checking whether the method flags the same error patterns that appear in multi-center or multi-observer datasets.

Load-bearing premise

The three kinds of simulated label errors accurately represent the annotation mistakes that occur in real clinical echocardiography data.

What would settle it

Apply the VOG detection and pseudo-label refurbishment pipeline to a set of echocardiography images whose ground-truth labels have been independently re-annotated by multiple experts and check whether the measured improvement matches the gains seen with simulated errors.

Original abstract

Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VOG flags simulated label errors well and refurbishment helps at high noise levels in echo segmentation, but the gains rest on simulations that may not match real clinician mistakes.

The paper's main result is that a variance-of-gradients detector catches erroneous ground truth labels during training on the CAMUS dataset, and swapping in pseudo-labels for the flagged ones lifts segmentation performance when error rates are high. A plain U-Net already stays solid under random errors and up to 50% systematic ones, so the extra machinery mainly pays off in the worst-case simulated regimes. That is a concrete, usable observation for anyone training models on medical images where annotations are imperfect.

Referee Report

2 major / 2 minor

Summary. The paper examines the robustness of U-Net models for echocardiography segmentation to ground truth label errors on the CAMUS dataset. It simulates three error types spanning random and systematic errors, compares a loss-based detection approach to one using Variance of Gradients (VOG), and proposes a pseudo-labelling refurbishment strategy for suspected erroneous labels. The central claims are that VOG is highly effective at flagging errors during training, that standard models remain robust to random errors and moderate systematic errors (up to 50%), and that the detection-plus-refurbishment pipeline yields performance gains, especially at high error rates.

Significance. If the results hold beyond the simulations, the work could help mitigate the impact of annotation noise in clinical deep learning pipelines, reducing reliance on perfectly curated ground truth and improving model reliability in echocardiography. The use of a public dataset supports reproducibility, and the during-training detection approach is practically appealing. However, the significance depends on whether the simulated errors capture real clinical annotation patterns.

major comments (2)

[Error simulation and evaluation sections] The claims about VOG's high effectiveness in flagging erroneous GT labels and the benefits of pseudo-label refurbishment rest entirely on experiments with three simulated error types. No validation against actual clinician-annotated errors (e.g., boundary shifts due to acoustic shadowing or consistent under-segmentation) is provided, which is load-bearing for generalizability and the reported robustness thresholds (up to 50%).
[Results section] The abstract states positive outcomes for detection and refurbishment but the provided description lacks specific quantitative metrics (Dice/IoU scores, detection precision/recall, ablation results, or statistical tests). This makes it impossible to verify the magnitude of improvements or the cross-error-level claims without the full tables and figures.
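For readers unfamiliar with the detection metrics named in these comments, the sketch below shows how precision and recall could be computed when the corrupted samples are known, as they are in a simulation. The function name `detection_precision_recall` and the boolean-mask inputs are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def detection_precision_recall(flagged, corrupted):
    """Precision and recall of a label-error detector.

    flagged:   (N,) booleans, True where the detector flagged the sample
    corrupted: (N,) booleans, True where the label was actually corrupted
    """
    flagged = np.asarray(flagged, dtype=bool)
    corrupted = np.asarray(corrupted, dtype=bool)

    true_pos = np.sum(flagged & corrupted)
    # Guard against division by zero when nothing is flagged or corrupted.
    precision = true_pos / flagged.sum() if flagged.sum() else 0.0
    recall = true_pos / corrupted.sum() if corrupted.sum() else 0.0
    return float(precision), float(recall)
```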

minor comments (2)

[Abstract] Adding one or two key numerical results (e.g., detection accuracy or Dice improvement at high error rates) would make the summary more informative.
[Methods] The exact computation of VOG (e.g., which layers, how variance is aggregated) and the pseudo-labelling threshold should be stated more explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: Error simulation and evaluation sections: The claims about VOG's high effectiveness in flagging erroneous GT labels and the benefits of pseudo-label refurbishment rest entirely on experiments with three simulated error types. No validation against actual clinician-annotated errors (e.g., boundary shifts due to acoustic shadowing or consistent under-segmentation) is provided, which is load-bearing for generalizability and the reported robustness thresholds (up to 50%).

Authors: We acknowledge that our evaluation relies exclusively on three types of simulated label errors rather than real clinician annotations. This controlled simulation approach enables precise quantification of error impact and method performance across error rates, which would be difficult with uncontrolled real-world noise. We agree this limits direct claims about clinical annotation patterns. In the revised manuscript we will expand the Discussion and add a dedicated Limitations subsection that explicitly relates the simulated errors to common clinical issues (e.g., boundary shifts from shadowing) and states that validation on expert-verified erroneous labels remains future work. The current results still demonstrate the relative effectiveness of VOG versus loss-based detection under known noise conditions.

revision: partial

Referee: Results section: The abstract states positive outcomes for detection and refurbishment but the provided description lacks specific quantitative metrics (Dice/IoU scores, detection precision/recall, ablation results, or statistical tests). This makes it impossible to verify the magnitude of improvements or the cross-error-level claims without the full tables and figures.

Authors: The full manuscript (Section 4 and associated tables/figures) already reports the requested quantitative results: Dice and IoU scores across all error types and rates, detection precision/recall for VOG versus the loss-based baseline, ablation experiments isolating the refurbishment step, and statistical tests (paired t-tests with p-values). The robustness threshold of 50% for systematic errors and the performance gains from refurbishment at high error rates are directly supported by these metrics. To improve accessibility we will revise the abstract to include a concise summary of the key numerical findings (e.g., typical Dice improvement after refurbishment at 70% error).

revision: yes
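The Dice and IoU scores referred to in this exchange are standard overlap measures. A minimal Python sketch for one structure label follows; the function name `dice_and_iou` and the convention of returning 1.0 when the label is absent from both masks are assumptions, and the paper's own implementation may differ.

```python
import numpy as np

def dice_and_iou(pred, target, label):
    """Dice and IoU for one class label in integer segmentation masks."""
    p = np.asarray(pred) == label
    t = np.asarray(target) == label

    inter = np.logical_and(p, t).sum()
    total = p.sum() + t.sum()
    union = total - inter

    # Assumed convention: perfect score when the label is absent from both.
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)
```

For a multi-structure task such as LV, LVM and LA segmentation, the function would be called once per label and the scores reported per structure or averaged.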

Standing simulated objections (not resolved)

Empirical validation against a dataset containing actual clinician-annotated ground-truth errors (as opposed to simulated errors)

Circularity Check

0 steps flagged

No circularity: empirical evaluation on simulated errors is self-contained

The paper reports experimental results from training U-Net models on the CAMUS dataset with three types of simulated label errors, comparing a loss-based detector to VOG and testing a pseudo-label refurbishment strategy. Performance metrics are measured directly against the simulated ground truth, with no mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central claims to the method's own inputs. The evaluation protocol (error simulation, training, and metric computation) is independent and externally verifiable on the public dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions in deep learning for segmentation (U-Net architecture, supervised training) and the representativeness of the CAMUS dataset and simulated errors. No new free parameters, axioms, or invented entities are introduced in the abstract.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

[1]

INTRODUCTION Deep learning (DL) models have been widely proposed for automating the segmentation of cardiac structures from echocardiography (echo) images [1, 2]. Subsequently, these segmentations are often used to calculate functional biomarkers such as ejection fraction and therefore their reliability is of crucial importance in patient management [3,...

[2]

MATERIALS AND METHODS 2.1. Dataset We used the CAMUS dataset [1], which contains 2D echo images with manual labellings of the left ventricle (LV), left ventricular myocardium (LVM) and left atrium (LA). Specifically, we selected the apical four-chamber (A4C) view images at end-diastole (ED), resulting in a total of 500 annotated samples. 2.2. GT Label...

[3]

EXPERIMENTS AND RESULTS The purpose of these experiments is to evaluate (i) the impact of GT label errors on segmentation performance and (ii) the effectiveness of the GT label error detection and refurbishment method. 3.1. Experiment 1: GT Label Error Detection To evaluate the effectiveness of our method for identifying GT label errors, we conducted an...

[4]

DISCUSSION AND CONCLUSION Across all experiments, we observed that the baseline segmentation model was surprisingly robust to a wide range of GT label errors. Random label errors had surprisingly little effect on final performance. For systematic errors, refurbishment provided some improvement, especially at moderate levels of corruption. This sugge...

[5]

COMPLIANCE WITH ETHICAL STANDARDS This research study was conducted retrospectively using human subject data made available in open access [1]. Ethical approval was not required as confirmed by the license attached with the open access data

[6]

ACKNOWLEDGMENTS We would like to acknowledge funding from the EPSRC Centre for Doctoral Training in Medical Imaging (EP/L015226/1)

[7]

S. Leclerc, E. Smistad, J. Pedrosa, A. Ostvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P.-M. Jodoin, T. Grenier, C. Lartizien, J. Dhooge, L. Lovstakken, and O. Bernard, “Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography,” IEEE Transactions on Medical Imaging, vol. 38, no. 9, pp. 2198–2210, Sep. 2019

[8]

J. Tromp, P. J. Seekings, C.-L. Hung, M. B. Iversen, M. J. Frost, W. Ouwerkerk, Z. Jiang, F. Eisenhaber, R. S. M. Goh, H. Zhao, W. Huang, L.-H. Ling, D. Sim, P. Cozzone, A. M. Richards, H. K. Lee, S. D. Solomon, C. S. P. Lam, and J. A. Ezekowitz, “Automated interpretation of systolic and diastolic function on the echocardiogram: a multicohort study,” The...

[9]

E. Puyol-Antón, B. Ruijsink, B. S. Sidhu, J. Gould, B. Porter, M. K. Elliott, V. Mehta, H. Gu, C. A. Rinaldi, M. Cowie et al., “AI-enabled assessment of cardiac systolic and diastolic function from echocardiography,” in International Workshop on Advances in Simplifying Medical Ultrasound. Springer, 2022, pp. 75–85

[10]

A. Ghorbani, D. Ouyang, A. Abid, B. He, J. H. Chen, R. A. Harrington, D. H. Liang, E. A. Ashley, and J. Y. Zou, “Deep learning interpretation of echocardiograms,” npj Digital Medicine, vol. 3, no. 1, p. 10, Jan. 2020. [Online]. Available: https://www.nature.com/articles/s41746-019-0216-8

[11]

J. Mariscal-Harana, C. Asher, V. Vergani, M. Rizvi, L. Keehn, R. J. Kim, R. M. Judd, S. E. Petersen, R. Razavi, A. P. King et al., “An artificial intelligence tool for automated analysis of large-scale unstructured clinical cine cardiac magnetic resonance databases,” European Heart Journal-Digital Health, vol. 4, no. 5, pp. 370–383, 2023

[12]

H. Song, M. Kim, and J.-G. Lee, “Selfie: Refurbishing unclean samples for robust deep learning,” in International conference on machine learning. PMLR, 2019, pp. 5907–5915

[13]

J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” arXiv preprint arXiv:2002.07394, 2020

[14]

L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: beyond empirical risk minimization,” Advances in neural information processing systems, vol. 33, pp. 19 365–19 376, 2020

[15]

B. Khanal, T. Dai, B. Bhattarai, and C. Linte, “Active label refinement for robust training of imbalanced medical image classification tasks in the presence of high label noise,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 37–47

[16]

P. Chen, J. Ye, G. Chen, J. Zhao, and P.-A. Heng, “Beyond class-conditional assumption: A primary attempt to combat instance-dependent label noise,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11 442–11 450

[17]

B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, vol. 31, 2018

[18]

E. Redekop and A. Chernyavskiy, “Uncertainty-based method for improving poorly labeled segmentation datasets,” in 2021 IEEE 18th international symposium on biomedical imaging (ISBI). IEEE, 2021, pp. 1831–1835

[19]

C. Xue, Q. Deng, X. Li, Q. Dou, and P.-A. Heng, “Cascaded robust learning at imperfect labels for chest x-ray segmentation,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2020, pp. 579–588

[20]

S. Liu, K. Liu, W. Zhu, Y. Shen, and C. Fernandez-Granda, “Adaptive early-learning correction for segmentation from noisy annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2606–2616

[21]

J. Li, R. Li, R. Han, and S. Wang, “Self-relabeling for noise-tolerant retina vessel segmentation through label reliability estimation,” BMC Medical Imaging, vol. 22, no. 1, p. 8, 2022

[22]

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241
