On Revisiting Entropy for Identifying Mislabeled Images

Chunlei Li; Guanglu Dong; Jingliang Hu; Lichao Mou; Pengfei Li; Xiao Xiang Zhu; Yilei Shi; Zixuan Zheng

arxiv: 2605.31090 · v1 · pith:MVFQILJPnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

Pith reviewed 2026-06-28 23:10 UTC · model grok-4.3

keywords: mislabeled data detection · prediction entropy · training dynamics · medical imaging · label noise · deep neural networks · CLIP

The pith

A signed integral of prediction entropy changes during training identifies mislabeled images more reliably than prior techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prediction entropy for correctly labeled samples decreases steadily over training epochs while remaining high for mislabeled samples. It introduces the signed entropy integral (SEI) to measure both the size and the direction of these entropy shifts as a detection score. This statistic is shown to work across standard classifiers and CLIP models. Experiments on four medical imaging datasets with varied modalities and pathologies support its use for cleaning training data. A reader would care because label errors are common in medical domains and degrade model performance.
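For reference, the prediction entropy discussed above is the Shannon entropy of the model's softmax output at a given epoch. The symbols below ($p_t$ for the predicted class distribution at epoch $t$, $K$ for the number of classes, $H_t$ for the entropy) are introduced here for illustration and are not taken from the paper.

```latex
H_t(x) = -\sum_{k=1}^{K} p_t(k \mid x)\,\log p_t(k \mid x),
\qquad 0 \le H_t(x) \le \log K
```

Under the paper's stated observation, $H_t(x)$ trends downward over epochs for correctly labeled $x$ and stays comparatively high for mislabeled $x$.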

Core claim

The central claim is that correctly labeled samples exhibit consistent entropy decrease during training while mislabeled samples maintain relatively high entropy throughout the process; the signed entropy integral (SEI) statistic captures both the magnitude and temporal trend of prediction entropy across training epochs and achieves state-of-the-art performance in mislabeled data identification on four medical imaging datasets while remaining computationally efficient.

What carries the argument

signed entropy integral (SEI): the signed area under a per-sample signed entropy curve over training epochs (see Figure 4), yielding one score per image that reflects both the level and the trend of prediction entropy
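A minimal sketch of how such a statistic could be computed for one sample is given below. It is not the authors' implementation: this page does not state the exact signing rule, so the code assumes the sign is +1 when the predicted class agrees with the given label and -1 otherwise (suggested by the alignment circles in Figure 3), and it normalizes entropy by log K so the signed curve lies in [-1, 1]. The function names are hypothetical.

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of `probs`, divided by log(K)."""
    probs = np.asarray(probs, dtype=float)
    num_classes = probs.shape[-1]
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    return entropy / np.log(num_classes)

def signed_entropy_integral(probs_per_epoch, given_label):
    """Hypothetical SEI for a single sample.

    probs_per_epoch: array of shape (T, K), softmax outputs over T epochs.
    given_label: the training label, which may be wrong.
    ASSUMPTION: sign is +1 when argmax matches the given label, else -1.
    The paper's actual definition may differ.
    """
    probs = np.asarray(probs_per_epoch, dtype=float)
    entropy = normalized_entropy(probs)                  # shape (T,)
    agrees = probs.argmax(axis=-1) == given_label        # shape (T,)
    signed_curve = np.where(agrees, 1.0, -1.0) * entropy
    # Trapezoidal area under the signed curve, unit spacing between epochs.
    return float(0.5 * (signed_curve[1:] + signed_curve[:-1]).sum())
```

Under this assumed rule, a sample whose prediction disagrees with its given label while entropy stays high accumulates a negative score, whereas a sample the model agrees with accumulates a non-negative one, which matches the ordering described in the Figure 4 caption.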

If this is right

SEI outperforms existing mislabeled-data detectors on four medical imaging datasets spanning multiple modalities and pathologies.
The statistic applies directly to both standard classification networks and CLIP-based models without architectural changes.
SEI requires only the entropy values already computed during normal training, preserving computational efficiency.
The method identifies mislabeled samples by a single scalar per image after training completes.
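The last point, one scalar per image, implies a simple ranking step after training. The sketch below flags the lowest-scoring fraction of samples. It assumes that lower scores are more suspicious (consistent with the Figure 4 caption, where correctly labeled samples yield larger SEIs) and that the caller supplies an estimated noise fraction; how the paper actually sets its threshold is not stated on this page, and `flag_suspects` is a hypothetical name.

```python
import numpy as np

def flag_suspects(scores, fraction):
    """Return a boolean mask marking the `fraction` of samples with the lowest scores.

    ASSUMPTION: lower score means more likely mislabeled.
    """
    scores = np.asarray(scores, dtype=float)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_flag = int(round(fraction * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if n_flag > 0:
        # Stable sort keeps the original order among tied scores.
        lowest = np.argsort(scores, kind="stable")[:n_flag]
        mask[lowest] = True
    return mask
```

Flagged samples would then be removed or sent for relabeling before retraining.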

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the entropy-separation pattern holds outside medicine, SEI could be tested on standard image-classification benchmarks with synthetic label noise to check broader applicability.
Entropy monitoring might be combined with loss-based cleaning methods to improve detection precision, though the paper evaluates SEI alone.
The same entropy trend could potentially guide early stopping or learning-rate schedules when label noise is suspected.

Load-bearing premise

Correctly labeled samples exhibit consistent entropy decrease during training while mislabeled samples maintain relatively high entropy throughout the process.

What would settle it

A controlled training run on a dataset with verified ground-truth labels in which the entropy trajectories of correct and incorrect samples fail to separate according to the described pattern would falsify the core premise.
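One way to set up such a controlled check is sketched below: flip a known fraction of verified labels (symmetric noise, as in the settings named in the figure captions), train, score every sample, and then measure how well the scores separate flipped from untouched samples. The helper names and the choice of AUROC as the separation measure are assumptions made for this illustration; training and scoring are left out.

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a `noise_rate` fraction of labels to a different class chosen uniformly."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * labels.size))
    flip_idx = rng.choice(labels.size, size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes always lands on a new class.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    is_flipped = np.zeros(labels.size, dtype=bool)
    is_flipped[flip_idx] = True
    return noisy, is_flipped

def auroc_clean_vs_flipped(scores, is_flipped):
    """Probability that a random clean sample scores higher than a random flipped one."""
    scores = np.asarray(scores, dtype=float)
    is_flipped = np.asarray(is_flipped, dtype=bool)
    clean, flipped = scores[~is_flipped], scores[is_flipped]
    greater = (clean[:, None] > flipped[None, :]).mean()
    ties = (clean[:, None] == flipped[None, :]).mean()
    return float(greater + 0.5 * ties)
```

A value near 0.5 on such a run would indicate that the scores do not separate the two groups, which is the kind of outcome that would count against the premise.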

Figures

Figures reproduced from arXiv: 2605.31090 by Chunlei Li, Guanglu Dong, Jingliang Hu, Lichao Mou, Pengfei Li, Xiao Xiang Zhu, Yilei Shi, Zixuan Zheng.

**Figure 1.** Training dynamics of prediction entropy for correctly labeled versus mislabeled samples. Left: Nevus images from the ISIC dataset, comparing correctly labeled samples (ground truth: nevus; given label: nevus) with mislabeled samples (ground truth: nevus; given label: other skin lesion categories). Right: Grade 4 diabetic retinopathy (DR) images from the DeepDRiD dataset. In both cases, correctly labeled sa…

**Figure 2.** Entropy trajectories for a mislabeled sample (ground truth: nevus; given label: melanoma) and a hard clean sample (ground truth: nevus; given label: nevus). Despite differing label correctness, their entropy curves are nearly indistinguishable. This illustrates that entropy alone cannot reliably distinguish mislabeled data from challenging but clean examples.

**Figure 3.** Training dynamics analysis through label-prediction alignment patterns over time. Bubble timeline charts illustrate the evolution of three sample categories during training: (a) easy clean samples, (b) hard clean samples, and (c) mislabeled samples. At each time step, green circles indicate alignment between predicted and given labels, while red circles denote misalignment. Circle size and opacity enco…

**Figure 4.** Illustration of SEI. The plots depict signed entropy curves over training epochs for easy clean (left), hard clean (middle), and mislabeled (right) samples. Each curve is averaged over 200 samples, and the signed area under each curve corresponds to the SEI. Correctly labeled samples yield larger SEIs than mislabeled ones.

**Figure 5.** F1 score comparison between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

**Figure 6.** Representative samples from each class across the three datasets employed in the study: ISIC, DeepDRiD, and PANDA. One exemplar image is displayed per class, with rows corresponding to individual datasets and columns representing distinct classes. This visualization facilitates direct comparison of class-specific visual characteristics. The corresponding text prompts utilized for training …

**Figure 7.** Training dynamics of prediction entropy for correctly labeled versus mislabeled samples. Top-left: melanoma images from the ISIC dataset. Top-right: grade 0 diabetic retinopathy (DR) images from the DeepDRiD dataset. Bottom-left: benign glandular epithelium images from the PANDA dataset. Bottom-right: Gleason 5 cancerous epithelium images from the PANDA dataset.

**Figure 8.** Illustration of SEI using melanoma images from the ISIC dataset. The plots show signed entropy curves across training epochs for easy clean (left), hard clean (middle), and mislabeled (right) samples. Each curve is averaged over 200 samples, and the signed area under the curve represents the SEI. Correctly labeled samples consistently exhibit larger SEIs than mislabeled ones.

**Figure 9.** (caption not recovered)

**Figure 10.** Illustration of SEI using Gleason 5 images from the PANDA dataset.

**Figure 11.** Score distributions for clean and noisy samples on the PANDA dataset. We compare our proposed SEI statistic (bottom row) against the unsigned Shannon entropy integral baseline (top row) for noise rates 𝜂 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Each column corresponds to increasing noise levels from left to right. The SEI statistic demonstrates better separation between clean and noisy sample distributions at all noi…

**Figure 12.** Accuracy comparison under confusion-calibrated noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

**Figure 13.** AUC comparison under confusion-calibrated noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

**Figure 14.** F1 score comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

**Figure 15.** Accuracy comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

**Figure 16.** AUC comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix.

Original abstract

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SEI is a workable entropy-based filter for medical label noise that rests on an empirical pattern without theoretical grounding.

The paper introduces a signed entropy integral (SEI) that aggregates how prediction entropy evolves over training epochs. Correctly labeled samples are said to show steady entropy drop while mislabeled ones stay high, and the signed integral is meant to capture both size and direction of that change.

What is new is the particular signed aggregation; plain entropy or loss-based detection has been tried before. The work does well on the practical side: code is released, experiments cover four medical imaging datasets across modalities, and they use CLIP backbones that are common in the domain. That makes the method easy to test for anyone cleaning diagnostic data.

The soft spot is the load-bearing assumption. The entropy trajectory difference is presented as an observation from their runs, but there is no derivation or analysis showing why it should hold under different losses, regularizers, or noise levels. If both classes eventually reach low entropy after longer training, the statistic loses power. The SOTA claim is also tied to the specific setups shown; without reported ablations on integration limits or sign choice, it is unclear how sensitive the gains are.

This is for people who need a lightweight data-cleaning step in medical imaging pipelines. It is not foundational. A serious referee should see it because the experiments are on real data and the implementation is straightforward, even though the justification for the dynamics will need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the signed entropy integral (SEI) statistic for mislabeled sample detection. It is grounded in the empirical observation that correctly labeled examples exhibit consistent prediction-entropy decrease over training epochs while mislabeled examples maintain relatively high entropy; SEI integrates both magnitude and sign of this trend. The authors report that SEI achieves state-of-the-art identification performance on four medical imaging datasets spanning modalities and pathologies when used with CLIP backbones, while remaining computationally lightweight. Code is released.

Significance. If the entropy-dynamics observation proves robust, SEI would supply a simple, training-dynamics-based alternative to existing mislabel detectors that does not require auxiliary models or clean validation sets. The public code release is a clear strength. The restriction to CLIP backbones and four medical datasets, however, together with the purely empirical grounding, caps the result’s generality and long-term significance.

major comments (2)

[Abstract] The central claim rests on the stated observation that “correctly labeled samples exhibit consistent entropy decrease … while mislabeled samples maintain relatively high entropy.” No derivation, no analysis of dependence on loss, optimizer, schedule, or label-noise fraction, and no examination of the regime in which both classes eventually reach low entropy are supplied; this directly affects whether SEI remains discriminative.
[Experiments] (implied by performance claims) The SOTA assertion is presented without error bars, without ablation on the sign choice or integration limits inside SEI, and without correction for multiple testing across four datasets; these omissions make it impossible to judge whether the reported outperformance is statistically reliable or dataset-specific.
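As an illustration of the error bars requested in the second comment, the sketch below computes a percentile bootstrap interval for the detection F1 score on a single run. This is one common choice, not something the paper or the report prescribes; repeated training runs with different seeds would be an alternative. The function names and default settings are assumptions.

```python
import numpy as np

def f1_score_binary(is_mislabeled, is_flagged):
    """F1 for detecting mislabeled samples, given ground-truth and predicted masks."""
    truth = np.asarray(is_mislabeled, dtype=bool)
    pred = np.asarray(is_flagged, dtype=bool)
    tp = np.sum(truth & pred)
    fp = np.sum(~truth & pred)
    fn = np.sum(truth & ~pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def bootstrap_f1_interval(is_mislabeled, is_flagged, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling samples with replacement."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(is_mislabeled, dtype=bool)
    pred = np.asarray(is_flagged, dtype=bool)
    n = truth.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # indices drawn with replacement
        stats[b] = f1_score_binary(truth[idx], pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)
```

Reporting the point estimate together with such an interval for each dataset and noise rate would make it easier to judge whether differences between methods exceed sampling variation.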

minor comments (2)

[Abstract] The abstract would be clearer if it included the explicit integral definition of SEI rather than only describing its motivation.
A short limitations paragraph discussing possible failure modes (e.g., late-training entropy collapse for both classes) would improve transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The work is empirical and focuses on the practical performance of SEI for mislabeled sample detection in medical imaging with CLIP backbones. We address the two major comments point by point below.

Referee: [Abstract] The central claim rests on the stated observation that “correctly labeled samples exhibit consistent entropy decrease … while mislabeled samples maintain relatively high entropy.” No derivation, no analysis of dependence on loss, optimizer, schedule, or label-noise fraction, and no examination of the regime in which both classes eventually reach low entropy are supplied; this directly affects whether SEI remains discriminative.

Authors: The central claim is presented explicitly as an empirical observation from training dynamics on the evaluated datasets, not as a theoretically derived property. The manuscript does not claim universality across all losses, optimizers, or noise levels, and the method is proposed for its utility in the medical imaging setting with CLIP. We did not include analysis of the late-training regime where both classes may reach low entropy because our experiments focus on the epochs where the distinction is observed and useful for detection. We can revise the abstract and add a limitations paragraph clarifying the empirical grounding and scope.

Revision: partial

Referee: [Experiments] (implied by performance claims) The SOTA assertion is presented without error bars, without ablation on the sign choice or integration limits inside SEI, and without correction for multiple testing across four datasets; these omissions make it impossible to judge whether the reported outperformance is statistically reliable or dataset-specific.

Authors: We agree that error bars from repeated runs and ablations on SEI components (sign and integration limits) would strengthen the results and will add them in revision. The four datasets are distinct in modality and pathology; we view the consistent outperformance as supportive rather than requiring formal multiple-testing correction, but we can add a clarifying statement. The SOTA claim is based on the reported metrics outperforming prior methods on these datasets.

Revision: yes

Circularity Check

0 steps flagged

No circularity; method is an empirical statistic validated on held-out datasets

The paper grounds SEI in an empirical observation of entropy trajectories and defines the statistic directly from training dynamics; it then reports experimental performance on four medical datasets. No equations, fitted parameters, or self-citations are shown that reduce the claimed detection performance to the inputs by construction. The derivation chain is therefore self-contained and externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond the SEI definition itself.

pith-pipeline@v0.9.1-grok · 5730 in / 933 out tokens · 17472 ms · 2026-06-28T23:10:38.906490+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references

[1]

Tackling algorith- mic bias and promoting transparency in health datasets: the standing together consensus recommendations.The Lancet Digital Health, 7:e64–e88, 2025

Cole-Lewis, H., Glocker, B., et al. Tackling algorith- mic bias and promoting transparency in health datasets: the standing together consensus recommendations.The Lancet Digital Health, 7:e64–e88, 2025

2025
[2]

Unsupervised label noise modeling and loss cor- rection

Arazo, E., Ortego, D., Albert, P., O’Connor, N., and McGuin- ness, K. Unsupervised label noise modeling and loss cor- rection. InProceedings of the International Conference on Machine Learning, pp. 312–321, 2019

2019
[3]

Dirichlet-based per-sample weighting by transition matrix for noisy la- bel learning

Bae, H., Shin, S., Na, B., and Moon, I. Dirichlet-based per-sample weighting by transition matrix for noisy la- bel learning. InProceedings of the 12th International Conference on Learning Representations, 2024

2024
[4]

Chen, J., Ramanathan, V., Xu, T., and Martel, A. L. De- tecting noisy labels with repeated cross-validations. In Proceedings of the Medical Image Computing and Com- puter Assisted Intervention, pp. 197–207, 2024

2024
[5]

Understanding and utilizing deep neural networks trained with noisy labels

Chen, P., Liao, B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. InProceedings of the International Conference on Machine Learning, pp. 1062–1070, 2019

2019
[6]

X., and Mou, L

Chen, Y., Wang, Q., Li, C., Hu, J., Shi, Y., Xiong, S., Zhu, X. X., and Mou, L. Propl: Universal semi-supervised ultrasound image segmentation via prompt-guided pseudo- labeling. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 3101–3110, 2026

2026
[7]

Class-dependent label-noise learning with cycle-consistency regularization

Han, B., and Liu, T. Class-dependent label-noise learning with cycle-consistency regularization. InProceedings of the Advances in Neural Information Processing Systems, pp. 11104–11116, 2022

2022
[8]

Learning with instance-dependent label noise: A sample sieve approach

Cheng, H., Zhu, Z., Li, X., Gong, Y., Sun, X., and Liu, Y. Learning with instance-dependent label noise: A sample sieve approach. InProceedings of the 9th International Conference on Learning Representations, 2021

2021
[9]

Weakly supervised learning with side information for noisy labeled images

Cheng, L., Zhou, X., Zhao, L., Li, D., Shang, H., Zheng, Y., Pan, P., and Xu, Y. Weakly supervised learning with side information for noisy labeled images. InProceedings of the European Conference on Computer Vision, pp. 306–321, 2020

2020
[10]

X., and Mou, L

Dong, L., Tan, Q., Li, C., Hu, J., Shi, Y., Dong, W., Zhu, X. X., and Mou, L. Dual distillation for few-shot anomaly detection. InProceedings of the 14th International Con- ference on Learning Representations, 2026. 9 On Revisiting Entropy for Identifying Mislabeled Images

2026
[11]

An image is worth 16x16 words: Transformers for image recognition at scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the 9th International Conference on Learning Representations, 2021

2021
[12]

and Azizpour, H

Englesson, E. and Azizpour, H. Robust classification via regression for learning with noisy labels. InProceed- ings of the 12th International Conference on Learning Representations, 2024

2024
[13]

Deep Self-Learning from noisy labels

Han, J., Luo, P., and Wang, X. Deep Self-Learning from noisy labels. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pp. 5138–5147, 2019

2019
[14]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016

2016
[15]

O2U-Net: A simple noisy label detection approach for deep neural networks

Huang, J., Qu, L., Jia, R., and Zhao, B. O2U-Net: A simple noisy label detection approach for deep neural networks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3325–3333, 2019

2019
[16]

N., Lungren, M

Patel, B. N., Lungren, M. P., and Ng, A. Y. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 590–597, 2019

2019
[17]

Delving into sample loss curve to embrace noisy and imbalanced data

Jiang, S., Li, J., Wang, Y., Huang, B., Zhang, Z., and Xu, T. Delving into sample loss curve to embrace noisy and imbalanced data. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 7024–7032, 2022

2022
[18]

Splitnet: Learnable clean-noisy label splitting for learning with noisy labels

Kim, D., Ryoo, K., Cho, H., and Kim, S. Splitnet: Learnable clean-noisy label splitting for learning with noisy labels. International Journal of Computer Vision, 133(2):549– 566, 2025

2025
[19]

Learning discriminative dynamics with label corruption for noisy label detection

Kim, S., Lee, D., Kang, S., Chae, S., Jang, S., and Yu, H. Learning discriminative dynamics with label corruption for noisy label detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22477–22487, 2024

2024
[20]

J., and Ronneberger, O

Rezende, D. J., and Ronneberger, O. A probabilistic U-Net for segmentation of ambiguous images. InProceedings of the Advances in Neural Information Processing Systems, pp. 6965–6975, 2018

2018
[21]

Scale-aware contrastive reverse distillation for unsupervised medical anomaly detection

Li, C., Shi, Y., Hu, J., Zhu, X., and Mou, L. Scale-aware contrastive reverse distillation for unsupervised medical anomaly detection. InProceedings of the 13th Interna- tional Conference on Learning Representations, 2025

2025
[22]

Li, J., Socher, R., and Hoi, S. C. H. DivideMix: Learning with noisy labels as semi-supervised learning. InPro- ceedings of the 8th International Conference on Learning Representations, 2020

2020
[23]

Ciompi, F., Ghafoorian, M., van der Laak, J. A. W. M., van Ginneken, B., and S ´anchez, C. I. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017

2017
[24]

Deepdrid: Diabetic retinopathy-grading and image quality estimation challenge.Patterns, 3:100512, 2022

Jia, W., Shen, D., Sheng, B., and Zhang, P. Deepdrid: Diabetic retinopathy-grading and image quality estimation challenge.Patterns, 3:100512, 2022

2022
[25]

Early-learning regularization prevents memorization of noisy labels

Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-Granda, C. Early-learning regularization prevents memorization of noisy labels. InProceedings of the Advances in Neural Information Processing Systems, pp. 20331–20342, 2020

2020
[26]

K., and Kumar, S

Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Does label smoothing mitigate label noise? In Proceedings of the International Conference on Machine Learning, pp. 6448–6458, 2020

2020
[27]

Semi-supervised medical image segmentation through dual-task consis- tency

Luo, X., Chen, J., Song, T., and Wang, G. Semi-supervised medical image segmentation through dual-task consis- tency. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 8801–8809, 2021

2021
[28]

Difficulty-aware attention network with confidence learning for medical image segmentation

Nie, D., Wang, L., Xiang, L., Zhou, S., Adeli, E., and Shen, D. Difficulty-aware attention network with confidence learning for medical image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 1085–1092, 2019

2019
[29]

G., Jiang, L., and Chuang, I

Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels.Journal of Artificial Intelligence Research, 70:1373–1411, 2021

2021
[30]

R., and Weinberger, K

Pleiss, G., Zhang, T., Elenberg, E. R., and Weinberger, K. Q. Identifying mislabeled data using the area under the margin ranking. InProceedings of the Advances in Neural Information Processing Systems, pp. 17044–17056, 2020. 10 On Revisiting Entropy for Identifying Mislabeled Images

2020
[31]

Learning transferable visual models from natural language supervision

Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pp. 8748–8763, 2021

2021
[32]

Learning to reweight examples for robust deep learning

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. InPro- ceedings of the International Conference on Machine Learning, pp. 4334–4343, 2018

2018
[33]

Imagenet large scale visual recognition challenge

Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015

2015
[34]

A survey of label-noise deep learning for medical image analysis.Medical Image Analysis, 95:103166, 2024

Shi, J., Zhang, K., Guo, C., Yang, Y., Xu, Y., and Wu, J. A survey of label-noise deep learning for medical image analysis.Medical Image Analysis, 95:103166, 2024

2024
[35]

SELFIE: Refurbishing unclean samples for robust deep learning

Song, H., Kim, M., and Lee, J. SELFIE: Refurbishing unclean samples for robust deep learning. InProceedings of the International Conference on Machine Learning, pp. 5907–5915, 2019

2019
[36]

Noisegpt: La- bel noise detection and rectification through probability curvature

Wang, H., Huang, Z., Lin, Z., and Liu, T. Noisegpt: La- bel noise detection and rectification through probability curvature. InProceedings of the Advances in Neural Information Processing Systems, pp. 120159–120183, 2024

2024
[37]

Symmetric cross entropy for robust learning with noisy labels

Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. Symmetric cross entropy for robust learning with noisy labels. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330, 2019

2019
[38]

Learning with noisy labels revisited: A study using real-world human annotations

Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., and Liu, Y. Learning with noisy labels revisited: A study using real-world human annotations. InProceedings of the 10th International Conference on Learning Representations, 2022

2022
[39]

Vision-Language models are strong noisy label detectors

Wei, T., Li, H., Li, C., Shi, J., Li, Y., and Zhang, M. Vision-Language models are strong noisy label detectors. InProceedings of the Advances in Neural Information Processing Systems, pp. 58154–58173, 2024

2024
[40]

Robust early-learning: Hindering the memorization of noisy labels

Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. InProceedings of the 9th International Conference on Learning Representations, 2021

2021
[41]

ProMix: Combating label noise via maximizing clean sample utility

Xiao, R., Dong, Y., Wang, H., Feng, L., Wu, R., Chen, G., and Zhao, J. ProMix: Combating label noise via maximizing clean sample utility. InProceedings of the International Joint Conference on Artificial Intelligence, pp. 4442–4450, 2023

2023
[42]

Learning from massive noisy labeled data for image classification

Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2691–2699, 2015

2015
[43]

Learning from noisy labels via dynamic loss thresholding.IEEE Transactions on Knowledge and Data Engineering, 2023

Zhang, M.-L. Learning from noisy labels via dynamic loss thresholding.IEEE Transactions on Knowledge and Data Engineering, 2023

2023
[44]

Active negative loss functions for learning with noisy labels

Ye, X., Li, X., Dai, S., Liu, T., Sun, Y., and Tong, W. Active negative loss functions for learning with noisy labels. InProceedings of the Advances in Neural Information Processing Systems, pp. 6917–6940, 2023

2023
[45] Yuan, S., Feng, L., and Liu, T. Early stopping against label noise without validation data. In Proceedings of the 12th International Conference on Learning Representations, 2024.
[46] Zhang, H., Balagopalan, A., Oufattole, N., Jeong, H., Wu, Y., Zhu, J., and Ghassemi, M. LEMoN: Label error detection using multimodal neighbors. In Proceedings of the International Conference on Machine Learning, 2025.
[47] Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the Advances in Neural Information Processing Systems, pp. 8792–8802, 2018.
[48] Zhu, Z., Dong, Z., and Liu, Y. Detecting corrupted labels without training a model to predict. In Proceedings of the International Conference on Machine Learning, pp. 27412–27427, 2022.

A. Datasets

Figure 6 presents representative samples from each class across the three datasets employed in our st...
