pith. sign in

arxiv: 2605.31090 · v1 · pith:MVFQILJPnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

On Revisiting Entropy for Identifying Mislabeled Images

Pith reviewed 2026-06-28 23:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords mislabeled data detectionprediction entropytraining dynamicsmedical imaginglabel noisedeep neural networksCLIP
0
0 comments X

The pith

A signed integral of prediction entropy changes during training identifies mislabeled images more reliably than prior techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prediction entropy for correctly labeled samples decreases steadily over training epochs while remaining high for mislabeled samples. It introduces the signed entropy integral (SEI) to measure both the size and the direction of these entropy shifts as a detection score. This statistic is shown to work across standard classifiers and CLIP models. Experiments on four medical imaging datasets with varied modalities and pathologies support its use for cleaning training data. A reader would care because label errors are common in medical domains and degrade model performance.

Core claim

The central claim is that correctly labeled samples exhibit consistent entropy decrease during training while mislabeled samples maintain relatively high entropy throughout the process; the signed entropy integral (SEI) statistic captures both the magnitude and temporal trend of prediction entropy across training epochs and achieves state-of-the-art performance in mislabeled data identification on four medical imaging datasets while remaining computationally efficient.

What carries the argument

signed entropy integral (SEI), a cumulative sum of signed changes in prediction entropy between consecutive training epochs that encodes both level and trend of label correctness

If this is right

  • SEI outperforms existing mislabeled-data detectors on four medical imaging datasets spanning multiple modalities and pathologies.
  • The statistic applies directly to both standard classification networks and CLIP-based models without architectural changes.
  • SEI requires only the entropy values already computed during normal training, preserving computational efficiency.
  • The method identifies mislabeled samples by a single scalar per image after training completes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the entropy-separation pattern holds outside medicine, SEI could be tested on standard image-classification benchmarks with synthetic label noise to check broader applicability.
  • Entropy monitoring might be combined with loss-based cleaning methods to improve detection precision, though the paper evaluates SEI alone.
  • The same entropy trend could potentially guide early stopping or learning-rate schedules when label noise is suspected.

Load-bearing premise

Correctly labeled samples exhibit consistent entropy decrease during training while mislabeled samples maintain relatively high entropy throughout the process.

What would settle it

A controlled training run on a dataset with verified ground-truth labels in which the entropy trajectories of correct and incorrect samples fail to separate according to the described pattern would falsify the core premise.

Figures

Figures reproduced from arXiv: 2605.31090 by Chunlei Li, Guanglu Dong, Jingliang Hu, Lichao Mou, Pengfei Li, Xiao Xiang Zhu, Yilei Shi, Zixuan Zheng.

Figure 1
Figure 1. Figure 1: Training dynamics of prediction entropy for correctly labeled versus mislabeled samples. Left: Nevus images from the ISIC dataset, comparing correctly labeled samples (ground truth: nevus; given label: nevus) with mislabeled samples (ground truth: nevus; given label: other skin lesion categories). Right: Grade 4 diabetic retinopathy (DR) images from the DeepDRiD dataset. In both cases, correctly labeled sa… view at source ↗
Figure 2
Figure 2. Figure 2: Entropy trajectories for a mislabeled sample (ground truth: nevus; given label: melanoma) and a hard clean sample (ground truth: nevus; given label: nevus). Despite differing label correctness, their entropy curves are nearly indistinguishable. This illustrates that entropy alone cannot reliably distinguish mislabeled data from challenging but clean examples. 2.4. Identifying Mislabeled Data 2.4.1. Signed … view at source ↗
Figure 3
Figure 3. Figure 3: Training dynamics analysis through label-prediction alignment patterns over time. Bubble timeline charts illustrate the evolution of three sample categories during training: (a) easy clean samples, (b) hard clean samples, and (c) mislabeled samples. At each time step, • green circles indicate alignment between predicted and given labels, while • red circles denote misalignment. Circle size and opacity enco… view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of SEI. The plots depict signed entropy curves over training epochs for easy clean (left), hard clean (middle), and mislabeled (right) samples. Each curve is averaged over 200 samples, and the signed area under each curve corresponds to the SEI. Correctly labeled samples yield larger SEIs than mislabeled ones. incorrectly labeled instances using training loss in a pro￾posed sample sieve framew… view at source ↗
Figure 5
Figure 5. Figure 5: F1 score comparison between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix. 4.3.6. Limitation Our work is designed for the standard hard-label setting, where each sample has a single assigned label and the goal is to identify mislabeled instances under this assumption. Settings involving uncertainty, annotato… view at source ↗
Figure 6
Figure 6. Figure 6: presents representative samples from each class across the three datasets employed in our study: ISIC, DeepDRiD, and PANDA. We display one exemplar image per class, organized with rows corresponding to individual datasets and columns representing distinct classes. This visualization facilitates direct comparison of class-specific visual characteristics. The corresponding text prompts utilized for training … view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics of prediction entropy for correctly labeled versus mislabeled samples. Top-left: melanoma images from the ISIC dataset. Top-right: grade 0 diabetic retinopathy (DR) images from the DeepDRiD dataset. Bottom-left: benign glandular epithelium images from the PANDA dataset. Bottom-right: Gleason 5 cancerous epithelium images from the PANDA dataset [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of SEI using melanoma images from the ISIC dataset. The plots show signed entropy curves across training epochs for easy clean (left), hard clean (middle), and mislabeled (right) samples. Each curve is averaged over 200 samples, and the signed area under the curve represents the SEI. Correctly labeled samples consistently exhibit larger SEIs than mislabeled ones. -1.0 -0.5 0.0 0.5 1.0 Signed E… view at source ↗
Figure 9
Figure 9. Figure 9 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Illustration of SEI using Gleason 5 images from the PANDA dataset. F. Additional Results on Downstream Image Classification Beyond F1 score, we also report accuracy and AUC for SCE, M-correction, DivideMix, and ProMix, both with and without SEI, under the same protocol as Section 4.3.4. Results are presented in [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Score distributions for clean and noisy samples on the PANDA dataset. We compare our proposed SEI statistic (bottom row) against the unsigned Shannon entropy integral baseline (top row) for noise rates 𝜂 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Each column corresponds to increasing noise levels from left to right. The SEI statistic demonstrates better separation between clean and noisy sample distributions at all noi… view at source ↗
Figure 12
Figure 12. Figure 12: Accuracy comparison under confusion-calibrated noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix. 0.1 0.2 0.3 0.4 0.5 81.0 87.0 93.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 81.0 87.5 94.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 80.0 86.5 93.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 85.0 90.0 95.0 AUC (%) [PITH_FULL_IMAGE:figures/fu… view at source ↗
Figure 13
Figure 13. Figure 13: AUC comparison under confusion-calibrated noise between baseline noisy label learning methods (green) and their SEI￾enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: F1 score comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix. 0.1 0.2 0.3 0.4 0.5 68.0 76.0 84.0 Accuracy (%) 0.1 0.2 0.3 0.4 0.5 71.0 78.0 85.0 Accuracy (%) 0.1 0.2 0.3 0.4 0.5 71.0 77.5 84.0 Accuracy (%) 0.1 0.2 0.3 0.4 0.5 72.0 79.5 87.0 Accuracy (%) [PITH_FULL_IMAGE:f… view at source ↗
Figure 15
Figure 15. Figure 15: Accuracy comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix. 0.1 0.2 0.3 0.4 0.5 80.0 87.0 94.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 81.0 87.5 94.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 82.0 88.0 94.0 AUC (%) 0.1 0.2 0.3 0.4 0.5 84.0 89.5 95.0 AUC (%) [PITH_FULL_IMAGE:figures/full_fig_p017… view at source ↗
Figure 16
Figure 16. Figure 16: AUC comparison under symmetric noise between baseline noisy label learning methods (green) and their SEI-enhanced variants (red). From left to right: SCE, M-correction, DivideMix, and ProMix. (5e-3 or 1e-2). This trend mirrors standard classification training, where an appropriate learning rate is needed for effective learning of the base model itself [PITH_FULL_IMAGE:figures/full_fig_p017_16.png] view at source ↗
read the original abstract

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the signed entropy integral (SEI) statistic for mislabeled sample detection. It is grounded in the empirical observation that correctly labeled examples exhibit consistent prediction-entropy decrease over training epochs while mislabeled examples maintain relatively high entropy; SEI integrates both magnitude and sign of this trend. The authors report that SEI achieves state-of-the-art identification performance on four medical imaging datasets spanning modalities and pathologies when used with CLIP backbones, while remaining computationally lightweight. Code is released.

Significance. If the entropy-dynamics observation proves robust, SEI would supply a simple, training-dynamics-based alternative to existing mislabel detectors that does not require auxiliary models or clean validation sets. The public code release is a clear strength. The restriction to CLIP backbones and four medical datasets, however, together with the purely empirical grounding, caps the result’s generality and long-term significance.

major comments (2)
  1. [Abstract] Abstract: the central claim rests on the stated observation that “correctly labeled samples exhibit consistent entropy decrease … while mislabeled samples maintain relatively high entropy.” No derivation, no analysis of dependence on loss, optimizer, schedule, or label-noise fraction, and no examination of the regime in which both classes eventually reach low entropy are supplied; this directly affects whether SEI remains discriminative.
  2. [Experiments] Experiments (implied by performance claims): the SOTA assertion is presented without error bars, without ablation on the sign choice or integration limits inside SEI, and without correction for multiple testing across four datasets; these omissions make it impossible to judge whether the reported outperformance is statistically reliable or dataset-specific.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it included the explicit integral definition of SEI rather than only describing its motivation.
  2. A short limitations paragraph discussing possible failure modes (e.g., late-training entropy collapse for both classes) would improve transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The work is empirical and focuses on the practical performance of SEI for mislabeled sample detection in medical imaging with CLIP backbones. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim rests on the stated observation that “correctly labeled samples exhibit consistent entropy decrease … while mislabeled samples maintain relatively high entropy.” No derivation, no analysis of dependence on loss, optimizer, schedule, or label-noise fraction, and no examination of the regime in which both classes eventually reach low entropy are supplied; this directly affects whether SEI remains discriminative.

    Authors: The central claim is presented explicitly as an empirical observation from training dynamics on the evaluated datasets, not as a theoretically derived property. The manuscript does not claim universality across all losses, optimizers, or noise levels, and the method is proposed for its utility in the medical imaging setting with CLIP. We did not include analysis of the late-training regime where both classes may reach low entropy because our experiments focus on the epochs where the distinction is observed and useful for detection. We can revise the abstract and add a limitations paragraph clarifying the empirical grounding and scope. revision: partial

  2. Referee: [Experiments] Experiments (implied by performance claims): the SOTA assertion is presented without error bars, without ablation on the sign choice or integration limits inside SEI, and without correction for multiple testing across four datasets; these omissions make it impossible to judge whether the reported outperformance is statistically reliable or dataset-specific.

    Authors: We agree that error bars from repeated runs and ablations on SEI components (sign and integration limits) would strengthen the results and will add them in revision. The four datasets are distinct in modality and pathology; we view the consistent outperformance as supportive rather than requiring formal multiple-testing correction, but we can add a clarifying statement. The SOTA claim is based on the reported metrics outperforming prior methods on these datasets. revision: yes

Circularity Check

0 steps flagged

No circularity; method is an empirical statistic validated on held-out datasets

full rationale

The paper grounds SEI in an empirical observation of entropy trajectories and defines the statistic directly from training dynamics; it then reports experimental performance on four medical datasets. No equations, fitted parameters, or self-citations are shown that reduce the claimed detection performance to the inputs by construction. The derivation chain is therefore self-contained and externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond the SEI definition itself.

pith-pipeline@v0.9.1-grok · 5730 in / 933 out tokens · 17472 ms · 2026-06-28T23:10:38.906490+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references

  1. [1]

    Tackling algorith- mic bias and promoting transparency in health datasets: the standing together consensus recommendations.The Lancet Digital Health, 7:e64–e88, 2025

    Cole-Lewis, H., Glocker, B., et al. Tackling algorith- mic bias and promoting transparency in health datasets: the standing together consensus recommendations.The Lancet Digital Health, 7:e64–e88, 2025

  2. [2]

    Unsupervised label noise modeling and loss cor- rection

    Arazo, E., Ortego, D., Albert, P., O’Connor, N., and McGuin- ness, K. Unsupervised label noise modeling and loss cor- rection. InProceedings of the International Conference on Machine Learning, pp. 312–321, 2019

  3. [3]

    Dirichlet-based per-sample weighting by transition matrix for noisy la- bel learning

    Bae, H., Shin, S., Na, B., and Moon, I. Dirichlet-based per-sample weighting by transition matrix for noisy la- bel learning. InProceedings of the 12th International Conference on Learning Representations, 2024

  4. [4]

    Chen, J., Ramanathan, V., Xu, T., and Martel, A. L. De- tecting noisy labels with repeated cross-validations. In Proceedings of the Medical Image Computing and Com- puter Assisted Intervention, pp. 197–207, 2024

  5. [5]

    Understanding and utilizing deep neural networks trained with noisy labels

    Chen, P., Liao, B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. InProceedings of the International Conference on Machine Learning, pp. 1062–1070, 2019

  6. [6]

    X., and Mou, L

    Chen, Y., Wang, Q., Li, C., Hu, J., Shi, Y., Xiong, S., Zhu, X. X., and Mou, L. Propl: Universal semi-supervised ultrasound image segmentation via prompt-guided pseudo- labeling. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 3101–3110, 2026

  7. [7]

    Class-dependent label-noise learning with cycle-consistency regularization

    Han, B., and Liu, T. Class-dependent label-noise learning with cycle-consistency regularization. InProceedings of the Advances in Neural Information Processing Systems, pp. 11104–11116, 2022

  8. [8]

    Learning with instance-dependent label noise: A sample sieve approach

    Cheng, H., Zhu, Z., Li, X., Gong, Y., Sun, X., and Liu, Y. Learning with instance-dependent label noise: A sample sieve approach. InProceedings of the 9th International Conference on Learning Representations, 2021

  9. [9]

    Weakly supervised learning with side information for noisy labeled images

    Cheng, L., Zhou, X., Zhao, L., Li, D., Shang, H., Zheng, Y., Pan, P., and Xu, Y. Weakly supervised learning with side information for noisy labeled images. InProceedings of the European Conference on Computer Vision, pp. 306–321, 2020

  10. [10]

    X., and Mou, L

    Dong, L., Tan, Q., Li, C., Hu, J., Shi, Y., Dong, W., Zhu, X. X., and Mou, L. Dual distillation for few-shot anomaly detection. InProceedings of the 14th International Con- ference on Learning Representations, 2026. 9 On Revisiting Entropy for Identifying Mislabeled Images

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the 9th International Conference on Learning Representations, 2021

  12. [12]

    and Azizpour, H

    Englesson, E. and Azizpour, H. Robust classification via regression for learning with noisy labels. InProceed- ings of the 12th International Conference on Learning Representations, 2024

  13. [13]

    Deep Self-Learning from noisy labels

    Han, J., Luo, P., and Wang, X. Deep Self-Learning from noisy labels. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pp. 5138–5147, 2019

  14. [14]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016

  15. [15]

    O2U-Net: A simple noisy label detection approach for deep neural networks

    Huang, J., Qu, L., Jia, R., and Zhao, B. O2U-Net: A simple noisy label detection approach for deep neural networks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3325–3333, 2019

  16. [16]

    N., Lungren, M

    Patel, B. N., Lungren, M. P., and Ng, A. Y. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 590–597, 2019

  17. [17]

    Delving into sample loss curve to embrace noisy and imbalanced data

    Jiang, S., Li, J., Wang, Y., Huang, B., Zhang, Z., and Xu, T. Delving into sample loss curve to embrace noisy and imbalanced data. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 7024–7032, 2022

  18. [18]

    Splitnet: Learnable clean-noisy label splitting for learning with noisy labels

    Kim, D., Ryoo, K., Cho, H., and Kim, S. Splitnet: Learnable clean-noisy label splitting for learning with noisy labels. International Journal of Computer Vision, 133(2):549– 566, 2025

  19. [19]

    Learning discriminative dynamics with label corruption for noisy label detection

    Kim, S., Lee, D., Kang, S., Chae, S., Jang, S., and Yu, H. Learning discriminative dynamics with label corruption for noisy label detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22477–22487, 2024

  20. [20]

    J., and Ronneberger, O

    Rezende, D. J., and Ronneberger, O. A probabilistic U-Net for segmentation of ambiguous images. InProceedings of the Advances in Neural Information Processing Systems, pp. 6965–6975, 2018

  21. [21]

    Scale-aware contrastive reverse distillation for unsupervised medical anomaly detection

    Li, C., Shi, Y., Hu, J., Zhu, X., and Mou, L. Scale-aware contrastive reverse distillation for unsupervised medical anomaly detection. InProceedings of the 13th Interna- tional Conference on Learning Representations, 2025

  22. [22]

    Li, J., Socher, R., and Hoi, S. C. H. DivideMix: Learning with noisy labels as semi-supervised learning. InPro- ceedings of the 8th International Conference on Learning Representations, 2020

  23. [23]

    Ciompi, F., Ghafoorian, M., van der Laak, J. A. W. M., van Ginneken, B., and S ´anchez, C. I. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017

  24. [24]

    Deepdrid: Diabetic retinopathy-grading and image quality estimation challenge.Patterns, 3:100512, 2022

    Jia, W., Shen, D., Sheng, B., and Zhang, P. Deepdrid: Diabetic retinopathy-grading and image quality estimation challenge.Patterns, 3:100512, 2022

  25. [25]

    Early-learning regularization prevents memorization of noisy labels

    Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-Granda, C. Early-learning regularization prevents memorization of noisy labels. InProceedings of the Advances in Neural Information Processing Systems, pp. 20331–20342, 2020

  26. [26]

    K., and Kumar, S

    Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Does label smoothing mitigate label noise? In Proceedings of the International Conference on Machine Learning, pp. 6448–6458, 2020

  27. [27]

    Semi-supervised medical image segmentation through dual-task consis- tency

    Luo, X., Chen, J., Song, T., and Wang, G. Semi-supervised medical image segmentation through dual-task consis- tency. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 8801–8809, 2021

  28. [28]

    Difficulty-aware attention network with confidence learning for medical image segmentation

    Nie, D., Wang, L., Xiang, L., Zhou, S., Adeli, E., and Shen, D. Difficulty-aware attention network with confidence learning for medical image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 1085–1092, 2019

  29. [29]

    G., Jiang, L., and Chuang, I

    Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels.Journal of Artificial Intelligence Research, 70:1373–1411, 2021

  30. [30]

    R., and Weinberger, K

    Pleiss, G., Zhang, T., Elenberg, E. R., and Weinberger, K. Q. Identifying mislabeled data using the area under the margin ranking. InProceedings of the Advances in Neural Information Processing Systems, pp. 17044–17056, 2020. 10 On Revisiting Entropy for Identifying Mislabeled Images

  31. [31]

    Learning transferable visual models from natural language supervision

    Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pp. 8748–8763, 2021

  32. [32]

    Learning to reweight examples for robust deep learning

    Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. InPro- ceedings of the International Conference on Machine Learning, pp. 4334–4343, 2018

  33. [33]

    Imagenet large scale visual recognition challenge

    Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015

  34. [34]

    A survey of label-noise deep learning for medical image analysis.Medical Image Analysis, 95:103166, 2024

    Shi, J., Zhang, K., Guo, C., Yang, Y., Xu, Y., and Wu, J. A survey of label-noise deep learning for medical image analysis.Medical Image Analysis, 95:103166, 2024

  35. [35]

    SELFIE: Refurbishing unclean samples for robust deep learning

    Song, H., Kim, M., and Lee, J. SELFIE: Refurbishing unclean samples for robust deep learning. InProceedings of the International Conference on Machine Learning, pp. 5907–5915, 2019

  36. [36]

    Noisegpt: La- bel noise detection and rectification through probability curvature

    Wang, H., Huang, Z., Lin, Z., and Liu, T. Noisegpt: La- bel noise detection and rectification through probability curvature. InProceedings of the Advances in Neural Information Processing Systems, pp. 120159–120183, 2024

  37. [37]

    Symmetric cross entropy for robust learning with noisy labels

    Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. Symmetric cross entropy for robust learning with noisy labels. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330, 2019

  38. [38]

    Learning with noisy labels revisited: A study using real-world human annotations

    Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., and Liu, Y. Learning with noisy labels revisited: A study using real-world human annotations. InProceedings of the 10th International Conference on Learning Representations, 2022

  39. [39]

    Vision-Language models are strong noisy label detectors

    Wei, T., Li, H., Li, C., Shi, J., Li, Y., and Zhang, M. Vision-Language models are strong noisy label detectors. InProceedings of the Advances in Neural Information Processing Systems, pp. 58154–58173, 2024

  40. [40]

    Robust early-learning: Hindering the memorization of noisy labels

    Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. InProceedings of the 9th International Conference on Learning Representations, 2021

  41. [41]

    ProMix: Combating label noise via maximizing clean sample utility

    Xiao, R., Dong, Y., Wang, H., Feng, L., Wu, R., Chen, G., and Zhao, J. ProMix: Combating label noise via maximizing clean sample utility. InProceedings of the International Joint Conference on Artificial Intelligence, pp. 4442–4450, 2023

  42. [42]

    Learning from massive noisy labeled data for image classification

    Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2691–2699, 2015

  43. [43]

    Learning from noisy labels via dynamic loss thresholding.IEEE Transactions on Knowledge and Data Engineering, 2023

    Zhang, M.-L. Learning from noisy labels via dynamic loss thresholding.IEEE Transactions on Knowledge and Data Engineering, 2023

  44. [44]

    Active negative loss functions for learning with noisy labels

    Ye, X., Li, X., Dai, S., Liu, T., Sun, Y., and Tong, W. Active negative loss functions for learning with noisy labels. InProceedings of the Advances in Neural Information Processing Systems, pp. 6917–6940, 2023

  45. [45]

    Early stopping against label noise without validation data

    Yuan, S., Feng, L., and Liu, T. Early stopping against label noise without validation data. InProceedings of the 12th International Conference on Learning Representations, 2024

  46. [46]

    LEMoN: Label error detection using multimodal neighbors

    Zhang, H., Balagopalan, A., Oufattole, N., Jeong, H., Wu, Y., Zhu, J., and Ghassemi, M. LEMoN: Label error detection using multimodal neighbors. InProceedings of the International Conference on Machine Learning, 2025

  47. [47]

    and Sabuncu, M

    Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. InProceedings of the Advances in Neural Information Processing Systems, pp. 8792–8802, 2018

  48. [48]

    Strong aug

    Zhu, Z., Dong, Z., and Liu, Y. Detecting corrupted labels without training a model to predict. InProceedings of the International Conference on Machine Learning, pp. 27412–27427, 2022. 11 On Revisiting Entropy for Identifying Mislabeled Images A. Datasets Figure 6 presents representative samples from each class across the three datasets employed in our st...