pith. sign in

arxiv: 2605.00591 · v1 · submitted 2026-05-01 · 💻 cs.CV

Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

Pith reviewed 2026-05-09 19:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords prompt tuninglabel noisevision-language modelsCLIPgradient suppressionrobust adaptationdouble softmax
0
0 comments X

The pith

Double-Softmax Prompt Tuning applies sequential normalization to suppress gradients from noisy labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prompt tuning on models like CLIP is vulnerable to label noise because mislabeled samples produce oversized gradients that override the pre-trained knowledge. Since the initial model is already near optimal, the tuning process benefits from built-in conservatism that limits extreme updates. DSPT achieves this by performing two softmax normalizations in sequence, which automatically creates a saturation zone that reduces the learning signal from high-error examples. This approach repurposes what is usually seen as a training drawback into an automatic noise filter. The result is a simple method that maintains performance on clean data while improving reliability when labels contain errors.

Core claim

Double-Softmax Prompt Tuning performs sequential probabilistic normalization on the output probabilities. This produces a self-adaptive saturation zone in the gradient flow that automatically reduces the magnitude of updates coming from high-error noisy samples while preserving updates from lower-error samples.

What carries the argument

Double-Softmax Prompt Tuning, which uses sequential probabilistic normalization to induce a self-adaptive saturation zone that filters noisy gradients during prompt adaptation.

If this is right

  • Prompt tuning for vision-language models becomes robust to label noise without any additional hyperparameters or architectural changes.
  • The method reaches state-of-the-art accuracy on multiple noisy benchmarks while remaining a simple drop-in replacement.
  • Gradient vanishing is converted from a training obstacle into a built-in mechanism that shields against noisy samples.
  • Both theoretical analysis and experiments demonstrate that the saturation zone adapts to the error level of each sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequential normalization idea could be tested in other fine-tuning regimes where strong pre-trained models meet noisy supervision.
  • If the saturation effect scales with model size, larger vision-language models might show even stronger automatic noise resistance.
  • Applying the double-softmax layer to the loss rather than only the output probabilities might extend the protection to other training objectives.

Load-bearing premise

CLIP already provides a near-optimal initialization, so adaptation must remain conservative and avoid extreme gradient steps triggered by noisy labels.

What would settle it

A direct comparison on a standard noisy-label benchmark in which ordinary prompt tuning reaches equal or higher accuracy than DSPT would show that the claimed saturation-based suppression does not deliver the stated robustness gain.

Figures

Figures reproduced from arXiv: 2605.00591 by Jiaqiang Huang, Jiaxin Qi, Jiayu Li, Sheng Zhou, Xiansheng Hua.

Figure 1
Figure 1. Figure 1: Accuracy curve of prompt-tuning(CoOp), our method, and zero-shot predictions with increasing noise rate under different settings. CoOp suffers from a significant performance drop and falls below zero-shot predictions, while our double-softmax cross￾entropy loss yields consistent noise robustness. alignment enables zero-shot image classification, where class names are inserted into prompts (e.g., “a photo o… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed method. (a) CLIP backbone model for prompt-tuning. (b) We observe that prompt-tuning generates gradient surges in samples with confident predictions inconsistent with noisy labels with respect to output logit terms, which are likely to be mislabeled and will therefore harm the training process. (c) Our double-softmax cross-entropy loss nullifies the influence of mismatch samples by… view at source ↗
Figure 4
Figure 4. Figure 4: Loss curves of correctly labeled and mislabeled samples in the training process on Caltech101. ing process. Additional studies in the appendix confirm that this phenomenon is consistent throughout the early stage of the training process. In addition, we record the average loss among samples with correct and incorrect labels in the first 20 epochs in the training process, with 40% symmetric and pair-flip no… view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy of LogitClip affected by τ compared to ours. consistently in diverse scenarios. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional studies on the accuracy curve of prompt-tuning(CoOp), our method, and zero-shot predictions with increasing noise rate under different settings on UCF101 and EuroSAT datasets. 0 5 10 15 20 epoch 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 Average Gradient corr-samples incorr-samples (a) CoOp on Caltech101 0 5 10 15 20 epoch 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Average Gradient corr-samples incorr-samples (b) … view at source ↗
Figure 7
Figure 7. Figure 7: Studies on the gradient curve of prompt-tuning(CoOp) and our method with respect to CLIP’s output logits on Caltech101 and DTD datasets with 60% symmetric noise • DTD: A texture recognition dataset with 47 classes of textual images. The training size of this dataset is 2820, and the testing size is 1692. • EuroSAT: A fine-grained satellite recognition dataset containing 10 types of different landscapes, wi… view at source ↗
Figure 8
Figure 8. Figure 8: Additional studies on the effects of hyperparameter τ of LogitClip compared to ours on UCF101 dataset. 0 5 10 15 20 epoch 0 2 4 6 Average loss corr-samples incorr-samples (a) CoOp 0 5 10 15 20 epoch 0 2 4 6 Average loss corr-samples incorr-samples (b) DSPT [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional studies on the accuracy curve of prompt-tuning(CoOp) and our method in the early training stage on Catltech101 dataset with 80% symmetric noise [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
read the original abstract

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms ``gradient vanishing'', traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that prompt tuning in contrastive vision-language models like CLIP is highly sensitive to label noise, as mislabeled samples generate large gradients that overwhelm pre-trained priors. It proposes Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method using sequential probabilistic normalization to induce a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates from clean ones. The authors provide theoretical analysis and empirical evidence that this mechanism achieves adaptive suppression, repurposing gradient vanishing as a noise-filtering shield, and report state-of-the-art robustness across noisy benchmarks.

Significance. If validated, the result would be significant as a simple, intrinsic, hyperparameter-free baseline for robust prompt tuning that directly leverages CLIP's zero-shot initialization without added architectures or tuning. The conceptual reframing of gradient vanishing as a principled filter could influence practical noisy-label adaptation in vision-language models, provided the mechanism reliably separates noise from informative high-loss clean samples.

major comments (3)
  1. [Abstract] Abstract: The load-bearing claim that the saturation zone 'suppresses gradients from high-error noisy samples while maintaining informative updates' assumes loss magnitude is a reliable proxy for label noise. This may not hold for clean samples poorly aligned with CLIP priors (high loss despite correct labels), violating the 'maintains informative updates' guarantee. The manuscript must address this scenario explicitly, e.g., via targeted experiments on distribution-shifted clean data.
  2. [Theoretical analysis section] Theoretical analysis section: The abstract asserts 'both theoretical analysis and empirical evidence' on adaptive suppression, yet no equations, proof sketches, or definitions of the sequential probabilistic normalization and saturation zone are visible. Without these, it cannot be verified whether the suppression effect follows directly from the normalization or reduces to a fitted scaling factor, undermining the 'hyperparameter-free' and 'intrinsic' claims.
  3. [Experiments section] Experiments section: The SOTA robustness claim across 'various noisy benchmarks' is central but lacks detail on controls for the skeptic's concern (clean high-loss samples suppressed). If benchmarks only include in-distribution clean data, the results do not falsify the failure mode where the mechanism harms informative updates.
minor comments (1)
  1. [Abstract] The abstract could briefly name the specific benchmarks and noise rates used to substantiate the SOTA claim and allow immediate assessment of scope.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important considerations regarding the assumptions underlying our method and the strength of our validation. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The load-bearing claim that the saturation zone 'suppresses gradients from high-error noisy samples while maintaining informative updates' assumes loss magnitude is a reliable proxy for label noise. This may not hold for clean samples poorly aligned with CLIP priors (high loss despite correct labels), violating the 'maintains informative updates' guarantee. The manuscript must address this scenario explicitly, e.g., via targeted experiments on distribution-shifted clean data.

    Authors: We agree that loss magnitude is not a perfect proxy and that clean samples with high loss due to misalignment with CLIP priors represent a valid edge case. In the revised manuscript we will add targeted experiments on distribution-shifted clean data (e.g., ImageNet variants with controlled shifts) to quantify the behavior of the saturation zone on such samples. These results will demonstrate that the self-adaptive mechanism primarily saturates extreme outliers while still permitting informative gradient updates from high-loss but correctly labeled examples. revision: yes

  2. Referee: [Theoretical analysis section] Theoretical analysis section: The abstract asserts 'both theoretical analysis and empirical evidence' on adaptive suppression, yet no equations, proof sketches, or definitions of the sequential probabilistic normalization and saturation zone are visible. Without these, it cannot be verified whether the suppression effect follows directly from the normalization or reduces to a fitted scaling factor, undermining the 'hyperparameter-free' and 'intrinsic' claims.

    Authors: The theoretical analysis appears in Section 3, where we define sequential probabilistic normalization and derive the resulting saturation zone. To address the concern that these elements may not be sufficiently prominent, we will expand the section with explicit equations for the double-softmax gradient, a concise proof sketch showing the adaptive saturation property, and a direct comparison to simple scaling to confirm the effect is intrinsic to the normalization rather than an added hyperparameter. revision: partial

  3. Referee: [Experiments section] Experiments section: The SOTA robustness claim across 'various noisy benchmarks' is central but lacks detail on controls for the skeptic's concern (clean high-loss samples suppressed). If benchmarks only include in-distribution clean data, the results do not falsify the failure mode where the mechanism harms informative updates.

    Authors: We acknowledge the need for explicit controls against the failure mode of suppressing informative high-loss clean samples. In addition to the existing noisy-label benchmarks, the revised experiments section will include ablations and controls on clean but distribution-shifted data to verify that DSPT preserves useful updates. These additions will directly test the skeptic's concern and strengthen the empirical support for the adaptive suppression claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; DSPT mechanism derived from explicit normalization design

full rationale

The paper defines DSPT via a concrete sequential probabilistic normalization (double-softmax) and then derives its saturation-zone suppression effect from the resulting gradient magnitudes. This is a standard forward derivation from the chosen functional form rather than a self-referential loop. The CLIP near-optimality premise is stated as an explicit modeling assumption, not obtained by fitting or by self-citation. No equations reduce the claimed noise-filtering property to a fitted hyperparameter or to a prior result whose only support is the present work. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pre-trained CLIP weights are near-optimal and that large gradient steps from noisy labels are therefore harmful; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption CLIP already provides a near-optimal initialization for prompt tuning
    Explicitly stated as the basis for advocating conservative adaptation.

pith-pipeline@v0.9.0 · 5480 in / 1015 out tokens · 29428 ms · 2026-05-09T19:01:07.894709+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references

  1. [1]

    FirstName LastName , title =

  2. [2]

    FirstName Alpher , title =

  3. [3]

    Journal of Foo , volume = 13, number = 1, pages =

    FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

  4. [4]

    Journal of Foo , volume = 14, number = 1, pages =

    FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

  5. [5]

    FirstName Alpher and FirstName Gamow , title =

  6. [6]

    Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , booktitle =

    Li Fei. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , booktitle =

  7. [7]

    3D Object Representations for Fine-Grained Categorization , booktitle =

    Jonathan Krause and Michael Stark and Jia Deng and Li Fei. 3D Object Representations for Fine-Grained Categorization , booktitle =

  8. [8]

    Parkhi and Andrea Vedaldi and Andrew Zisserman and C

    Omkar M. Parkhi and Andrea Vedaldi and Andrew Zisserman and C. V. Jawahar , title =. 2012

  9. [9]

    Automated Flower Classification over a Large Number of Classes , booktitle =

    Maria. Automated Flower Classification over a Large Number of Classes , booktitle =

  10. [10]

    Food-101 - Mining Discriminative Components with Random Forests , booktitle =

    Lukas Bossard and Matthieu Guillaumin and Luc Van Gool , editor =. Food-101 - Mining Discriminative Components with Random Forests , booktitle =

  11. [11]

    Blaschko and Andrea Vedaldi , title =

    Subhransu Maji and Esa Rahtu and Juho Kannala and Matthew B. Blaschko and Andrea Vedaldi , title =. CoRR , volume =. 2013 , eprinttype =

  12. [12]

    Mircea Cimpoi and Subhransu Maji and Iasonas Kokkinos and Sammy Mohamed and Andrea Vedaldi , title =. 2014

  13. [13]

    Patrick Helber and Benjamin Bischke and Andreas Dengel and Damian Borth , title =

  14. [14]

    CoRR , volume =

    Khurram Soomro and Amir Roshan Zamir and Mubarak Shah , title =. CoRR , volume =. 2012 , eprinttype =

  15. [15]

    Kaiyang Zhou and Jingkang Yang and Chen Change Loy and Ziwei Liu , title =. Int. J. Comput. Vis. , volume =

  16. [16]

    Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? , booktitle =

    Cheng. Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? , booktitle =

  17. [17]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library , booktitle =

    Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K. PyTorch: An Imperative Style, High-Performance Deep Learning Library , booktitle =

  18. [18]

    Learning Transferable Visual Models From Natural Language Supervision , booktitle =

    Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever , editor =. Learning Transferable Visual Models From Natural Language Supervision , booktitle =

  19. [19]

    Kaiyang Zhou and Jingkang Yang and Chen Change Loy and Ziwei Liu , title =

  20. [20]

    Junnan Li and Dongxu Li and Caiming Xiong and Steven C. H. Hoi , editor =. International Conference on Machine Learning,

  21. [21]

    CoRR , volume =

    Tony Huang and Jack Chu and Fangyun Wei , title =. CoRR , volume =. 2022 , doi =

  22. [22]

    Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models , booktitle =

    Manli Shu and Weili Nie and De. Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models , booktitle =

  23. [23]

    Mitigating Memorization of Noisy Labels by Clipping the Model Prediction , booktitle =

    Hongxin Wei and Huiping Zhuang and Renchunzi Xie and Lei Feng and Gang Niu and Bo An and Yixuan Li , editor =. Mitigating Memorization of Noisy Labels by Clipping the Model Prediction , booktitle =

  24. [24]

    Mitigating Neural Network Overconfidence with Logit Normalization , booktitle =

    Hongxin Wei and Renchunzi Xie and Hao Cheng and Lei Feng and Bo An and Yixuan Li , editor =. Mitigating Neural Network Overconfidence with Logit Normalization , booktitle =

  25. [25]

    2024 , doi =

    Yuncheng Guo and Xiaodong Gu , title =. 2024 , doi =

  26. [26]

    Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,

    Chaowei Fang and Hangfei Ma and Zhihao Li and De Cheng and Yue Zhang and Guanbin Li , title =. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,. 2025 , doi =

  27. [27]

    Bikang Pan and Qun Li and Xiaoying Tang and Wei Huang and Zhen Fang and Feng Liu and Jingya Wang and Jingyi Yu and Ye Shi , title =

  28. [28]

    Junnan Li and Richard Socher and Steven C. H. Hoi , title =. 8th International Conference on Learning Representations,

  29. [29]

    Deep Patel and P. S. Sastry , title =

  30. [30]

    Haobo Wang and Ruixuan Xiao and Yixuan Li and Lei Feng and Gang Niu and Gang Chen and Junbo Zhao , title =

  31. [31]

    Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting , booktitle =

    Jun Shu and Qi Xie and Lixuan Yi and Qian Zhao and Sanping Zhou and Zongben Xu and Deyu Meng , editor =. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting , booktitle =

  32. [32]

    Aritra Ghosh and Himanshu Kumar and P. S. Sastry , editor =. Robust Loss Functions under Label Noise for Deep Neural Networks , booktitle =

  33. [33]

    Erfani and James Bailey , title =

    Xingjun Ma and Hanxun Huang and Yisen Wang and Simone Romano and Sarah M. Erfani and James Bailey , title =. Proceedings of the 37th International Conference on Machine Learning,

  34. [34]

    Proceedings of the 37th International Conference on Machine Learning,

    Michal Lukasik and Srinadh Bhojanapalli and Aditya Krishna Menon and Sanjiv Kumar , title =. Proceedings of the 37th International Conference on Machine Learning,

  35. [35]

    Reed and Honglak Lee and Dragomir Anguelov and Christian Szegedy and Dumitru Erhan and Andrew Rabinovich , editor =

    Scott E. Reed and Honglak Lee and Dragomir Anguelov and Christian Szegedy and Dumitru Erhan and Andrew Rabinovich , editor =. Training Deep Neural Networks on Noisy Labels with Bootstrapping , booktitle =

  36. [36]

    9th International Conference on Learning Representations,

    Xiaobo Xia and Tongliang Liu and Bo Han and Chen Gong and Nannan Wang and Zongyuan Ge and Yi Chang , title =. 9th International Conference on Learning Representations,

  37. [37]

    Asymmetric Loss Functions for Learning with Noisy Labels , booktitle =

    Xiong Zhou and Xianming Liu and Junjun Jiang and Xin Gao and Xiangyang Ji , editor =. Asymmetric Loss Functions for Learning with Noisy Labels , booktitle =

  38. [38]

    Sabuncu , editor =

    Zhilu Zhang and Mert R. Sabuncu , editor =. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels , booktitle =

  39. [39]

    Metaxas and Chao Chen , title =

    Songzhu Zheng and Pengxiang Wu and Aman Goswami and Mayank Goswami and Dimitris N. Metaxas and Chao Chen , title =. Proceedings of the 37th International Conference on Machine Learning,

  40. [40]

    Yu Yao and Tongliang Liu and Bo Han and Mingming Gong and Jiankang Deng and Gang Niu and Masashi Sugiyama , editor =. Dual. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , year =

  41. [41]

    Dumais , title =

    Guoqing Zheng and Ahmed Hassan Awadallah and Susan T. Dumais , title =. Thirty-Fifth

  42. [42]

    9th International Conference on Learning Representations,

    Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , title =. 9th International Conference on Learning Representations,

  43. [43]

    Beyond Class-Conditional Assumption:

    Pengfei Chen and Junjie Ye and Guangyong Chen and Jingwei Zhao and Pheng. Beyond Class-Conditional Assumption:. Thirty-Fifth

  44. [44]

    Are Anchor Points Really Indispensable in Label-Noise Learning? , booktitle =

    Xiaobo Xia and Tongliang Liu and Nannan Wang and Bo Han and Chen Gong and Gang Niu and Masashi Sugiyama , editor =. Are Anchor Points Really Indispensable in Label-Noise Learning? , booktitle =

  45. [45]

    Khan and Fahad Shahbaz Khan , title =

    Muhammad Uzair Khattak and Hanoona Abdul Rasheed and Muhammad Maaz and Salman H. Khan and Fahad Shahbaz Khan , title =

  46. [46]

    Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun , title =. 2016

  47. [47]

    Handbook of Systemic Autoimmune Diseases , volume=

    Learning multiple layers of features from tiny images , author=. Handbook of Systemic Autoimmune Diseases , volume=

  48. [48]

    mixup: Beyond Empirical Risk Minimization , booktitle =

    Hongyi Zhang and Moustapha Ciss. mixup: Beyond Empirical Risk Minimization , booktitle =