pith. sign in

arxiv: 2605.19956 · v1 · pith:FDZWHDXGnew · submitted 2026-05-19 · 💻 cs.CV

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords test-time adaptationadversarial robustnessvision-language modelsprompt tuningattention mechanismfine-grained robustnessCLIP
0
0 comments X

The pith

Refined gradient attention rollout identifies surviving semantic regions to guide test-time prompt tuning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Attention-Guided Test-Time Prompt Tuning (A-TPT) to protect vision-language models such as CLIP from adversarial attacks at inference time. Standard test-time methods rely on uniform multi-view augmentations that often erase fine-grained discriminative details. By first sharpening the gradient attention rollout to locate regions that remain meaningful after an attack, the approach then applies spatially varying augmentation strengths and ensembles these views to tune prompts. Experiments show gains on both attacked inputs and clean images.

Core claim

A refined gradient attention rollout locates semantically meaningful image regions that persist under adversarial perturbations. These regions then direct the intensity of spatially varying augmentations and support multi-view ensembles during test-time prompt tuning, allowing adaptation while preserving the information needed for accurate classification.

What carries the argument

The refined gradient attention rollout mechanism that identifies semantically meaningful regions surviving under adversarial attacks and uses them to guide spatially varying augmentation intensities for prompt tuning.

If this is right

  • The method yields higher accuracy than prior test-time adaptation techniques on adversarial examples.
  • Performance on unattacked clean data also improves or stays comparable.
  • The approach better suits fine-grained tasks by avoiding destruction of small discriminative regions.
  • Augmentations become semantics-preserving rather than applied uniformly across the image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-guided principle could be tested on distribution shifts other than adversarial attacks, such as natural corruptions or domain changes.
  • Real-world systems facing possible input manipulations might adopt this form of test-time adaptation to reduce vulnerability without retraining the base model.
  • Extending the rollout refinement to other vision-language architectures beyond CLIP would check whether the robustness gains generalize.

Load-bearing premise

The refined gradient attention rollout can reliably locate regions that stay semantically meaningful after an attack and can steer augmentations without erasing discriminative details.

What would settle it

A controlled test in which attention maps from the rollout show low overlap with regions that actually determine correct classification under attack, or in which the guided augmentations produce lower accuracy than uniform multi-view baselines.

Figures

Figures reproduced from arXiv: 2605.19956 by Jia-Wei Hai, Xiu-Shen Wei, Yijun Wang.

Figure 1
Figure 1. Figure 1: (a) Cosine similarity in the unit circle: adversarially attacked (colored) and original (black) feature vectors are highly divergent; (b) Ratio of true labels among the Top-K predictions under adversarial attacks: the true label of the input is pushed out of the Top-K predictions (ViT-B/16). 2026). However, they exhibit significant degradation under even subtle adversarial perturbations, raising serious sa… view at source ↗
Figure 2
Figure 2. Figure 2: The pipeline of A-TPT. Given an input sample, Attention Refinement based on token-gradient is used to identify semantic parts. Then, Attention-Guided Multi-View Augmentation builds a set of semantics-preserving views for fine-tuning learnable prompts. After selecting low-entropy views followed by prompt tuning, TV-Based Ensemble weights reliable views in the final inference process. GAR takes the first row… view at source ↗
Figure 3
Figure 3. Figure 3: Quality of semantic identification on adversarial examples (Pets dataset). Compared with GAR, our refined attention focuses on continuous and discriminative semantic parts (ViT-B/16) [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Attention distribution of high-reliable views and low-reliable views from the same test sample on the Pets dataset (ViT-B/16). 0 10 20 30 40 50 60 70 80 90 16 viwes 32 views 64 viwes 0 Accuracy (%) Views Number Pets Caltech101 Cars DTD UCF101 EuroSAT Flower102 Aircraft [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Adversarial accuracy with the various numbers of aug￾mented views (ViT-B/16). Guided Test-Time Prompt Tuning (A-TPT). Inspired by feature corruption, we first decoupled semantic identifica￾tion from the training stage and leveraged the unperturbed semantic information under adversarial attacks. We found that existing gradient attention is sensitive to adversarial attacks and can introduce random attention … view at source ↗
read the original abstract

Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Attention-Guided Test-Time Prompt Tuning (A-TPT) for vision-language models such as CLIP. It refines the gradient attention rollout mechanism to locate semantically meaningful regions that survive adversarial attacks, then uses these regions to guide spatially varying augmentation intensities and multi-view ensembles during test-time prompt tuning and inference. The central claim is that this semantics-preserving approach yields superior performance compared to existing test-time adaptation methods on both adversarial and clean data.

Significance. If the refined attention mechanism reliably identifies attack-persistent semantic regions and the guidance improves adaptation without discarding discriminative information, the method could advance practical robustness for VLMs in fine-grained settings where standard multi-view augmentations fail. The empirical nature of the proposal means significance hinges on whether performance gains can be attributed to the attention component rather than prompt tuning in general.

major comments (2)
  1. [§3.2] §3.2: The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks, yet the manuscript supplies no quantitative validation such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.
  2. [Experiments] Experiments: The abstract states that A-TPT outperforms existing methods on both adversarial and clean data, but the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.
minor comments (2)
  1. [Abstract] Abstract: The claim of outperformance would be strengthened by including at least one key quantitative result (e.g., accuracy delta on a standard benchmark) rather than a purely qualitative statement.
  2. [§3.3] Notation: The description of 'spatially varying augmentation intensities' in §3.3 would benefit from an explicit equation or pseudocode defining how attention values modulate the augmentation parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and strengthen the empirical support.

read point-by-point responses
  1. Referee: [§3.2] The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks, yet the manuscript supplies no quantitative validation such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.

    Authors: We agree that quantitative validation would strengthen the justification for using the refined attention rollout. The current manuscript provides qualitative visualizations showing that attention maps remain focused on semantically relevant regions post-attack. In the revision, we will add quantitative metrics including IoU overlap, Pearson correlation, and stability scores between clean and adversarial attention maps, computed across the evaluation datasets. These will be reported in Section 3.2 or a dedicated appendix to directly support the semantics-preserving claim. revision: yes

  2. Referee: The abstract states that A-TPT outperforms existing methods on both adversarial and clean data, but the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.

    Authors: We acknowledge that additional experimental details and ablations are needed for full reproducibility and to isolate the attention-guidance contribution. Section 4 currently summarizes the setup, but we will expand it with: complete dataset specifications and splits, precise attack parameters (e.g., PGD epsilon, iteration counts), baseline re-implementation details, and new ablation studies comparing A-TPT against vanilla test-time prompt tuning without attention-guided augmentations or ensembles. These additions will clarify the source of the reported gains on both adversarial and clean data. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical method with external validation

full rationale

The paper presents A-TPT as an empirical proposal: it refines an existing gradient attention rollout technique to guide augmentations during test-time prompt tuning and validates the approach via accuracy improvements on standard adversarial and clean benchmarks. No equations, derivations, or predictions are shown to reduce by construction to fitted inputs or self-citations. The method description relies on standard attention mechanisms and prompt-tuning procedures without load-bearing self-referential steps or uniqueness claims imported from prior author work. The central performance claims rest on external experimental results rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the approach rests on standard assumptions about attention mechanisms in VLMs and the utility of test-time prompt tuning. No new entities are postulated.

axioms (1)
  • domain assumption Gradient attention rollout can highlight semantically meaningful regions that remain stable under adversarial perturbations
    This is the core premise used to guide augmentation intensities and ensemble weighting.

pith-pipeline@v0.9.0 · 5714 in / 1150 out tokens · 42377 ms · 2026-05-20T06:37:44.855891+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

  1. [1]

    Learning

    Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack , booktitle=. Learning

  2. [2]

    Shin, Gyungin and Xie, Weidi and Albanie, Samuel , booktitle=NIPS, pages=. Re

  3. [3]

    Zhang, Hao and Li, Feng and Zou, Xueyan and Liu, Shilong and Li, Chunyuan and Yang, Jianwei and Zhang, Lei , booktitle=. A

  4. [4]

    Wei, Yongxian and Wei, Xiu-Shen , title =. Mach. Intell. Res. , year =

  5. [5]

    Hong-Tao Yu and Xiu-Shen Wei and Yuxin Peng and Serge Belongie , title =. Proc. Int. Conf. Learn. Representations , year =

  6. [6]

    Zhou, Yiwei and Xia, Xiaobo and Lin, Zhiwei and Han, Bo and Liu, Tongliang , booktitle=

  7. [7]

    Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei , booktitle=

  8. [8]

    Advances in Neural Inf

    Test-time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models , author=. Advances in Neural Inf. Process. Syst. , pages=

  9. [9]

    Sheng, Lijun and Liang, Jian and Wang, Zilei and He, Ran , booktitle=. R-

  10. [10]

    On the test-time zero-shot generalization of vision-language models:

    Zanella, Maxime and Ben Ayed, Ismail , booktitle=. On the test-time zero-shot generalization of vision-language models:

  11. [11]

    2024 , volume =

    Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei , title =. 2024 , volume =

  12. [12]

    One prompt word is enough to boost adversarial robustness for pre-trained vision-language models , author=. Proc. IEEE Conf. Comp. Vis. Patt. Recogn. , pages=

  13. [13]

    Shuoyuan Wang and Yixuan Li and Hongxin Wei , booktitle=

  14. [14]

    Tan, Baofeng and Wei, Xiu-Shen and Zhao, Lin , booktitle=

  15. [15]

    Li, Haoxin and Li, Boyang , booktitle=

  16. [16]

    Pattern Recogn

    Ye, Shuo and Peng, Qinmu and Cheung, Yiu-ming and Wang, Yu and Zou, Ziqian and You, Xinge , title =. Pattern Recogn. , year =

  17. [17]

    Text-guided attention is all you need for zero-shot robustness in vision-language models , author=

  18. [18]

    Zhang, Jiaming and Yi, Qi and Sang, Jitao , booktitle=ICM, pages=

  19. [19]

    Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Li, Chongxuan and Cheung, Ngai-Man Man and Lin, Min , booktitle=NIPS, pages=

  20. [20]

    Szegedy, Christian and Zaremba, Wojciech and Sutskever, Ilya and Bruna, Joan and Erhan, Dumitru and Goodfellow, Ian and Fergus, Rob , booktitle = ICLR, pages=

  21. [21]

    Jiaming Zhang and Qi Yi and Jitao Sang , year =

  22. [22]

    and Shlens, Jonathon and Szegedy, Christian , booktitle = ICLR, pages =

    Goodfellow, Ian J. and Shlens, Jonathon and Szegedy, Christian , booktitle = ICLR, pages =

  23. [23]

    Yin, Ziyi and Ye, Muchao and Zhang, Tianrong and Du, Tianyu and Zhu, Jinguo and Liu, Han and Chen, Jinghui and Wang, Ting and Ma, Fenglong , booktitle = NIPS, pages =

  24. [24]

    Gupta, Saurav and Lakhotia, Sourav and Rawat, Abhay and Tallamraju, Rahul , booktitle = CVPR, pages=

  25. [25]

    Wang, Sibo and Zhang, Jie and Yuan, Zheng and Shan, Shiguang , booktitle=CVPR, pages=

  26. [26]

    Fei-Fei, Li , booktitle=CVPR, pages=

  27. [27]

    Parkhi, Omkar M and Vedaldi, Andrea and Zisserman, Andrew and Jawahar, CV , booktitle=CVPR, pages=

  28. [28]

    Nilsback, Maria-Elena and Zisserman, Andrew , booktitle=ICCV, pages=

  29. [29]

    Maji, Subhransu and Rahtu, Esa and Kannala, Juho and Blaschko, Matthew and Vedaldi, Andrea , journal=

  30. [30]

    Describing textures in the wild , author=

  31. [31]

    2019 , volume =

    Helber, Patrick and Bischke, Benjamin and Dengel, Andreas and Borth, Damian , title =. 2019 , volume =

  32. [32]

    Center for Research in Computer Vision , year =

    Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak , title =. Center for Research in Computer Vision , year =

  33. [33]

    Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li , booktitle=CVPR, pages=

  34. [34]

    Tong, Baoshun and Lai, Hanjiang and Pan, Yan and Yin, Jian , booktitle=CVPR, pages=

  35. [35]

    and Zoph, Barret and Gilmer, Justin and Lakshminarayanan, Balaji , booktitle =

    Hendrycks, Dan and Mu, Norman and Cubuk, Ekin D. and Zoph, Barret and Gilmer, Justin and Lakshminarayanan, Balaji , booktitle =. Aug

  36. [36]

    Madry, Aleksander and Makelov, Aleksandar and Schmidt, Ludwig and Tsipras, Dimitris and Vladu, Adrian , booktitle = ICLR, pages=. Towards

  37. [37]

    2024 , volume =

    Yifan Pu and Yizeng Han and Yulin Wang and Junlan Feng and Chao Deng and Gao Huang , title =. 2024 , volume =

  38. [38]

    Enhancing

    Wang, Yeyuan and Gao, Dehong and Yi, Lei and Jin, Linbo and Zhang, Jinxia and Yang, Libin and Cai, Xiaoyan , booktitle = AAAI, pages =. Enhancing

  39. [39]

    Yoon, Hee Suk and Yoon, Eunseop and Tee, Joshua Tian Jin and Hasegawa-Johnson, Mark and Li, Yingzhen and Yoo, Chang D , booktitle=ICLR, pages =. C-

  40. [40]

    Clip is strong enough to fight back:

    Xing, Songlong and Zhao, Zhengyu and Sebe, Nicu , booktitle=CVPR, pages=. Clip is strong enough to fight back:

  41. [41]

    2024 , pages =

    Li, Lin and Guan, Haoyan and Qiu, Jianing and Spratling, Michael , title =. 2024 , pages =

  42. [42]

    2025 , pages =

    Hossain, Md Zarif and Imteaj, Ahmed , title =. 2025 , pages =

  43. [43]

    Dong, Junhao and Zhang, Cong and Qu, Xinghua and Ma, Zejun and Koniusz, Piotr and Ong, Yew-Soon , booktitle = NIPS, year =. Robust

  44. [44]

    Wang, Xin and Chen, Kai and Zhang, Jiaming and Chen, Jingjing and Ma, Xingjun , booktitle=CVPR, pages=

  45. [45]

    Pu, Yifan and Han, Yizeng and Wang, Yulin and Feng, Junlan and Deng, Chao and Huang, Gao , journal=

  46. [46]

    Chefer, Hila and Gur, Shir and Wolf, Lior , booktitle=CVPR, pages=

  47. [47]

    Cui, Xuanming and Aparcedo, Alejandro and Jang, Young Kyun and Lim, Ser-Nam , booktitle=CVPR, pages=

  48. [48]

    Nie, Weili and Guo, Brandon and Huang, Yujia and Xiao, Chaowei and Vahdat, Arash and Anandkumar, Anima , booktitle=ICML, pages=

  49. [49]

    Beyond pretrained features:

    You, Zunzhi and Liu, Daochang and Han, Bohyung and Xu, Chang , booktitle=NIPS, pages=. Beyond pretrained features:

  50. [50]

    Align your prompts:

    Abdul Samadh, Jameel and Gani, Mohammad Hanan and Hussein, Noor and Khattak, Muhammad Uzair and Naseer, Muhammad Muzammal and Shahbaz Khan, Fahad and Khan, Salman H , booktitle=. Align your prompts:

  51. [51]

    Just shift it:

    Sui, Elaine and Wang, Xiaohan and Yeung-Levy, Serena , booktitle=WACV, pages=. Just shift it:

  52. [52]

    Jia, Menglin and Tang, Luming and Chen, Bor-Chun and Cardie, Claire and Belongie, Serge and Hariharan, Bharath and Lim, Ser-Nam , booktitle = ECCV, pages =. Visual

  53. [53]

    Croce, Francesco and Hein, Matthias , booktitle=ICML, pages=

  54. [54]

    Carlini, Nicholas and Wagner, David , booktitle=

  55. [55]

    Moosavi-Dezfooli, Seyed-Mohsen and Fawzi, Alhussein and Frossard, Pascal , booktitle=CVPR, pages=. Deep

  56. [56]

    Krause, Jonathan and Stark, Michael and Deng, Jia and Fei-Fei, Li , booktitle=

  57. [57]

    Bossard, Lukas and Guillaumin, Matthieu and Van Gool, Luc , title =

  58. [58]

    and Oliva, Aude and Torralba, Antonio , booktitle=CVPR, title=

    Xiao, Jianxiong and Hays, James and Ehinger, Krista A. and Oliva, Aude and Torralba, Antonio , booktitle=CVPR, title=. 2010 , pages=

  59. [59]

    Understanding

    Chengzhi Mao and Scott Geng and Junfeng Yang and Xin Wang and Carl Vondrick , booktitle=ICLR, year=. Understanding

  60. [60]

    Schlarmann, Christian and Singh, Naman Deep and Croce, Francesco and Hein, Matthias , booktitle=ICML, pages=. Robust

  61. [61]

    Adversarial robustness:

    Chen, Tianlong and Liu, Sijia and Chang, Shiyu and Cheng, Yu and Amini, Lisa and Wang, Zhangyang , booktitle=CVPR, pages=. Adversarial robustness:

  62. [62]

    2024 , pages=

    Islam, Khawar and Zaheer, Muhammad Zaigham and Mahmood, Arif and Nandakumar, Karthik , booktitle=CVPR, title=. 2024 , pages=

  63. [63]

    2025 , pages =

    Li, Haoxin and Li, Boyang , title =. 2025 , pages =

  64. [64]

    2024 , pages =

    Hu, Feiran and Zhang, Chenlin and Guo, Jiangliang and Wei, Xiu-Shen and Zhao, Lin and Xu, Anqi and Gao, Lingyan , title =. 2024 , pages =

  65. [65]

    2025 , volume =

    Yang, Suorong and Li, Peijia and Xiong, Xin and Shen, Furao and Zhao, Jian , title =. 2025 , volume =

  66. [66]

    Wang, Yulin and Huang, Gao and Song, Shiji and Pan, Xuran and Xia, Yitong and Wu, Cheng , journal=

  67. [67]

    Eyal Michaeli and Ohad Fried , booktitle=

  68. [68]

    Diverse data augmentation with diffusions for effective test-time prompt tuning , author=. Proc. IEEE Int. Conf. Comp. Vis. , pages=

  69. [69]

    Enhancing fine-grained vision-language pretraining with negative augmented samples , author=. Proc. Conf. AAAI , pages=

  70. [70]

    Advances in Neural Inf

    Implicit semantic data augmentation for deep networks , author=. Advances in Neural Inf. Process. Syst. , pages=

  71. [71]

    Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and LI, Chongxuan and Cheung, Ngai-Man (Man) and Lin, Min , booktitle =

  72. [72]

    Mach. Intell. Res. , volume =. 2023 , author =

  73. [73]

    Hendrycks, Dan and Zhao, Kevin and Basart, Steven and Steinhardt, Jacob and Song, Dawn , booktitle=CVPR, pages=. Natural

  74. [74]

    Recht, Benjamin and Roelofs, Rebecca and Schmidt, Ludwig and Shankar, Vaishaal , booktitle=ICML, pages=. Do. 2019 , organization=

  75. [75]

    Hendrycks, Dan and Basart, Steven and Mu, Norman and Kadavath, Saurav and Wang, Frank and Dorundo, Evan and Desai, Rahul and Zhu, Tyler and Parajuli, Samyak and Guo, Mike and others , booktitle=. The

  76. [76]

    Learning

    Wang, Haohan and Ge, Songwei and Lipton, Zachary and Xing, Eric P , booktitle =. Learning

  77. [77]

    Chinese Journal of Electronics , volume =

    Towards. Chinese Journal of Electronics , volume =. 2026 , author =

  78. [78]

    Chinese Journal of Electronics , volume =

    Enhancing the. Chinese Journal of Electronics , volume =. 2025 , author =

  79. [79]

    2024 , author =

    FSCIL-EACA: Few-Shot Class-Incremental Learning Network Based on Embedding Augmentation and Classifier Adaptation for Image Classification , journal =. 2024 , author =