Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3
The pith
Refined gradient attention rollout identifies surviving semantic regions to guide test-time prompt tuning in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A refined gradient attention rollout locates semantically meaningful image regions that persist under adversarial perturbations. These regions then direct the intensity of spatially varying augmentations and support multi-view ensembles during test-time prompt tuning, allowing adaptation while preserving the information needed for accurate classification.
What carries the argument
A refined gradient attention rollout mechanism identifies semantically meaningful regions that survive adversarial attacks. Those regions then guide spatially varying augmentation intensities for prompt tuning.
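To make the mechanism concrete, here is a minimal sketch of a generic gradient-weighted attention rollout for a ViT-style encoder. It is not the paper's refined variant, whose details are not given on this page. The function name, the input layout (one `(heads, tokens, tokens)` array per layer, with token 0 as the class token), and the max-normalization at the end are all assumptions made for illustration.

```python
import numpy as np

def gradient_attention_rollout(attentions, gradients):
    """Generic gradient-weighted attention rollout (illustrative sketch).

    attentions, gradients: lists with one array per layer, each of shape
    (heads, tokens, tokens). Token 0 is assumed to be the class token.
    Returns one relevance score per patch token, scaled so the maximum is ~1.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for attn, grad in zip(attentions, gradients):
        # Weight each attention entry by its gradient, keep only positive
        # contributions, then average over heads.
        weighted = np.clip(attn * grad, 0.0, None).mean(axis=0)
        # Add the identity to account for the residual connection and
        # renormalize each row so it sums to one.
        weighted = weighted + identity
        weighted = weighted / weighted.sum(axis=-1, keepdims=True)
        # Accumulate relevance from the input layer up to this layer.
        rollout = weighted @ rollout
    # Class-token row, without the class-to-class entry.
    patch_relevance = rollout[0, 1:]
    # Small constant guards against division by zero (an assumption).
    return patch_relevance / (patch_relevance.max() + 1e-12)
```

In a real pipeline the attention maps and their gradients would come from hooks on the image encoder. Here they are plain arrays so the rollout arithmetic can be checked in isolation.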
If this is right
- The method yields higher accuracy than prior test-time adaptation techniques on adversarial examples.
- Performance on unattacked clean data also improves or stays comparable.
- The approach better suits fine-grained tasks by avoiding destruction of small discriminative regions.
- Augmentations become semantics-preserving rather than applied uniformly across the image.
Where Pith is reading between the lines
- The same attention-guided principle could be tested on distribution shifts other than adversarial attacks, such as natural corruptions or domain changes.
- Real-world systems facing possible input manipulations might adopt this form of test-time adaptation to reduce vulnerability without retraining the base model.
- Extending the rollout refinement to other vision-language architectures beyond CLIP would check whether the robustness gains generalize.
Load-bearing premise
The refined gradient attention rollout can reliably locate regions that stay semantically meaningful after an attack and can steer augmentations without erasing discriminative details.
What would settle it
A controlled test in which attention maps from the rollout show low overlap with regions that actually determine correct classification under attack, or in which the guided augmentations produce lower accuracy than uniform multi-view baselines.
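One way such a test could be scored is sketched below: compare two attention maps (for example, clean versus adversarial) by the IoU of their top-ranked regions and by their Pearson correlation. The function name, the `top_fraction` threshold, and the choice to binarize by rank are hypothetical. The page does not say how overlap should be measured.

```python
import numpy as np

def attention_stability(clean_map, adv_map, top_fraction=0.2):
    """Return (IoU of top regions, Pearson correlation) for two attention maps.

    top_fraction is an illustrative threshold: the share of highest-scoring
    locations in each map that is treated as the attended region.
    """
    clean = np.asarray(clean_map, dtype=float).ravel()
    adv = np.asarray(adv_map, dtype=float).ravel()
    if clean.shape != adv.shape:
        raise ValueError("attention maps must have the same number of elements")
    # Number of locations kept in each binarized region (at least one).
    k = max(1, int(round(top_fraction * clean.size)))
    clean_top = set(np.argsort(clean)[-k:].tolist())
    adv_top = set(np.argsort(adv)[-k:].tolist())
    # Intersection over union of the two top-k regions.
    iou = len(clean_top & adv_top) / len(clean_top | adv_top)
    # Pearson correlation over all locations.
    pearson = float(np.corrcoef(clean, adv)[0, 1])
    return iou, pearson
```

Low IoU together with low correlation across many attacked images would count against the premise. High values alone would not prove it, since the maps could be stable yet focused on regions that do not drive the classification.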
Original abstract
Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Attention-Guided Test-Time Prompt Tuning (A-TPT) for vision-language models such as CLIP. It refines the gradient attention rollout mechanism to locate semantically meaningful regions that survive adversarial attacks, then uses these regions to guide spatially varying augmentation intensities and multi-view ensembles during test-time prompt tuning and inference. The central claim is that this semantics-preserving approach yields superior performance compared to existing test-time adaptation methods on both adversarial and clean data.
Significance. If the refined attention mechanism reliably identifies attack-persistent semantic regions and the guidance improves adaptation without discarding discriminative information, the method could advance practical robustness for VLMs in fine-grained settings where standard multi-view augmentations fail. The empirical nature of the proposal means significance hinges on whether performance gains can be attributed to the attention component rather than prompt tuning in general.
major comments (2)
- [§3.2] The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks, yet the manuscript supplies no quantitative validation such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.
- [Experiments] The abstract states that A-TPT outperforms existing methods on both adversarial and clean data, but the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.
minor comments (2)
- [Abstract] The claim of outperformance would be strengthened by including at least one key quantitative result (e.g., accuracy delta on a standard benchmark) rather than a purely qualitative statement.
- [§3.3] Notation: The description of 'spatially varying augmentation intensities' in §3.3 would benefit from an explicit equation or pseudocode defining how attention values modulate the augmentation parameters.
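As an illustration of the kind of definition the last comment asks for, the sketch below shows one hypothetical way attention could modulate augmentation strength: Gaussian noise whose standard deviation at each pixel is `max_sigma * (1 - attention)`, so highly attended regions are left nearly untouched. The noise augmentation, the linear rule, the function name, and the parameters are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def attention_guided_noise(image, patch_attention, max_sigma=0.1, rng=None):
    """Add Gaussian noise whose strength shrinks where attention is high.

    image: float array of shape (H, W, C) with values in [0, 1].
    patch_attention: array of shape (h, w) with values in [0, 1], where
    H and W must be integer multiples of h and w.
    The rule sigma = max_sigma * (1 - attention) is a hypothetical choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = image.shape
    h, w = patch_attention.shape
    if height % h or width % w:
        raise ValueError("image size must be a multiple of the attention grid")
    # Nearest-neighbour upsampling of the patch grid to pixel resolution.
    attn = np.repeat(np.repeat(patch_attention, height // h, axis=0),
                     width // w, axis=1)
    # Per-pixel noise scale, broadcast over the channel axis.
    sigma = max_sigma * (1.0 - attn)[..., None]
    noisy = image + sigma * rng.standard_normal(image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

With an attention value of 1 the pixel is returned unchanged, and with 0 it receives the full `max_sigma`. Any other augmentation with a scalar strength could be modulated the same way.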
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and strengthen the empirical support.
- Referee: [§3.2] The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks, yet the manuscript supplies no quantitative validation such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.
  Authors: We agree that quantitative validation would strengthen the justification for using the refined attention rollout. The current manuscript provides qualitative visualizations showing that attention maps remain focused on semantically relevant regions post-attack. In the revision, we will add quantitative metrics including IoU overlap, Pearson correlation, and stability scores between clean and adversarial attention maps, computed across the evaluation datasets. These will be reported in Section 3.2 or a dedicated appendix to directly support the semantics-preserving claim.
  Revision: yes
- Referee: [Experiments] The abstract states that A-TPT outperforms existing methods on both adversarial and clean data, but the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.
  Authors: We acknowledge that additional experimental details and ablations are needed for full reproducibility and to isolate the attention-guidance contribution. Section 4 currently summarizes the setup, but we will expand it with: complete dataset specifications and splits, precise attack parameters (e.g., PGD epsilon, iteration counts), baseline re-implementation details, and new ablation studies comparing A-TPT against vanilla test-time prompt tuning without attention-guided augmentations or ensembles. These additions will clarify the source of the reported gains on both adversarial and clean data.
  Revision: yes
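The attack parameters named in the response (PGD epsilon, iteration counts) map onto a standard L-infinity projected gradient descent loop, sketched below to show which knobs a reproducible report would need to state. The function name, the `grad_fn` callback, and the default values are placeholders, not the paper's configuration.

```python
import numpy as np

def pgd_linf(x, grad_fn, epsilon=1 / 255, step_size=0.25 / 255, steps=10):
    """L-infinity PGD sketch for inputs with values in [0, 1].

    x: clean input array.
    grad_fn: callable returning the gradient of the attack loss with respect
             to its input (a stand-in for backpropagation through the model).
    epsilon, step_size, steps: placeholder values, not the paper's settings.
    """
    x_adv = x.copy()
    for _ in range(steps):
        # Ascend the loss along the sign of the gradient.
        x_adv = x_adv + step_size * np.sign(grad_fn(x_adv))
        # Project back into the epsilon-ball around the clean input.
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        # Keep the result a valid image.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

Reporting `epsilon`, `step_size`, `steps`, and whether a random start is used would pin the attack down. This sketch starts from the clean input, which is one of several common conventions.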
Circularity Check
No circularity detected; empirical method with external validation
The paper presents A-TPT as an empirical proposal: it refines an existing gradient attention rollout technique to guide augmentations during test-time prompt tuning and validates the approach via accuracy improvements on standard adversarial and clean benchmarks. No equations, derivations, or predictions are shown to reduce by construction to fitted inputs or self-citations. The method description relies on standard attention mechanisms and prompt-tuning procedures without load-bearing self-referential steps or uniqueness claims imported from prior author work. The central performance claims rest on external experimental results rather than internal definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Gradient attention rollout can highlight semantically meaningful regions that remain stable under adversarial perturbations.