Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Jia-Wei Hai; Xiu-Shen Wei; Yijun Wang

arXiv: 2605.19956 · v1 · pith:FDZWHDXG · submitted 2026-05-19 · 💻 cs.CV


Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: test-time adaptation, adversarial robustness, vision-language models, prompt tuning, attention mechanism, fine-grained robustness, CLIP


The pith

Refined gradient attention rollout identifies surviving semantic regions to guide test-time prompt tuning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Attention-Guided Test-Time Prompt Tuning (A-TPT) to protect vision-language models such as CLIP from adversarial attacks at inference time. Standard test-time methods rely on uniform multi-view augmentations that often erase fine-grained discriminative details. A-TPT first refines the gradient attention rollout to locate regions that remain meaningful after an attack. It then uses those regions to set spatially varying augmentation strengths and to guide the multi-view ensemble used for prompt tuning and inference. Experiments show gains on both attacked inputs and clean images.

Core claim

A refined gradient attention rollout locates semantically meaningful image regions that persist under adversarial perturbations. These regions then direct the intensity of spatially varying augmentations and support multi-view ensembles during test-time prompt tuning, allowing adaptation while preserving the information needed for accurate classification.
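This page does not describe how the authors refine the rollout. The sketch below therefore shows only a standard gradient-weighted attention rollout, the kind of baseline the abstract says A-TPT refines. It uses the layer-wise update rule popularised by Chefer et al. and makes the following assumptions:

- Per-layer attention maps and their gradients have already been extracted from a ViT image encoder, each with shape (heads, tokens, tokens).
- Token 0 is the [CLS] token.
- The function name and the final min-max scaling are illustrative choices, not the paper's code.

```python
import numpy as np


def gradient_attention_rollout(attns, grads):
    """Gradient-weighted attention rollout over a stack of ViT layers.

    attns, grads: sequences with one array per layer, each shaped
    (heads, tokens, tokens); token 0 is assumed to be [CLS].
    Returns one relevance score per patch token, min-max scaled to [0, 1].
    """
    num_tokens = attns[0].shape[-1]
    relevance = np.eye(num_tokens)  # start from the identity (residual path)
    for attn, grad in zip(attns, grads):
        # Gradient-weighted attention, positive part only, averaged over heads.
        weighted = np.clip(grad * attn, 0.0, None).mean(axis=0)
        # Accumulate across layers: R <- R + A_bar @ R.
        relevance = relevance + weighted @ relevance
    cls_to_patches = relevance[0, 1:]  # how much [CLS] draws on each patch
    span = cls_to_patches.max() - cls_to_patches.min()
    return (cls_to_patches - cls_to_patches.min()) / (span + 1e-12)


# Random stand-ins for a 2-layer, 3-head ViT with 1 [CLS] token + 4 patches.
rng = np.random.default_rng(0)
attns = [rng.random((3, 5, 5)) for _ in range(2)]
grads = [rng.standard_normal((3, 5, 5)) for _ in range(2)]
patch_map = gradient_attention_rollout(attns, grads)
print(patch_map.shape)  # (4,)
```

The output is a per-patch map in [0, 1]. Any refinement of the kind the paper describes would change how `weighted` is formed or post-processed, not the overall layer-by-layer accumulation.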

What carries the argument

The refined gradient attention rollout mechanism that identifies semantically meaningful regions surviving under adversarial attacks and uses them to guide spatially varying augmentation intensities for prompt tuning.
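This page does not define how attention values set the augmentation strength. The following is therefore a hypothetical rule for illustration only. Each pixel receives Gaussian noise with standard deviation `max_sigma * (1 - relevance)`, so high-attention patches are left nearly untouched while low-attention patches are perturbed most.

The following are all assumptions and may differ from A-TPT's actual augmentations:

- the noise-based augmentation itself;
- the `max_sigma` value;
- the nearest-neighbour upsampling of the patch grid.

```python
import numpy as np


def attention_guided_noise(image, patch_relevance, patch_size, max_sigma=0.2, rng=None):
    """Hypothetical spatially varying augmentation driven by an attention map.

    image: float array (H, W, C) with values in [0, 1]; H and W must be
        multiples of patch_size.
    patch_relevance: array (H // patch_size, W // patch_size) in [0, 1].
    Per-pixel noise std is max_sigma * (1 - relevance): salient patches are
    barely perturbed, background patches receive the strongest noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Nearest-neighbour upsampling of the patch grid to pixel resolution.
    pixel_relevance = np.kron(patch_relevance, np.ones((patch_size, patch_size)))
    sigma = max_sigma * (1.0 - pixel_relevance)[..., None]  # broadcast over channels
    noisy = image + sigma * rng.standard_normal(image.shape)
    return np.clip(noisy, 0.0, 1.0)


# Build 8 views of a 4x4 RGB image whose top-left 2x2 patch is fully salient.
rng = np.random.default_rng(0)
img = np.full((4, 4, 3), 0.5)
relevance = np.array([[1.0, 0.0], [0.0, 0.0]])
views = [attention_guided_noise(img, relevance, patch_size=2, rng=rng) for _ in range(8)]
```

Every view keeps the salient patch intact and varies only the rest of the image. This is the "semantics-preserving" property the summary attributes to the method, shown here under the assumed noise rule.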

If this is right

The method yields higher accuracy than prior test-time adaptation techniques on adversarial examples.
Performance on unattacked clean data also improves or stays comparable.
The approach better suits fine-grained tasks by avoiding destruction of small discriminative regions.
Augmentations become semantics-preserving rather than applied uniformly across the image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same attention-guided principle could be tested on distribution shifts other than adversarial attacks, such as natural corruptions or domain changes.
Real-world systems facing possible input manipulations might adopt this form of test-time adaptation to reduce vulnerability without retraining the base model.
Extending the rollout refinement to other vision-language architectures beyond CLIP would check whether the robustness gains generalize.

Load-bearing premise

The refined gradient attention rollout can reliably locate regions that stay semantically meaningful after an attack and can steer augmentations without erasing discriminative details.

What would settle it

A controlled test in which attention maps from the rollout show low overlap with regions that actually determine correct classification under attack, or in which the guided augmentations produce lower accuracy than uniform multi-view baselines.

Figures

Figures reproduced from arXiv: 2605.19956 by Jia-Wei Hai, Xiu-Shen Wei, Yijun Wang.

**Figure 1.** (a) Cosine similarity in the unit circle: adversarially attacked (colored) and original (black) feature vectors are highly divergent; (b) ratio of true labels among the Top-K predictions under adversarial attacks: the true label of the input is pushed out of the Top-K predictions (ViT-B/16).
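Figure 1 reports two diagnostics. Assume image features and class logits have already been extracted from CLIP for the clean and adversarial versions of the same images. Under that assumption, both diagnostics can be computed as below. The function names are illustrative, the toy arrays are invented purely to show the behaviour, and the authors' exact evaluation protocol is not given here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (N, D) feature matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)


def topk_true_label_ratio(logits, labels, k):
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())


clean_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
adv_feats = np.array([[0.0, 1.0], [0.0, 1.0]])
print(cosine_similarity(clean_feats, adv_feats))  # [0. 1.]

adv_logits = np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])
labels = np.array([2, 0])
print(topk_true_label_ratio(adv_logits, labels, k=1))  # 0.5
print(topk_true_label_ratio(adv_logits, labels, k=2))  # 1.0
```

An attack that succeeds in the sense of panel (b) drives the Top-K ratio down even for K > 1. In this toy case, the first sample's true label is missing at K = 1 but reappears at K = 2.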

**Figure 2.** The pipeline of A-TPT. Given an input sample, Attention Refinement based on token gradients identifies semantic parts. Attention-Guided Multi-View Augmentation then builds a set of semantics-preserving views for fine-tuning learnable prompts. After low-entropy views are selected and the prompts are tuned, TV-Based Ensemble weights reliable views in the final inference process. GAR takes the first row…
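The step in Figure 2 that selects low-entropy views before tuning resembles the confidence selection in the original test-time prompt tuning (TPT) method. TPT computes each augmented view's prediction entropy, keeps the most confident fraction, and minimises the entropy of their averaged prediction.

The sketch below computes only that objective from precomputed per-view logits. In practice, the loss is back-propagated through CLIP into the learnable prompt vectors. `keep_ratio` is a placeholder (TPT's published default keeps 10% of views). The TV-based ensemble weighting is not reproduced because this page does not define it.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)


def select_confident_views(view_logits, keep_ratio=0.1):
    """TPT-style confidence selection over augmented views.

    view_logits: (num_views, num_classes) logits, one row per view.
    Returns the indices of the lowest-entropy views and the entropy of their
    averaged class distribution (the quantity test-time tuning minimises).
    """
    probs = softmax(view_logits)
    per_view_entropy = entropy(probs)
    num_keep = max(1, int(round(keep_ratio * len(view_logits))))
    keep = np.argsort(per_view_entropy)[:num_keep]
    avg_probs = probs[keep].mean(axis=0)
    return keep, entropy(avg_probs)


view_logits = np.array([[5.0, 0.0, 0.0],   # confident
                        [0.0, 0.0, 0.0],   # uniform, maximally uncertain
                        [0.0, 4.0, 0.0],   # confident
                        [1.0, 1.0, 0.0]])  # fairly uncertain
keep, loss = select_confident_views(view_logits, keep_ratio=0.5)
print(keep)  # [0 2]
```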

**Figure 3.** Quality of semantic identification on adversarial examples (Pets dataset). Compared with GAR, our refined attention focuses on continuous and discriminative semantic parts (ViT-B/16).

**Figure 4.** Attention distribution of high-reliability and low-reliability views from the same test sample on the Pets dataset (ViT-B/16).

**Figure 5.** Adversarial accuracy with various numbers of augmented views (ViT-B/16). Plot: accuracy (%) versus number of views (16, 32, 64) on Pets, Caltech101, Cars, DTD, UCF101, EuroSAT, Flower102, and Aircraft.

Original abstract

Vision-Language Models (VLMs), such as CLIP, have achieved strong zero-shot performance on downstream tasks, supported by various fine-tuning and adaptation methods. However, recent studies have shown that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies. These methods struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. We then leverage these regions to guide the spatially varying augmentation intensities and the multi-view ensemble used for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at https://github.com/SEU-VIPGroup/A-TPT .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

A-TPT refines attention rollout to guide spatially varying augmentations in test-time prompt tuning for VLMs, with experiments showing gains on adversarial and clean data, but the key assumption about attention stability lacks direct validation.


The main takeaway is that this paper introduces A-TPT to make test-time prompt tuning more robust for vision-language models like CLIP on fine-grained tasks. It refines gradient attention rollout to locate semantic regions that supposedly survive attacks, then uses those regions to vary augmentation intensity across the image and to shape a multi-view ensemble during tuning and inference. The abstract and results claim consistent outperformance over existing test-time adaptation baselines on both adversarial and clean accuracy.

Referee Report

2 major / 2 minor

Summary. The paper proposes Attention-Guided Test-Time Prompt Tuning (A-TPT) for vision-language models such as CLIP. It refines the gradient attention rollout mechanism to locate semantically meaningful regions that survive adversarial attacks, then uses these regions to guide spatially varying augmentation intensities and multi-view ensembles during test-time prompt tuning and inference. The central claim is that this semantics-preserving approach yields superior performance compared to existing test-time adaptation methods on both adversarial and clean data.

Significance. If the refined attention mechanism reliably identifies attack-persistent semantic regions and the guidance improves adaptation without discarding discriminative information, the method could advance practical robustness for VLMs in fine-grained settings where standard multi-view augmentations fail. The empirical nature of the proposal means significance hinges on whether performance gains can be attributed to the attention component rather than prompt tuning in general.

major comments (2)

[§3.2] The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks. However, the manuscript supplies no quantitative validation, such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.
[Experiments] The abstract states that A-TPT outperforms existing methods on both adversarial and clean data. However, the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.

minor comments (2)

[Abstract] The claim of outperformance would be strengthened by including at least one key quantitative result (e.g., accuracy delta on a standard benchmark) rather than a purely qualitative statement.
[§3.3] Notation: The description of 'spatially varying augmentation intensities' in §3.3 would benefit from an explicit equation or pseudocode defining how attention values modulate the augmentation parameters.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and strengthen the empirical support.


Referee: [§3.2] The refined gradient attention rollout is asserted to identify semantically meaningful regions that survive under adversarial attacks, yet the manuscript supplies no quantitative validation such as overlap metrics, correlation coefficients, or stability scores between attention maps computed on clean images and their adversarially perturbed counterparts. This assumption directly underpins the augmentation guidance in §3.3 and the claim that the method is semantics-preserving.

Authors: We agree that quantitative validation would strengthen the justification for using the refined attention rollout. The current manuscript provides qualitative visualizations showing that attention maps remain focused on semantically relevant regions post-attack. In the revision, we will add quantitative metrics including IoU overlap, Pearson correlation, and stability scores between clean and adversarial attention maps, computed across the evaluation datasets. These will be reported in Section 3.2 or a dedicated appendix to directly support the semantics-preserving claim. revision: yes
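The promised IoU and Pearson metrics can be computed per image pair in many ways; one plausible version is sketched below. It assumes the clean and adversarial attention maps have the same shape. Binarising each map at its top 20% of positions is an assumed threshold, and the function names are hypothetical. Averaging the per-image values over a dataset would give a dataset-level summary of the kind the referee requests.

```python
import numpy as np


def top_fraction_mask(attn_map, fraction=0.2):
    """Boolean mask marking the `fraction` highest-scoring positions."""
    flat = attn_map.ravel()
    k = max(1, int(round(fraction * flat.size)))
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[-k:]] = True
    return mask.reshape(attn_map.shape)


def attention_stability(clean_map, adv_map, fraction=0.2):
    """IoU of top-region masks and Pearson r between two same-shaped maps."""
    m_clean = top_fraction_mask(clean_map, fraction)
    m_adv = top_fraction_mask(adv_map, fraction)
    iou = np.logical_and(m_clean, m_adv).sum() / np.logical_or(m_clean, m_adv).sum()
    pearson = np.corrcoef(clean_map.ravel(), adv_map.ravel())[0, 1]
    return float(iou), float(pearson)


clean = np.arange(16, dtype=float).reshape(4, 4)
reversed_map = clean.ravel()[::-1].reshape(4, 4)
print(attention_stability(clean, clean, fraction=0.25))         # IoU 1.0, r close to 1
print(attention_stability(clean, reversed_map, fraction=0.25))  # IoU 0.0, r close to -1
```

IoU measures whether the peak regions coincide, and Pearson r measures whether the whole ranking of patches is preserved. Reporting both guards against maps that share a peak but otherwise disagree.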
Referee: [Experiments] The abstract states that A-TPT outperforms existing methods on both adversarial and clean data. However, the manuscript does not report dataset details, specific attack configurations, baseline implementations, or ablations that isolate the contribution of attention-guided augmentations from the underlying prompt-tuning procedure. Without these, it is difficult to assess whether the central empirical claim holds.

Authors: We acknowledge that additional experimental details and ablations are needed for full reproducibility and to isolate the attention-guidance contribution. Section 4 currently summarizes the setup, but we will expand it with: complete dataset specifications and splits, precise attack parameters (e.g., PGD epsilon, iteration counts), baseline re-implementation details, and new ablation studies comparing A-TPT against vanilla test-time prompt tuning without attention-guided augmentations or ensembles. These additions will clarify the source of the reported gains on both adversarial and clean data. revision: yes
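The sketch below shows what the PGD parameters named above (epsilon, step size, iteration count) control in an L∞ PGD attack in the style of Madry et al. It omits the random initialisation. A linear softmax classifier with an analytic input gradient stands in for CLIP so the example runs with NumPy alone. All parameter values are placeholders, not the attack settings used in the paper.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()


def pgd_linf(x, label, weights, bias, epsilon, step_size, steps):
    """Untargeted L-infinity PGD against a linear softmax classifier.

    Each step ascends the cross-entropy loss with a signed gradient, then
    projects back onto the epsilon-ball around x and the valid range [0, 1].
    """
    x_adv = x.copy()
    one_hot = np.eye(len(bias))[label]
    for _ in range(steps):
        probs = softmax(weights @ x_adv + bias)
        grad = weights.T @ (probs - one_hot)  # d(cross-entropy)/d(input)
        x_adv = x_adv + step_size * np.sign(grad)
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # epsilon-ball projection
        x_adv = np.clip(x_adv, 0.0, 1.0)  # stay a valid image
    return x_adv


# Toy 2-class, 2-"pixel" model; epsilon, step size and steps are placeholders.
W = np.eye(2)
b = np.zeros(2)
x = np.array([0.6, 0.4])
x_adv = pgd_linf(x, label=0, weights=W, bias=b, epsilon=0.15, step_size=0.05, steps=10)
print(np.argmax(W @ x), np.argmax(W @ x_adv))  # 0 1
```

In this sketch, epsilon bounds how far any input value may move, and step size and steps together control how thoroughly that budget is explored. These are the quantities a reproducible attack configuration needs to state.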

Circularity Check

0 steps flagged

No circularity detected; empirical method with external validation

full rationale

The paper presents A-TPT as an empirical proposal: it refines an existing gradient attention rollout technique to guide augmentations during test-time prompt tuning and validates the approach via accuracy improvements on standard adversarial and clean benchmarks. No equations, derivations, or predictions are shown to reduce by construction to fitted inputs or self-citations. The method description relies on standard attention mechanisms and prompt-tuning procedures without load-bearing self-referential steps or uniqueness claims imported from prior author work. The central performance claims rest on external experimental results rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only; the approach rests on standard assumptions about attention mechanisms in VLMs and the utility of test-time prompt tuning. No new entities are postulated.

axioms (1)

domain assumption: gradient attention rollout can highlight semantically meaningful regions that remain stable under adversarial perturbations.
This is the core premise used to guide augmentation intensities and ensemble weighting.

pith-pipeline@v0.9.0 · 5714 in / 1150 out tokens · 42377 ms · 2026-05-20T06:37:44.855891+00:00 · methodology

