pith. machine review for the scientific record.

arxiv: 2604.11576 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial robustness · vision-language models · zero-shot transfer · CLIP finetuning · contrastive learning · web image-text pairs · adversarial examples

The pith

Adversarial finetuning of CLIP on web image-text pairs with contrastive loss improves zero-shot robustness across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to strengthen the adversarial robustness of vision-language models such as CLIP while preserving their zero-shot transfer to new tasks and domains. Standard approaches finetune only the vision encoder on a clean, labeled proxy dataset such as ImageNet, matching adversarially perturbed images to class labels; this shifts both the data distribution and the objective away from pretraining and reduces transferability. AdvFLYP instead generates adversarial versions of web-collected image-text pairs and aligns them with their texts using the same contrastive loss as CLIP's original pretraining, plus a regularization term that penalizes large deviations between adversarial and clean image features. Logit-level regularization improves robustness, while feature-level regularization helps maintain clean accuracy. The setup is evaluated on 14 downstream datasets and shown to outperform mainstream finetuning practices.
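In outline, the pretraining-mimicking objective is the standard symmetric contrastive (InfoNCE) loss, applied to adversarial rather than clean image features. A minimal numpy sketch; the feature arrays and the temperature value are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs.

    In AdvFLYP, `img_feats` would come from adversarially perturbed web
    images, and row i must align with the embedding of its own caption
    (row i of `txt_feats`). The temperature is the common CLIP default,
    not a value taken from the paper.
    """
    # L2-normalise so dot products are cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # pair i matches caption i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average of image-to-text and text-to-image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimising this loss pulls each adversarial image toward its own caption and away from the other captions in the batch, which is exactly the alignment pressure CLIP's pretraining applies to clean images.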

Core claim

AdvFLYP finetunes CLIP with adversarial images generated from image-text pairs collected from the web, matching them with their corresponding texts via a contrastive loss. To alleviate distortion of the adversarial embeddings of noisy web images, it further regularizes training by penalizing the deviation of adversarial image features from their clean counterparts. The logit- and feature-level regularization terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show superiority over mainstream practices.

What carries the argument

The AdvFLYP paradigm: a contrastive loss applied to adversarial web image-text pairs, plus a deviation penalty on adversarial image features.
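A sketch of how such a deviation penalty might compose with the main loss; the squared-L2 form and the `lam_feat` / `lam_logit` coefficient names are assumptions for illustration, not the paper's Eq. (13):

```python
import numpy as np

def advflyp_regularised_loss(loss_clip, adv_feats, clean_feats,
                             adv_logits, clean_logits,
                             lam_feat=1.0, lam_logit=1.0):
    """Hedged sketch of a regularised AdvFLYP-style objective.

    The review states only that one term penalises deviation of
    adversarial image features and another penalises deviation at the
    logit level; the squared-L2 penalties and weights below are
    illustrative choices, not the paper's exact formulation.
    """
    # Feature-level term: keep adversarial embeddings near clean ones.
    l_feat = np.mean(np.sum((adv_feats - clean_feats) ** 2, axis=1))
    # Logit-level term: keep adversarial predictions near clean ones.
    l_logit = np.mean(np.sum((adv_logits - clean_logits) ** 2, axis=1))
    return loss_clip + lam_feat * l_feat + lam_logit * l_logit
```

With both coefficients at zero this reduces to the plain contrastive objective, which matches the paper's split between AdvFLYP and its regularised variant.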

If this is right

  • Robustness transfers better across domains and datasets than label-supervised adversarial finetuning.
  • Logit-level regularization specifically strengthens resistance to attacks while feature-level regularization protects clean performance.
  • The approach works with noisy web-scale pairs instead of curated clean labeled data.
  • Zero-shot capabilities remain closer to the original pretrained model than in standard adversarial finetuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pretraining-mimicking recipe could be tested on other vision-language models to check whether the robustness gain generalizes beyond CLIP.
  • Abundant web pairs might allow scaling the method to larger model sizes without the cost of collecting new labeled proxy sets.
  • The regularization split suggests that hybrid loss terms could be tuned per downstream task to trade robustness against accuracy as needed.

Load-bearing premise

Adversarial finetuning on noisy web image-text pairs with contrastive loss plus the proposed regularization will preserve zero-shot transferability across domains better than label-based proxy datasets, without regularization coefficients or web data quality introducing hidden biases or overfitting.

What would settle it

Run the same 14-dataset evaluation protocol but with a new held-out domain dataset and standard adversarial attack strengths; if the label-based ImageNet proxy method achieves higher robust accuracy while matching or exceeding AdvFLYP clean zero-shot accuracy, the superiority claim does not hold.
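For reference, the PGD attack behind such robust-accuracy numbers (e.g. PGD-10 at ϵ = 1/255 in Figure 3) can be sketched as follows; the `grad_fn` interface and the step-size heuristic are illustrative assumptions, not details from the paper:

```python
import numpy as np

def pgd_linf(grad_fn, x, eps=1/255, steps=10, alpha=None):
    """L-infinity projected gradient descent on inputs in [0, 1].

    `grad_fn(x)` is assumed to return the gradient of the classification
    loss with respect to the input. The 2.5*eps/steps step size is a
    common heuristic, not a value taken from the paper.
    """
    if alpha is None:
        alpha = 2.5 * eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                 # keep valid pixel range
    return x_adv
```

Robust accuracy is then the fraction of test inputs still classified correctly after this perturbation, which is what the proposed falsification test would compare between methods.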

Figures

Figures reproduced from arXiv: 2604.11576 by Jindong Gu, Nicu Sebe, Philip Torr, Songlong Xing, Weijie Wang, Zhengyu Zhao.

Figure 1
Figure 1: A basic illustration of current mainstream AFT methods.
Figure 2
Figure 2: Overview of the formulation of $L_{CLIP}$, the logit-level regularisation $L_{logit}$, and the feature-level regularisation $L_{feat}$. $L_{CLIP}$ is the main loss of AdvFLYP (Eq. (7)); $L_{logit}$ and $L_{feat}$ are only employed in the regularised variant of AdvFLYP (Eq. (13)), denoted AdvFLYPfull. The normalised text embeddings $T_\phi = \left[\frac{g_\phi(t_i)}{\lVert g_\phi(t_i)\rVert}\right]_{i=1}^{N} \in \mathbb{R}^{N\times d}$ are computed as follows: $P_\theta^{adv} = \mathrm{softmax}(X_\theta^{adv} T^\intercal) \in \mathbb{R}^{…}$
Figure 3
Figure 3: Performance of AdvFLYPfull averaged over 14 downstream datasets versus the amount of image-text pairs from the web. Robust accuracy is evaluated under PGD-10 (ϵ = 1/255).
Figure 4
Figure 4: An example of an image from ImageNet and its descrip…
Figure 5
Figure 5: t-SNE visualisation of adversarial and clean image fea…
read the original abstract

Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes AdvFLYP, a finetuning method for CLIP that generates adversarial examples from web-collected image-text pairs, optimizes them with a contrastive loss to align adversarial images to their texts, and adds logit- and feature-level regularization to mitigate embedding distortion. It claims this 'finetune like pretrain' paradigm preserves zero-shot adversarial robustness and transferability better than standard adversarial finetuning on curated proxy datasets like ImageNet with classification objectives, with superiority shown across 14 downstream datasets in various domains. Code and weights are released.

Significance. If the central claim holds after addressing controls, the work would demonstrate that aligning finetuning data distribution and objective with pretraining can improve robust zero-shot transfer in vision-language models, addressing a key limitation of current adversarial robustness methods. The broad evaluation across domains and open release of code strengthen potential impact for practical VLM deployment.

major comments (2)
  1. [Experiments (§4, Tables 1-3)] The central claim attributes gains to following the pretraining recipe (web image-text pairs + contrastive loss + regularization) rather than mainstream ImageNet label-based finetuning. However, no ablation is reported that applies the identical contrastive objective and regularization to ImageNet images paired with class-name texts. Without this control, it remains unclear whether the web data distribution or the objective change drives the zero-shot robustness improvements (see skeptic note on confounding). This is load-bearing for the paradigm's novelty.
  2. [Method (§3.2)] The regularization is described as penalizing deviation of adversarial image features (feature-level) and logits, with claims that they benefit robustness and clean accuracy respectively. The exact mathematical form of these terms, the specific coefficient values used, and the procedure for selecting them (e.g., validation set or grid search) are not provided. Since these are free parameters, their sensitivity and selection must be detailed to support reproducibility and rule out hidden biases.
minor comments (3)
  1. [Abstract] Abstract contains a subject-verb agreement issue: 'finetunes CLIP with adversarial images ... and match them' should read 'matches them'.
  2. [Experiments (§4)] The manuscript does not report statistical significance tests, standard deviations across multiple runs, or details on hyperparameter search for the reported gains on the 14 datasets.
  3. [Introduction/Related Work] Related work on adversarial robustness for VLMs (e.g., prior contrastive or prompt-based defenses) should be expanded with explicit comparisons to isolate the contribution of the proposed regularization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity, reproducibility, and support for the central claims.

read point-by-point responses
  1. Referee: [Experiments (§4, Tables 1-3)] The central claim attributes gains to following the pretraining recipe (web image-text pairs + contrastive loss + regularization) rather than mainstream ImageNet label-based finetuning. However, no ablation is reported that applies the identical contrastive objective and regularization to ImageNet images paired with class-name texts. Without this control, it remains unclear whether the web data distribution or the objective change drives the zero-shot robustness improvements (see skeptic note on confounding). This is load-bearing for the paradigm's novelty.

    Authors: We agree that this control ablation is important for isolating the contribution of the web image-text data distribution versus the contrastive objective and regularization. The original manuscript did not include an experiment applying the identical contrastive loss and regularization to ImageNet images paired with class-name texts. In the revised version, we have added this ablation study in Section 4. The new results help clarify the role of data distribution in the observed gains and address potential confounding concerns. revision: yes

  2. Referee: [Method (§3.2)] The regularization is described as penalizing deviation of adversarial image features (feature-level) and logits, with claims that they benefit robustness and clean accuracy respectively. The exact mathematical form of these terms, the specific coefficient values used, and the procedure for selecting them (e.g., validation set or grid search) are not provided. Since these are free parameters, their sensitivity and selection must be detailed to support reproducibility and rule out hidden biases.

    Authors: We thank the referee for noting this omission, which affects reproducibility. The original manuscript provided only a high-level description of the regularization. In the revised Section 3.2, we now include the precise mathematical formulations of both the feature-level and logit-level regularization terms, the specific coefficient values used in all experiments, and the selection procedure (grid search over a held-out validation set). We have also added a brief sensitivity analysis in the supplementary material to demonstrate stability with respect to these hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with downstream validation

full rationale

The paper proposes an empirical finetuning paradigm (AdvFLYP) that applies a contrastive loss to web-sourced adversarial image-text pairs plus regularization terms, then validates it via accuracy and robustness metrics on 14 held-out datasets. No derivation chain, first-principles prediction, or fitted parameter reduces by construction to its own inputs; the performance claims rest on comparative experiments rather than self-referential definitions or load-bearing self-citation, and the method is validated against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The approach relies on standard contrastive loss and adversarial perturbation generation from prior work; no new free parameters beyond typical regularization coefficients are introduced in the abstract, and no new axioms or invented entities are postulated.

free parameters (1)
  • regularization coefficients
    Logit- and feature-level penalty strengths are mentioned and must be chosen or tuned to balance robustness and clean accuracy.
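One plausible selection procedure for these coefficients, sketched with hypothetical names and grid values (the paper's actual tuning protocol is not stated in this review):

```python
import itertools

def select_coefficients(evaluate, grid=(0.0, 0.1, 1.0, 10.0)):
    """Hypothetical grid search over the two regularisation weights.

    `evaluate(lam_logit, lam_feat)` is assumed to return a pair
    (robust_acc, clean_acc) measured on a held-out validation split;
    the grid values and the equal-weight sum used to rank settings are
    illustrative choices, not taken from the paper.
    """
    best_pair, best_score = None, float("-inf")
    for lam_logit, lam_feat in itertools.product(grid, repeat=2):
        robust, clean = evaluate(lam_logit, lam_feat)
        score = robust + clean  # equal-weight robustness/accuracy trade-off
        if score > best_score:
            best_pair, best_score = (lam_logit, lam_feat), score
    return best_pair
```

Any such criterion bakes in a robustness-versus-accuracy trade-off, which is why the ledger flags these coefficients as free parameters.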

pith-pipeline@v0.9.0 · 5552 in / 1232 out tokens · 35065 ms · 2026-05-10T15:53:41.278771+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Medoid Prototype Alignment for Cross-Plant Unknown Attack Detection in Industrial Control Systems

cs.CR · 2026-04 · unverdicted · novelty 6.0

    Medoid prototype alignment detects unknown attacks across industrial plants by aligning domain-specific medoid summaries rather than raw samples, yielding 0.843 average accuracy on gas and water system transfers.

Reference graph

Works this paper leans on

71 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. InAd- vances in Neural Information Processing Systems. Curran Associates, Inc., 2019. 6, 1

  2. [2]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014. 6

  3. [3]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In2017 ieee symposium on security and privacy (sp), pages 39–57. Ieee, 2017. 2, 3, 6

  4. [4]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 6

  5. [5]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. 6

  6. [6]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter- free attacks

    Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter- free attacks. InInternational conference on machine learning, pages 2206–2216. PMLR, 2020. 6

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 4, 6

  8. [8]

    Improving zero-shot adversarial robustness in vision-language models by closed- form alignment of adversarial path simplices

    Junhao Dong, Piotr Koniusz, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, and Yew-Soon Ong. Improving zero-shot adversarial robustness in vision-language models by closed- form alignment of adversarial path simplices. InForty-second International Conference on Machine Learning, 2025. 2, 3

  9. [9]

    One-shot learn- ing of object categories.IEEE transactions on pattern analy- sis and machine intelligence, 28(4):594–611, 2006

    Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot learn- ing of object categories.IEEE transactions on pattern analy- sis and machine intelligence, 28(4):594–611, 2006. 6

  10. [10]

    Finetune like you pretrain: Improved finetuning of zero-shot vision models

    Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023. 2, 3, 8

  11. [11]

    Caltech-256 object category dataset

    Gregory Griffin, Alex Holub, Pietro Perona, et al. Caltech-256 object category dataset. Technical report, Technical Report 7694, California Institute of Technology Pasadena, 2007. 6

  12. [12]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2

  13. [13]

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 6

  14. [14]

    The many faces of robust- ness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021. 6, 1

  15. [15]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021. 6, 1

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational confer- ence on machine learning, pages 4904–4916. PMLR, 2021. 1

  17. [17]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InPro- ceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013. 6

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 6

  19. [19]

    Im- agenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Im- agenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. 2

  20. [20]

    Fine-tuning can distort pre- trained features and underperform out-of-distribution

    Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pre- trained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022. 3

  21. [21]

    Tiny imagenet visual recognition challenge

    Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. 2015. 2, 6

  22. [22]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR,

  23. [23]

    Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

  24. [24]

    One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

    Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24408–24419, 2024. 3 9

  25. [25]

    Defense against adversarial attacks using high-level representation guided denoiser

    Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 1778–1787, 2018. 2

  26. [26]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

  27. [27]

    Image segmentation us- ing text and image prompts

    Timo L¨uddecke and Alexander Ecker. Image segmentation us- ing text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 1

  28. [28]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Con- ference on Learning Representations, 2018. 1, 2, 3, 4, 6, 5

  29. [29]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual clas- sification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6

  30. [30]

    Understanding zero-shot adversarial robust- ness for large-scale models

    Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Understanding zero-shot adversarial robust- ness for large-scale models. InThe Eleventh International Conference on Learning Representations, 2023. 1, 2, 3, 4, 5, 6

  31. [31]

    Context-aware robust fine-tuning.Interna- tional Journal of Computer Vision, 132(5):1685–1700, 2024

    Xiaofeng Mao, Yufeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, and Zhao Li. Context-aware robust fine-tuning.Interna- tional Journal of Computer Vision, 132(5):1685–1700, 2024. 3

  32. [32]

    Lipsum-FT: Ro- bust fine-tuning of zero-shot models using random text guid- ance

    Giung Nam, Byeongho Heo, and Juho Lee. Lipsum-FT: Ro- bust fine-tuning of zero-shot models using random text guid- ance. InThe Twelfth International Conference on Learning Representations, 2024. 3

  33. [33]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008. 6

  34. [34]

    Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

    Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sang- doo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024. 3

  35. [35]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 3

  36. [36]

    Chao Pan, Qing Li, and Xin Yao. Adversarial initialization with universal adversarial perturbation: A new approach to fast adversarial training.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21501–21509, 2024. 1

  37. [37]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 6

  38. [38]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 2085–2094,

  39. [39]

    What does a platypus look like? generating customized prompts for zero-shot image classification

    Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF international conference on computer vision, pages 15691–15701, 2023. 1

  40. [40]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 6

  41. [41]

    Overfitting in adver- sarially robust deep learning

    Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adver- sarially robust deep learning. InInternational conference on machine learning, pages 8093–8104. PMLR, 2020. 2

  42. [42]

    Im- proved zero-shot classification by adapting vlms with text descriptions

    Oindrila Saha, Grant Van Horn, and Subhransu Maji. Im- proved zero-shot classification by adapting vlms with text descriptions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17542–17552,

  43. [43]

    Interpreting and analysing clip’s zero-shot image classification via mutual knowledge.Advances in Neural Information Processing Sys- tems, 37:39597–39631, 2024

    Fawaz Sammani and Nikos Deligiannis. Interpreting and analysing clip’s zero-shot image classification via mutual knowledge.Advances in Neural Information Processing Sys- tems, 37:39597–39631, 2024. 1

  44. [44]

    Robust clip: Unsupervised adversar- ial fine-tuning of vision embeddings for robust large vision- language models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversar- ial fine-tuning of vision embeddings for robust large vision- language models. InInternational Conference on Machine Learning, pages 43685–43704. PMLR, 2024. 6

  45. [45]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs

    Christoph Schuhmann, Robert Kaczmarczyk, Aran Komat- suzaki, Aarush Katta, Richard Vencu, Romain Beaumont, Je- nia Jitsev, Theo Coombes, and Clayton Mullis. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In NeurIPS Workshop Datacentric AI, number FZJ-2022-00923. J¨ulich Supercomputing Center, 2021. 4

  46. [46]

    R-tpt: Improving adversarial robustness of vision-language mod- els through test-time prompt tuning

    Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language mod- els through test-time prompt tuning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025. 3

  47. [47]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. InInternational Conference on Learning Representations, 2014. 1, 2

  48. [48]

    On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach

    Baoshun Tong, Hanjiang Lai, Yan Pan, and Jian Yin. On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19921–19930, 2025. 3

  49. [49]

    Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019. 6, 1

  50. [50]

    Declip: Decoupled learning for open- vocabulary dense perception

    Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open- vocabulary dense perception. InProceedings of the Computer 10 Vision and Pattern Recognition Conference, pages 14824– 14834, 2025. 1

  51. [51]

    Pre- trained model guided fine-tuning for zero-shot adversarial robustness

    Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre- trained model guided fine-tuning for zero-shot adversarial robustness. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24502–24511,

  52. [52]

    Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

    Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19910–19920, 2025. 3

  53. [53]

    Quality text, robust vision: The role of language in enhancing visual robustness of vision-language models

    Futa Waseda, Saku Sugawara, and Isao Echizen. Quality text, robust vision: The role of language in enhancing visual robustness of vision-language models. InProceedings of the 33rd ACM International Conference on Multimedia, pages 4808–4816, 2025. 2, 3

  54. [54]

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010. 6

  55. [55]

    Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15172–15182, 2025. 2, 3

  56. [56]

    Lu Yu, Haiyang Zhang, and Changsheng Xu. Text-guided attention is all you need for zero-shot robustness in vision-language models. Advances in Neural Information Processing Systems, 37:96424–96448, 2024. 2, 4, 6

  57. [57]

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR,

  58. [58]

    Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. In European conference on computer vision, pages 56–72. Springer, 2024. 3

  59. [59]

    Mingkun Zhang, Keping Bi, Wei Chen, Jiafeng Guo, and Xueqi Cheng. CLIPure: Purification in latent space via CLIP for adversarially robust zero-shot classification. In The Thirteenth International Conference on Learning Representations,

  60. [60]

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8552–8562, 2022. 1

  61. [61]

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023. 2

  62. [62]

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022. 1

  63. [63]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022. 3

  64. [64]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 3

  65. [65]

    Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models. Advances in Neural Information Processing Systems, 37:3122–3156, 2024. 3

  66. [66]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024. 1

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
Supplementary Material

6. More Dataset Information

In the main paper (Tab. 5), we employ several popular variants of ImageNet that share the pre-defined classes partially or entirely, but have distinctly different data domains, to reflect the limitations of leveraging a large dataset with labelled classes as a proxy. These variants include ImageNet-R [14], ImageNet...

7. Other Ablation Studies

In the main paper, we conduct ablative studies on the regularisation terms (Sec. 4.4). In this section, we perform ablative studies on other training settings.

7.1. Data Amount and Batch Size

We implement AdvFLYPfull with varying amounts of image-text pairs and three batch sizes (256, 512 and 1024), and evaluate the performance o...

8. More Experimental Results

8.1. Robustness under Higher Attack Budgets

We report the full tables of robustness evaluated under the attack strengths of ϵ = 2/255 and ϵ = 4/255 in Tab. 10 and Tab. 11, respectively. It can be seen that our AdvFLYP still consistently outperforms the previous AFT paradigm TeCoA [30] and its regularisation-based advancem...
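The ℓ∞-bounded attacks behind such an evaluation can be sketched compactly. The snippet below is an illustration only, not the paper's code: it runs PGD against a toy linear softmax classifier (where the cross-entropy gradient has a closed form), with `eps` playing the role of the attack budget ϵ (e.g. 2/255 or 4/255); the step size heuristic is our own choice.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pgd_attack(W, x, y, eps, alpha=None, steps=10):
    """L_inf PGD maximising cross-entropy of a linear softmax classifier.

    W: (num_classes, dim) weights; x: (dim,) clean input in [0, 1];
    y: true class index; eps: attack budget (e.g. 2/255 or 4/255).
    """
    if alpha is None:
        alpha = eps / 4                            # illustrative step size
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - np.eye(W.shape[0])[y])   # d(CE)/dx in closed form
        x_adv = x_adv + alpha * np.sign(grad)      # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in valid pixel range
    return x_adv
```

Against a real vision encoder the gradient would come from backpropagation rather than the closed form used here, but the ball projection and sign-ascent loop are the same.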

9. Training Data Analysis

The current de facto standard paradigm for finetuning VLMs to achieve zero-shot adversarial robustness is largely based on the adversarial training (AT) principles of classical adversarial learning [28], which involve a dataset of labelled classes. This paradigm is reasonable in the sense that the finetuned CLIP is to be deployed ...
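In contrast, the objective AdvFLYP inherits from pretraining is the symmetric image-text contrastive (InfoNCE) loss, applied to adversarial image features and their paired text features instead of class labels. A minimal numpy sketch (illustrative; the function name and temperature value are our choices, not the authors'):

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_feats, txt_feats: (batch, dim); pair i is (img_feats[i], txt_feats[i]).
    In adversarial finetuning, img_feats would come from adversarial images.
    """
    # L2-normalise so similarities are cosine similarities, as in CLIP
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarities

    def ce(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (ce(logits) + ce(logits.T))
```

The loss is small when each image embedding is closest to its own caption's embedding and large when the pairing is scrambled, which is exactly the alignment pressure the pretraining objective exerts.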

  71. [71]

    Discussion on Feature Regularisation Wang et al. [51] propose logit-level regularisation on top of TeCoA, showing that it boosts the generalisation of ro- bustness and clean accuracy across downstream datasets by penalising the logit discrepancy between the adversarial logits by the target model fθ and adversarial logits by the frozen pretrained vision en...