pith. machine review for the scientific record.

arxiv: 2604.11576 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial robustness · vision-language models · zero-shot transfer · CLIP finetuning · contrastive learning · web image-text pairs · adversarial examples

The pith

Adversarial finetuning of CLIP on web image-text pairs with contrastive loss improves zero-shot robustness across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to strengthen the adversarial robustness of vision-language models such as CLIP while preserving their zero-shot transfer to new tasks and domains. Standard approaches finetune only the vision encoder on a clean, labeled proxy dataset such as ImageNet, matching adversarially perturbed images to class labels; this shifts both the data distribution and the objective away from pretraining and reduces transferability. AdvFLYP instead generates adversarial versions of web-collected image-text pairs and aligns them with their texts using the same contrastive loss as CLIP's original pretraining, plus a regularization term that penalizes large deviations between adversarial and clean image features. Logit-level regularization improves robustness, while feature-level regularization helps maintain clean accuracy. The setup is evaluated on 14 downstream datasets and shown to outperform mainstream finetuning practices.
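In outline, the pretraining-mimicking objective is the standard symmetric contrastive (InfoNCE) loss, applied to adversarial rather than clean image features. A minimal numpy sketch; the feature arrays and the temperature value are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs.

    In AdvFLYP, `img_feats` would come from adversarially perturbed web
    images, and row i must align with the embedding of its own caption
    (row i of `txt_feats`). The temperature is the common CLIP default,
    not a value taken from the paper.
    """
    # L2-normalise so dot products are cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # pair i matches caption i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average of image-to-text and text-to-image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimising this loss pulls each adversarial image toward its own caption and away from the other captions in the batch, which is exactly the alignment pressure CLIP's pretraining applies to clean images.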

Core claim

AdvFLYP finetunes CLIP with adversarial images generated from image-text pairs collected from the web, matching them with their corresponding texts via a contrastive loss. To alleviate distortion of the adversarial embeddings of noisy web images, it further regularizes training by penalizing the deviation of adversarial image features from their clean counterparts. The logit- and feature-level regularization terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show superiority over mainstream practices.

What carries the argument

The AdvFLYP paradigm: a contrastive loss applied to adversarial web image-text pairs, plus a deviation penalty on adversarial image features.
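A sketch of how such a deviation penalty might compose with the main loss; the squared-L2 form and the `lam_feat` / `lam_logit` coefficient names are assumptions for illustration, not the paper's Eq. (13):

```python
import numpy as np

def advflyp_regularised_loss(loss_clip, adv_feats, clean_feats,
                             adv_logits, clean_logits,
                             lam_feat=1.0, lam_logit=1.0):
    """Hedged sketch of a regularised AdvFLYP-style objective.

    The review states only that one term penalises deviation of
    adversarial image features and another penalises deviation at the
    logit level; the squared-L2 penalties and weights below are
    illustrative choices, not the paper's exact formulation.
    """
    # Feature-level term: keep adversarial embeddings near clean ones.
    l_feat = np.mean(np.sum((adv_feats - clean_feats) ** 2, axis=1))
    # Logit-level term: keep adversarial predictions near clean ones.
    l_logit = np.mean(np.sum((adv_logits - clean_logits) ** 2, axis=1))
    return loss_clip + lam_feat * l_feat + lam_logit * l_logit
```

With both coefficients at zero this reduces to the plain contrastive objective, which matches the paper's split between AdvFLYP and its regularised variant.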

If this is right

  • Robustness transfers better across domains and datasets than label-supervised adversarial finetuning.
  • Logit-level regularization specifically strengthens resistance to attacks while feature-level regularization protects clean performance.
  • The approach works with noisy web-scale pairs instead of curated clean labeled data.
  • Zero-shot capabilities remain closer to the original pretrained model than in standard adversarial finetuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pretraining-mimicking recipe could be tested on other vision-language models to check whether the robustness gain generalizes beyond CLIP.
  • Abundant web pairs might allow scaling the method to larger model sizes without the cost of collecting new labeled proxy sets.
  • The regularization split suggests that hybrid loss terms could be tuned per downstream task to trade robustness against accuracy as needed.

Load-bearing premise

Adversarial finetuning on noisy web image-text pairs with contrastive loss plus the proposed regularization will preserve zero-shot transferability across domains better than label-based proxy datasets, without regularization coefficients or web data quality introducing hidden biases or overfitting.

What would settle it

Run the same 14-dataset evaluation protocol but with a new held-out domain dataset and standard adversarial attack strengths; if the label-based ImageNet proxy method achieves higher robust accuracy while matching or exceeding AdvFLYP clean zero-shot accuracy, the superiority claim does not hold.
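For reference, the PGD attack behind such robust-accuracy numbers (e.g. PGD-10 at ϵ = 1/255 in Figure 3) can be sketched as follows; the `grad_fn` interface and the step-size heuristic are illustrative assumptions, not details from the paper:

```python
import numpy as np

def pgd_linf(grad_fn, x, eps=1/255, steps=10, alpha=None):
    """L-infinity projected gradient descent on inputs in [0, 1].

    `grad_fn(x)` is assumed to return the gradient of the classification
    loss with respect to the input. The 2.5*eps/steps step size is a
    common heuristic, not a value taken from the paper.
    """
    if alpha is None:
        alpha = 2.5 * eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                 # keep valid pixel range
    return x_adv
```

Robust accuracy is then the fraction of test inputs still classified correctly after this perturbation, which is what the proposed falsification test would compare between methods.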

Figures

Figures reproduced from arXiv: 2604.11576 by Jindong Gu, Nicu Sebe, Philip Torr, Songlong Xing, Weijie Wang, Zhengyu Zhao.

Figure 1
Figure 1: A basic illustration of current mainstream AFT methods.
Figure 2
Figure 2: Overview of the formulation of $L_{CLIP}$, the logit-level regularisation $L_{logit}$, and the feature-level regularisation $L_{feat}$. $L_{CLIP}$ is the main loss of AdvFLYP (Eq. (7)); $L_{logit}$ and $L_{feat}$ are only employed in the regularised variant of AdvFLYP (Eq. (13)), denoted AdvFLYPfull. The normalised text embeddings $T_\phi = \left[\frac{g_\phi(t_i)}{\lVert g_\phi(t_i)\rVert}\right]_{i=1}^{N} \in \mathbb{R}^{N\times d}$ are computed as follows: $P_\theta^{adv} = \mathrm{softmax}(X_\theta^{adv} T^\intercal) \in \mathbb{R}^{…}$
Figure 3
Figure 3: Performance of AdvFLYPfull averaged over 14 downstream datasets versus the amount of image-text pairs from the web. Robust accuracy is evaluated under PGD-10 (ϵ = 1/255).
Figure 4
Figure 4: An example of an image from ImageNet and its descrip…
Figure 5
Figure 5: t-SNE visualisation of adversarial and clean image fea…
read the original abstract

Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes AdvFLYP, a finetuning method for CLIP that generates adversarial examples from web-collected image-text pairs, optimizes them with a contrastive loss to align adversarial images to their texts, and adds logit- and feature-level regularization to mitigate embedding distortion. It claims this 'finetune like pretrain' paradigm preserves zero-shot adversarial robustness and transferability better than standard adversarial finetuning on curated proxy datasets like ImageNet with classification objectives, with superiority shown across 14 downstream datasets in various domains. Code and weights are released.

Significance. If the central claim holds after addressing controls, the work would demonstrate that aligning finetuning data distribution and objective with pretraining can improve robust zero-shot transfer in vision-language models, addressing a key limitation of current adversarial robustness methods. The broad evaluation across domains and open release of code strengthen potential impact for practical VLM deployment.

major comments (2)
  1. [Experiments (§4, Tables 1-3)] The central claim attributes gains to following the pretraining recipe (web image-text pairs + contrastive loss + regularization) rather than mainstream ImageNet label-based finetuning. However, no ablation is reported that applies the identical contrastive objective and regularization to ImageNet images paired with class-name texts. Without this control, it remains unclear whether the web data distribution or the objective change drives the zero-shot robustness improvements (see skeptic note on confounding). This is load-bearing for the paradigm's novelty.
  2. [Method (§3.2)] The regularization is described as penalizing deviation of adversarial image features (feature-level) and logits, with claims that they benefit robustness and clean accuracy respectively. The exact mathematical form of these terms, the specific coefficient values used, and the procedure for selecting them (e.g., validation set or grid search) are not provided. Since these are free parameters, their sensitivity and selection must be detailed to support reproducibility and rule out hidden biases.
minor comments (3)
  1. [Abstract] Abstract contains a subject-verb agreement issue: 'finetunes CLIP with adversarial images ... and match them' should read 'matches them'.
  2. [Experiments (§4)] The manuscript does not report statistical significance tests, standard deviations across multiple runs, or details on hyperparameter search for the reported gains on the 14 datasets.
  3. [Introduction/Related Work] Related work on adversarial robustness for VLMs (e.g., prior contrastive or prompt-based defenses) should be expanded with explicit comparisons to isolate the contribution of the proposed regularization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity, reproducibility, and support for the central claims.

read point-by-point responses
  1. Referee: [Experiments (§4, Tables 1-3)] The central claim attributes gains to following the pretraining recipe (web image-text pairs + contrastive loss + regularization) rather than mainstream ImageNet label-based finetuning. However, no ablation is reported that applies the identical contrastive objective and regularization to ImageNet images paired with class-name texts. Without this control, it remains unclear whether the web data distribution or the objective change drives the zero-shot robustness improvements (see skeptic note on confounding). This is load-bearing for the paradigm's novelty.

    Authors: We agree that this control ablation is important for isolating the contribution of the web image-text data distribution versus the contrastive objective and regularization. The original manuscript did not include an experiment applying the identical contrastive loss and regularization to ImageNet images paired with class-name texts. In the revised version, we have added this ablation study in Section 4. The new results help clarify the role of data distribution in the observed gains and address potential confounding concerns. revision: yes

  2. Referee: [Method (§3.2)] The regularization is described as penalizing deviation of adversarial image features (feature-level) and logits, with claims that they benefit robustness and clean accuracy respectively. The exact mathematical form of these terms, the specific coefficient values used, and the procedure for selecting them (e.g., validation set or grid search) are not provided. Since these are free parameters, their sensitivity and selection must be detailed to support reproducibility and rule out hidden biases.

    Authors: We thank the referee for noting this omission, which affects reproducibility. The original manuscript provided only a high-level description of the regularization. In the revised Section 3.2, we now include the precise mathematical formulations of both the feature-level and logit-level regularization terms, the specific coefficient values used in all experiments, and the selection procedure (grid search over a held-out validation set). We have also added a brief sensitivity analysis in the supplementary material to demonstrate stability with respect to these hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with downstream validation

full rationale

The paper proposes an empirical finetuning paradigm (AdvFLYP) that applies a contrastive loss to web-sourced adversarial image-text pairs plus regularization terms, then validates it via accuracy and robustness metrics on 14 held-out datasets. No derivation chain, first-principles prediction, or fitted parameter reduces by construction to its own inputs; the performance claims rest on comparative experiments rather than self-referential definitions or load-bearing self-citation, and the method is validated against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The approach relies on standard contrastive loss and adversarial perturbation generation from prior work; no new free parameters beyond typical regularization coefficients are introduced in the abstract, and no new axioms or invented entities are postulated.

free parameters (1)
  • regularization coefficients
    Logit- and feature-level penalty strengths are mentioned and must be chosen or tuned to balance robustness and clean accuracy.
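One plausible selection procedure for these coefficients, sketched with hypothetical names and grid values (the paper's actual tuning protocol is not stated in this review):

```python
import itertools

def select_coefficients(evaluate, grid=(0.0, 0.1, 1.0, 10.0)):
    """Hypothetical grid search over the two regularisation weights.

    `evaluate(lam_logit, lam_feat)` is assumed to return a pair
    (robust_acc, clean_acc) measured on a held-out validation split;
    the grid values and the equal-weight sum used to rank settings are
    illustrative choices, not taken from the paper.
    """
    best_pair, best_score = None, float("-inf")
    for lam_logit, lam_feat in itertools.product(grid, repeat=2):
        robust, clean = evaluate(lam_logit, lam_feat)
        score = robust + clean  # equal-weight robustness/accuracy trade-off
        if score > best_score:
            best_pair, best_score = (lam_logit, lam_feat), score
    return best_pair
```

Any such criterion bakes in a robustness-versus-accuracy trade-off, which is why the ledger flags these coefficients as free parameters.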

pith-pipeline@v0.9.0 · 5552 in / 1232 out tokens · 35065 ms · 2026-05-10T15:53:41.278771+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Medoid Prototype Alignment for Cross-Plant Unknown Attack Detection in Industrial Control Systems

cs.CR · 2026-04 · unverdicted · novelty 6.0

    Medoid prototype alignment detects unknown attacks across industrial plants by aligning domain-specific medoid summaries rather than raw samples, yielding 0.843 average accuracy on gas and water system transfers.

Reference graph

Works this paper leans on

71 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. InAd- vances in Neural Information Processing Systems. Curran Associates, Inc., 2019. 6, 1

  2. [2]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014. 6

  3. [3]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In2017 ieee symposium on security and privacy (sp), pages 39–57. Ieee, 2017. 2, 3, 6

  4. [4]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 6

  5. [5]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. 6

  6. [6]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter- free attacks

    Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter- free attacks. InInternational conference on machine learning, pages 2206–2216. PMLR, 2020. 6

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 4, 6

  8. [8]

    Improving zero-shot adversarial robustness in vision-language models by closed- form alignment of adversarial path simplices

    Junhao Dong, Piotr Koniusz, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, and Yew-Soon Ong. Improving zero-shot adversarial robustness in vision-language models by closed- form alignment of adversarial path simplices. InForty-second International Conference on Machine Learning, 2025. 2, 3

  9. [9]

    One-shot learn- ing of object categories.IEEE transactions on pattern analy- sis and machine intelligence, 28(4):594–611, 2006

    Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot learn- ing of object categories.IEEE transactions on pattern analy- sis and machine intelligence, 28(4):594–611, 2006. 6

  10. [10]

    Finetune like you pretrain: Improved finetuning of zero-shot vision models

    Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023. 2, 3, 8

  11. [11]

    Caltech-256 object category dataset

    Gregory Griffin, Alex Holub, Pietro Perona, et al. Caltech-256 object category dataset. Technical report, Technical Report 7694, California Institute of Technology Pasadena, 2007. 6

  12. [12]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2

  13. [13]

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 6

  14. [14]

    The many faces of robust- ness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021. 6, 1

  15. [15]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021. 6, 1

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational confer- ence on machine learning, pages 4904–4916. PMLR, 2021. 1

  17. [17]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InPro- ceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013. 6

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 6

  19. [19]

    Im- agenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Im- agenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. 2

  20. [20]

    Fine-tuning can distort pre- trained features and underperform out-of-distribution

    Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pre- trained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022. 3

  21. [21]

    Tiny imagenet visual recognition challenge

    Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. 2015. 2, 6

  22. [22]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR,

  23. [23]

    Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

  24. [24]

    One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

    Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24408–24419, 2024. 3 9

  25. [25]

    Defense against adversarial attacks using high-level representation guided denoiser

    Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 1778–1787, 2018. 2

  26. [26]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

  27. [27]

    Image segmentation us- ing text and image prompts

    Timo L¨uddecke and Alexander Ecker. Image segmentation us- ing text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022. 1

  28. [28]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Con- ference on Learning Representations, 2018. 1, 2, 3, 4, 6, 5

  29. [29]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual clas- sification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6

  30. [30]

    Understanding zero-shot adversarial robust- ness for large-scale models

    Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Understanding zero-shot adversarial robust- ness for large-scale models. InThe Eleventh International Conference on Learning Representations, 2023. 1, 2, 3, 4, 5, 6

  31. [31]

    Context-aware robust fine-tuning.Interna- tional Journal of Computer Vision, 132(5):1685–1700, 2024

    Xiaofeng Mao, Yufeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, and Zhao Li. Context-aware robust fine-tuning.Interna- tional Journal of Computer Vision, 132(5):1685–1700, 2024. 3

  32. [32]

    Lipsum-FT: Ro- bust fine-tuning of zero-shot models using random text guid- ance

    Giung Nam, Byeongho Heo, and Juho Lee. Lipsum-FT: Ro- bust fine-tuning of zero-shot models using random text guid- ance. InThe Twelfth International Conference on Learning Representations, 2024. 3

  33. [33]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008. 6

  34. [34]

    Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

    Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sang- doo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024. 3

  35. [35]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 3

  36. [36]

    Chao Pan, Qing Li, and Xin Yao. Adversarial initialization with universal adversarial perturbation: A new approach to fast adversarial training.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21501–21509, 2024. 1

  37. [37]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 6

  38. [38]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 2085–2094,

  39. [39]

    What does a platypus look like? generating customized prompts for zero-shot image classification

    Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF international conference on computer vision, pages 15691–15701, 2023. 1

  40. [40]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 6

  41. [41]

    Overfitting in adver- sarially robust deep learning

    Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adver- sarially robust deep learning. InInternational conference on machine learning, pages 8093–8104. PMLR, 2020. 2

  42. [42]

    Im- proved zero-shot classification by adapting vlms with text descriptions

    Oindrila Saha, Grant Van Horn, and Subhransu Maji. Im- proved zero-shot classification by adapting vlms with text descriptions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17542–17552,

  43. [43]

    Interpreting and analysing clip’s zero-shot image classification via mutual knowledge.Advances in Neural Information Processing Sys- tems, 37:39597–39631, 2024

    Fawaz Sammani and Nikos Deligiannis. Interpreting and analysing clip’s zero-shot image classification via mutual knowledge.Advances in Neural Information Processing Sys- tems, 37:39597–39631, 2024. 1

  44. [44]

    Robust clip: Unsupervised adversar- ial fine-tuning of vision embeddings for robust large vision- language models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversar- ial fine-tuning of vision embeddings for robust large vision- language models. InInternational Conference on Machine Learning, pages 43685–43704. PMLR, 2024. 6

  45. [45]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs

    Christoph Schuhmann, Robert Kaczmarczyk, Aran Komat- suzaki, Aarush Katta, Richard Vencu, Romain Beaumont, Je- nia Jitsev, Theo Coombes, and Clayton Mullis. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In NeurIPS Workshop Datacentric AI, number FZJ-2022-00923. J¨ulich Supercomputing Center, 2021. 4

  46. [46]

    R-tpt: Improving adversarial robustness of vision-language mod- els through test-time prompt tuning

    Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language mod- els through test-time prompt tuning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025. 3

  47. [47]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. InInternational Conference on Learning Representations, 2014. 1, 2

  48. [48]

    On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach

    Baoshun Tong, Hanjiang Lai, Yan Pan, and Jian Yin. On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19921–19930, 2025. 3

  49. [49]

    Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019. 6, 1

  50. [50]

    Declip: Decoupled learning for open- vocabulary dense perception

    Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open- vocabulary dense perception. InProceedings of the Computer 10 Vision and Pattern Recognition Conference, pages 14824– 14834, 2025. 1

  51. [51]

    Pre- trained model guided fine-tuning for zero-shot adversarial robustness

    Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre- trained model guided fine-tuning for zero-shot adversarial robustness. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24502–24511,

  52. [52]

    Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

    Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19910–19920, 2025. 3

  53. [53]

    Quality text, robust vision: The role of language in enhancing visual robustness of vision-language models

    Futa Waseda, Saku Sugawara, and Isao Echizen. Quality text, robust vision: The role of language in enhancing visual robustness of vision-language models. InProceedings of the 33rd ACM International Conference on Multimedia, pages 4808–4816, 2025. 2, 3

  54. [54]

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010. 6

  55. [55]

    Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15172–15182, 2025. 2, 3

  56. [56]

    Lu Yu, Haiyang Zhang, and Changsheng Xu. Text-guided attention is all you need for zero-shot robustness in vision-language models. Advances in Neural Information Processing Systems, 37:96424–96448, 2024. 2, 4, 6

  57. [57]

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR,

  58. [58]

    Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. In European conference on computer vision, pages 56–72. Springer, 2024. 3

  59. [59]

    Mingkun Zhang, Keping Bi, Wei Chen, Jiafeng Guo, and Xueqi Cheng. CLIPure: Purification in latent space via CLIP for adversarially robust zero-shot classification. In The Thirteenth International Conference on Learning Representations,

  60. [60]

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8552–8562, 2022. 1

  61. [61]

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023. 2

  62. [62]

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022. 1

  63. [63]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022. 3

  64. [64]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 3

  65. [65]

    Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models. Advances in Neural Information Processing Systems, 37:3122–3156, 2024. 3

  66. [66]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024. 1

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
Supplementary Material

6. More Dataset Information

In the main paper (Tab. 5), we employ several popular variants of ImageNet that share the pre-defined classes partially or entirely, but have distinctly different data domains, to reflect the limitations of leveraging a large dataset with labelled classes as a proxy. These variants include ImageNet-R [14], ImageNet...

7. Other Ablation Studies

In the main paper, we conduct ablative studies on the regularisation terms (Sec. 4.4). In this section, we perform ablative studies on other training settings.

7.1. Data Amount and Batch Size

We implement AdvFLYPfull with varying amounts of image-text pairs and three batch sizes (256, 512 and 1024), and evaluate the performance o...

8. More Experimental Results

8.1. Robustness under Higher Attack Budgets

We report the full tables of robustness evaluated under the attack strengths of ϵ = 2/255 and ϵ = 4/255 in Tab. 10 and Tab. 11, respectively. It can be seen that our AdvFLYP still consistently outperforms the previous AFT paradigm TeCoA [30] and its regularisation-based advancem...
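The ℓ∞-bounded attacks behind such an evaluation can be sketched compactly. The snippet below is an illustration only, not the paper's code: it runs PGD against a toy linear softmax classifier (where the cross-entropy gradient has a closed form), with `eps` playing the role of the attack budget ϵ (e.g. 2/255 or 4/255); the step size heuristic is our own choice.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pgd_attack(W, x, y, eps, alpha=None, steps=10):
    """L_inf PGD maximising cross-entropy of a linear softmax classifier.

    W: (num_classes, dim) weights; x: (dim,) clean input in [0, 1];
    y: true class index; eps: attack budget (e.g. 2/255 or 4/255).
    """
    if alpha is None:
        alpha = eps / 4                            # illustrative step size
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - np.eye(W.shape[0])[y])   # d(CE)/dx in closed form
        x_adv = x_adv + alpha * np.sign(grad)      # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in valid pixel range
    return x_adv
```

Against a real vision encoder the gradient would come from backpropagation rather than the closed form used here, but the ball projection and sign-ascent loop are the same.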

9. Training Data Analysis

The current de facto standard paradigm for finetuning VLMs to achieve zero-shot adversarial robustness is largely based on the adversarial training (AT) principles of classical adversarial learning [28], which involve a dataset of labelled classes. This paradigm is reasonable in the sense that the finetuned CLIP is to be deployed ...
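In contrast, the objective AdvFLYP inherits from pretraining is the symmetric image-text contrastive (InfoNCE) loss, applied to adversarial image features and their paired text features instead of class labels. A minimal numpy sketch (illustrative; the function name and temperature value are our choices, not the authors'):

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_feats, txt_feats: (batch, dim); pair i is (img_feats[i], txt_feats[i]).
    In adversarial finetuning, img_feats would come from adversarial images.
    """
    # L2-normalise so similarities are cosine similarities, as in CLIP
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarities

    def ce(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (ce(logits) + ce(logits.T))
```

The loss is small when each image embedding is closest to its own caption's embedding and large when the pairing is scrambled, which is exactly the alignment pressure the pretraining objective exerts.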

  71. [71]

    Discussion on Feature Regularisation Wang et al. [51] propose logit-level regularisation on top of TeCoA, showing that it boosts the generalisation of ro- bustness and clean accuracy across downstream datasets by penalising the logit discrepancy between the adversarial logits by the target model fθ and adversarial logits by the frozen pretrained vision en...