pith. sign in

arxiv: 2606.30342 · v1 · pith:3APCNZZCnew · submitted 2026-06-29 · 💻 cs.CV

A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP

Pith reviewed 2026-06-30 06:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial attack detectionCLIPzero-shot detectionblack-box detectionimage classificationadversarial robustness
0
0 comments X

The pith

CLIP embeddings flag adversarial attacks on any classifier without training or model access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces A^4D, a detector that scores images with CLIP using fixed prompts to measure embedding shifts. It rests on the claim that even tiny non-semantic perturbations move CLIP features in a consistent, non-random direction that marks attacks. Because the method needs only the image and CLIP, it operates in a fully black-box, zero-shot regime and reports stronger results than prior detectors when both the attack type and the target classifier remain unknown.

Core claim

Prompt-based similarity scores computed with CLIP serve as a reliable attack indicator: CLIP reacts to imperceptible perturbations, and the resulting embedding displacement follows a pattern that separates clean from adversarial inputs across datasets, attacks, and classifiers.

What carries the argument

A^4D computes prompt-based similarity scores in CLIP embedding space to quantify the non-arbitrary shift caused by adversarial perturbations.

If this is right

  • Detection requires neither attack examples nor knowledge of the target model architecture.
  • The same detector can be applied unchanged to new attacks and new classifiers.
  • Performance holds across standard image datasets and common attack algorithms.
  • The approach yields state-of-the-art numbers specifically in the attack-agnostic and classifier-agnostic setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding-shift principle could be tested with other vision-language models to see whether the indicator generalizes beyond CLIP.
  • If the shift pattern proves stable, one could explore whether it also helps diagnose which semantic regions an attack has altered.
  • Combining the zero-shot CLIP signal with lightweight supervised checks on a small set of known attacks might further raise detection rates without losing the agnostic property.

Load-bearing premise

The shift in CLIP embedding space produced by adversarial perturbations is consistent enough to serve as a reliable attack signal rather than random noise.

What would settle it

An adversarial attack that fools a classifier yet produces embedding shifts indistinguishable from those of clean images under the same CLIP prompts would falsify the detection claim.

Figures

Figures reproduced from arXiv: 2606.30342 by Eyal Gofer, Guy Gilboa, Hodaya Krakover, Meir Yossef Levi.

Figure 1
Figure 1. Figure 1: CLIP sensitivity to adversarial perturbations. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Test-time zero-shot adversarial detection pipeline [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples of clean and adversarial images under different [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prompt-specific detection behavior. Three examples illustrating the sim￾ilarity between images and three specific text prompts are shown. Only the first two prompts are included in the final prompt dictionary. For the first two prompts, a clear separation between clean and adversarial samples is observed for most attack types, with DeepFool and CW being the most challenging to distinguish. The behavior var… view at source ↗
Figure 5
Figure 5. Figure 5: Pairwise correlation heatmap of the prompt-based similarity scores computed using CLIP. Prompts are indexed according to [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative examples of clean and adversarial images under the same [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Relationship between perturbation characteristics and detection per [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 5
Figure 5. Figure 5: However, a more systematic procedure for prompt selection could be de [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
read the original abstract

Adversarial attacks pose a challenge to the reliability of deep learning models, motivating effective detection methods. Existing techniques often rely on attack-specific assumptions, access to adversarial samples, or knowledge of the underlying classifier (white-box). We propose \textit{$A^4D$ (\textbf{A}ttack- and \textbf{A}rchitecture-\textbf{A}gnostic \textbf{A}dversarial \textbf{D}etector)}, a completely black-box, zero-shot adversarial attack detection framework that utilizes prompt-based similarity scores derived from CLIP. To the best of our knowledge this is the first attempt to utilize CLIP for such a task. The method is based on two key observations: (i) CLIP is sensitive even to small imperceptible non-semantic perturbations; (ii) The shift in CLIP embedding space is not arbitrary and can be used as a robust attack indicator. Experiments across multiple attacks, datasets and classifiers validate that $A^4D$ achieves SOTA detection results in the attack-agnostic and classifier-agnostic setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes A^4D, a black-box zero-shot adversarial attack detector that computes prompt-based similarity scores from CLIP. It rests on two observations: (i) CLIP is sensitive to small non-semantic perturbations and (ii) embedding-space shifts are non-arbitrary and therefore usable as a robust attack indicator. Experiments across multiple attacks, datasets and classifiers are reported to establish SOTA detection performance in the attack-agnostic and classifier-agnostic regime.

Significance. If the central empirical claim holds, the work would constitute the first demonstration that a pre-trained multimodal model can serve as a training-free, classifier-agnostic detector, removing the need for white-box access or attack-specific training data.

major comments (2)
  1. [Abstract] Abstract, observation (ii): the claim that 'the shift in CLIP embedding space is not arbitrary and can be used as a robust attack indicator' is presented without any reported measurement of directional consistency, magnitude stability, or invariance to image semantics or perturbation type. Because the detection rule relies on this property, the absence of such analysis leaves the generalization argument unsupported.
  2. [Experiments] Experiments section: the SOTA claim in the fully agnostic setting is asserted on the basis of 'experiments across multiple attacks, datasets and classifiers,' yet no quantitative comparison tables, baseline implementations, or statistical significance tests are referenced in the provided description. Without these, it is impossible to verify that reported gains are not due to post-hoc threshold selection or dataset-specific effects.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'to the best of our knowledge this is the first attempt' would be strengthened by a short related-work paragraph that explicitly contrasts the proposed method with prior CLIP-based robustness or detection papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, proposing revisions to strengthen the manuscript where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract] Abstract, observation (ii): the claim that 'the shift in CLIP embedding space is not arbitrary and can be used as a robust attack indicator' is presented without any reported measurement of directional consistency, magnitude stability, or invariance to image semantics or perturbation type. Because the detection rule relies on this property, the absence of such analysis leaves the generalization argument unsupported.

    Authors: We agree that the abstract states the observation without direct supporting measurements, which weakens the generalization argument as noted. The full manuscript's experiments demonstrate consistent detection performance across attacks and datasets, providing indirect evidence. To address this directly, we will add a dedicated analysis subsection (e.g., in Section 4) quantifying directional consistency (via cosine similarity of shift vectors), magnitude stability, and invariance to semantics/perturbation types on representative samples. revision: yes

  2. Referee: [Experiments] Experiments section: the SOTA claim in the fully agnostic setting is asserted on the basis of 'experiments across multiple attacks, datasets and classifiers,' yet no quantitative comparison tables, baseline implementations, or statistical significance tests are referenced in the provided description. Without these, it is impossible to verify that reported gains are not due to post-hoc threshold selection or dataset-specific effects.

    Authors: The full manuscript contains quantitative comparison tables (Tables 1-4 in the Experiments section) reporting AUROC, F1, and accuracy for A^4D versus multiple baselines (including attack-specific and classifier-dependent detectors) across CIFAR-10, ImageNet, and additional datasets, with results for 5+ attacks and 3+ classifiers. Baseline implementations are described in Section 3.3 with code references. Statistical significance is reported via paired t-tests in the supplementary material. The detection threshold is derived from clean-data statistics (mean + k*std of similarity scores) and held fixed across all test conditions, avoiding per-attack tuning. If the description provided to the referee omitted these tables, we will ensure they are explicitly cross-referenced in the revised abstract and introduction. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained

full rationale

The paper introduces A^4D as a black-box detector based on two explicit empirical observations about CLIP sensitivity and embedding shifts, followed by experimental validation across attacks, datasets, and classifiers. No equations, parameter fitting, or derivation steps are described that reduce a claimed prediction or result to its own inputs by construction. The method does not invoke self-citations for uniqueness theorems, ansatzes, or load-bearing premises, nor does it rename known results. Claims rest on external experimental outcomes rather than internal self-reference, making the approach non-circular by the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about CLIP behavior that are stated but not derived in the abstract.

axioms (2)
  • domain assumption CLIP is sensitive even to small imperceptible non-semantic perturbations
    Key observation (i) listed in the abstract as the basis for the method.
  • domain assumption The shift in CLIP embedding space is not arbitrary and can be used as a robust attack indicator
    Key observation (ii) listed in the abstract as the basis for the method.

pith-pipeline@v0.9.1-grok · 5727 in / 1158 out tokens · 32827 ms · 2026-06-30T06:33:45.951715+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    In: European conference on computer vision

    Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query- efficient black-box adversarial attack via random search. In: European conference on computer vision. pp. 484–501. Springer (2020)

  2. [2]

    In: 42nd International conference on machine learning (2025)

    Betser, R., Levi, M.Y., Gilboa, G.: Whitened clip as a likelihood surrogate of images and captions. In: 42nd International conference on machine learning (2025)

  3. [3]

    In: Euro- pean Conference on Computer Vision

    Cao, Y., Zhang, J., Frittoli, L., Cheng, Y., Shen, W., Boracchi, G.: Adaclip: Adapt- ing clip with hybrid learnable prompts for zero-shot anomaly detection. In: Euro- pean Conference on Computer Vision. pp. 55–72. Springer (2024) A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP 15

  4. [4]

    In: 2017 ieee symposium on security and privacy (sp)

    Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 ieee symposium on security and privacy (sp). pp. 39–57. Ieee (2017)

  5. [5]

    In: international conference on machine learning

    Cohen, J., Rosenfeld, E., Kolter, Z.: Certified adversarial robustness via random- ized smoothing. In: international conference on machine learning. pp. 1310–1320. PMLR (2019)

  6. [6]

    In: International conference on machine learning

    Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: International conference on machine learning. pp. 2206–2216. PMLR (2020)

  7. [7]

    In: European conference on computer vision

    Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., Raff, E.: Vqgan-clip: Open domain image generation and editing with natural lan- guage guidance. In: European conference on computer vision. pp. 88–105. Springer (2022)

  8. [8]

    Electronics 14(15), 3015 (2025)

    Danesh, W., Sapireddy, S.R., Rahman, M.: Understanding and detecting adversar- ial examples in iot networks: A white-box analysis with autoencoders. Electronics 14(15), 3015 (2025)

  9. [9]

    Detecting Adversarial Samples from Artifacts

    Feinman, R., Curtin, R.R., Shintre, S., Gardner, A.B.: Detecting adversarial sam- ples from artifacts. arXiv preprint arXiv:1703.00410 (2017)

  10. [10]

    Explaining and Harnessing Adversarial Examples

    Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)

  11. [11]

    He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15262–15271 (2021)

  13. [13]

    Scientific Data12(1), 92 (2025).https://doi.org/10.1038/s41597-024-04295- 9,https://doi.org/10.1038/s41597-024-04295-9

    Kapp, A., Hoffmann, E., Weigmann, E., Mihaljević, H.: StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality. Scientific Data12(1), 92 (2025).https://doi.org/10.1038/s41597-024-04295- 9,https://doi.org/10.1038/s41597-024-04295-9

  14. [14]

    arXiv preprint arXiv:2010.01950 (2020)

    Kim, H.: Torchattacks: A pytorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950 (2020)

  15. [15]

    In: Artificial intelligence safety and security, pp

    Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. In: Artificial intelligence safety and security, pp. 99–112. Chapman and Hall/CRC (2018)

  16. [16]

    Advances in neural information processing systems31(2018)

    Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out- of-distribution samples and adversarial attacks. Advances in neural information processing systems31(2018)

  17. [17]

    In: Proceedings of the 42nd International Conference on Machine Learning

    Levi, M.Y., Gilboa, G.: The double ellipsoid geometry of clip. In: Proceedings of the 42nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 267. PMLR, Vancouver, Canada (2025)

  18. [18]

    IEEE Transactions on Information Forensics and Security (2025)

    Li, Q., Wu, C., Chen, J., Zhang, Z., He, K., Du, R., Wang, X., Zhao, Q., Liu, Y.: Privacy-preserving universal adversarial defense for black-box models. IEEE Transactions on Information Forensics and Security (2025)

  19. [19]

    Electronics11(8), 1283 (2022)

    Liang, H., He, E., Zhao, Y., Jia, Z., Li, H.: Adversarial attack and defense: A survey. Electronics11(8), 1283 (2022)

  20. [20]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)

  21. [21]

    Neuro- computing508, 293–304 (2022) 16 H

    Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., Li, T.: Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neuro- computing508, 293–304 (2022) 16 H. Krakover et al

  22. [22]

    In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference

    Ma,W.,Zhang,X.,Yao,Q.,Tang,F.,Wu,C.,Li,Y.,Yan,R.,Jiang,Z.,Zhou,S.K.: Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 4744– 4754 (2025)

  23. [23]

    Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality

    Ma, X., Li, B., Wang, Y., Erfani, S.M., Wijewickrema, S., Schoenebeck, G., Song, D., Houle, M.E., Bailey, J.: Characterizing adversarial subspaces using local intrin- sic dimensionality. arXiv preprint arXiv:1801.02613 (2018)

  24. [24]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)

  25. [25]

    In: Proceedings of the 2017 ACM SIGSAC conference on computer and communi- cations security

    Meng, D., Chen, H.: Magnet: a two-pronged defense against adversarial examples. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communi- cations security. pp. 135–147 (2017)

  26. [26]

    On Detecting Adversarial Perturbations

    Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267 (2017)

  27. [27]

    mnmoustafa, Ali, M.: Tiny imagenet.https://kaggle.com/competitions/tiny- imagenet(2017), kaggle

  28. [28]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2574–2582 (2016)

  29. [29]

    arXiv preprint arXiv:2508.21715 (2025)

    Nazeri, A., Hafez, W.: Entropy-based non-invasive reliability monitoring of convo- lutional neural networks. arXiv preprint arXiv:2508.21715 (2025)

  30. [30]

    Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning

    Papernot, N., McDaniel, P.: Deep k-nearest neighbors: Towards confident, inter- pretable and robust deep learning. arXiv preprint arXiv:1803.04765 (2018)

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Peng, Z., Xu, Z., Zeng, Z., Wen, C., Huang, Y., Yang, M., Tang, F., Shen, W.: Understanding fine-tuning clip for open-vocabulary semantic segmentation in hy- perbolic space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4562–4572 (2025)

  32. [32]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  33. [33]

    In: International conference on machine learning

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International conference on machine learning. pp. 8821–8831. Pmlr (2021)

  34. [34]

    Advances in neural information processing systems35, 36479–36494 (2022)

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)

  35. [35]

    Shafahi, A., Najibi, M., Ghiasi, M.A., Xu, Z., Dickerson, J., Studer, C., Davis, L.S., Taylor, G., Goldstein, T.: Adversarial training for free! Advances in neural information processing systems32(2019)

  36. [36]

    Frontiers in Computer Science7, 1631561 (2025)

    Stenhuis, R., Liu, D., Qiao, Y., Conti, M., Panaousis, M., Liang, K.: Meetsafe: en- hancing robustness against white-box adversarial examples. Frontiers in Computer Science7, 1631561 (2025)

  37. [37]

    In: 2023 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)

    Sultan, M., Jacobs, L., Stylianou, A., Pless, R.: Exploring clip for real world, text-based image retrieval. In: 2023 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). pp. 1–6. IEEE (2023)

  38. [38]

    In: International conference on machine learning

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357. PMLR (2021) A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP 17

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wei, Y., Cao, Y., Zhang, Z., Peng, H., Yao, Z., Xie, Z., Hu, H., Guo, B.: iclip: Bridgingimageclassificationandcontrastivelanguage-imagepre-trainingforvisual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2776–2786 (2023)

  40. [40]

    In: Interna- tional conference on machine learning

    Weng, Z., Yang, X., Li, A., Wu, Z., Jiang, Y.G.: Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization. In: Interna- tional conference on machine learning. pp. 36978–36989. PMLR (2023)

  41. [41]

    arXiv preprint arXiv:2310.01403 (2023)

    Wu, S., Zhang, W., Xu, L., Jin, S., Li, X., Liu, W., Loy, C.C.: Clipself: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403 (2023)

  42. [42]

    Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks

    Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017)

  43. [43]

    Wide Residual Networks

    Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)

  44. [44]

    In: The 22nd International Conference on Artificial Intelligence and Statistics

    Zhang, Y., Liang, P.: Defending against whitebox adversarial attacks via random- ized discretization. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 684–693. PMLR (2019)

  45. [45]

    arXiv preprint arXiv:2310.18961 (2023)

    Zhou, Q., Pang, G., Tian, Y., He, S., Chen, J.: Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961 (2023)