
arxiv: 2502.05206 · v6 · submitted 2025-02-02 · 💻 cs.CR · cs.AI · cs.CL · cs.CV

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

Pith reviewed 2026-05-23 04:37 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.CL cs.CV
keywords large language models, vision-language models, adversarial attacks, jailbreak attacks, model safety, backdoor attacks, agent safety, diffusion models

The pith

Large models and agents face categorized safety threats from adversarial attacks to jailbreaks and agent-specific risks, with defenses and open challenges reviewed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes safety threats to vision foundation models, large language models, vision-language models, diffusion models, and agents into a taxonomy that includes adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. It reviews available defense strategies for each threat type and summarizes commonly used datasets and benchmarks. The work then identifies open challenges such as the need for comprehensive safety evaluations, scalable defense mechanisms, and sustainable data practices, while calling for collective research community efforts and international collaboration to build better defense systems.

Core claim

The paper presents a comprehensive taxonomy of safety threats to large models and agents, reviews defense strategies where available, summarizes datasets and benchmarks, and discusses open challenges including comprehensive safety evaluations, scalable defenses, and sustainable data practices, along with the necessity of collective efforts.

What carries the argument

the taxonomy of safety threats covering adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and agent-specific threats

If this is right

  • Defense strategies can be systematically developed and matched to each threat category in the taxonomy.
  • Researchers gain a shared reference for selecting datasets and benchmarks in safety evaluations.
  • Open challenges highlight priorities for future work on scalable defenses and sustainable data practices.
  • Collective community efforts are positioned as necessary to create comprehensive defense platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a starting point for automated tools that scan new papers for emerging threats not yet listed.
  • Similar survey structures might be applied to safety in other rapidly scaling AI domains such as robotics or multimodal systems.
  • If agent-specific threats grow in importance, the taxonomy may need periodic updates to remain current.
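
The first extension above, a tool that checks new papers against the taxonomy, might look roughly like the following. This is a minimal sketch and not something Pith or the survey authors provide: the category names follow the survey's taxonomy, but the keyword lists, the `tag_abstract` and `is_uncovered` helpers, and the plain substring matching are all illustrative assumptions.

```python
# Illustrative only: category names follow the survey's taxonomy, but the
# keyword lists and the substring-matching rule are assumptions of this sketch.
TAXONOMY_KEYWORDS = {
    "adversarial attacks": ["adversarial example", "adversarial perturbation", "adversarial attack"],
    "data poisoning": ["poisoning", "poisoned data"],
    "backdoor attacks": ["backdoor", "trojan"],
    "jailbreak and prompt injection": ["jailbreak", "prompt injection"],
    "energy-latency attacks": ["energy-latency", "energy consumption", "latency attack"],
    "data and model extraction": ["model extraction", "data extraction", "model stealing"],
    "agent-specific threats": ["agent safety", "multi-agent", "llm agent"],
}


def tag_abstract(abstract: str) -> list[str]:
    """Return the taxonomy categories whose keywords appear in the abstract."""
    text = abstract.lower()
    return [
        category
        for category, keywords in TAXONOMY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


def is_uncovered(abstract: str) -> bool:
    """Flag an abstract that matches no category as a candidate for manual review."""
    return not tag_abstract(abstract)


if __name__ == "__main__":
    example = "We propose a backdoor attack that uses data poisoning to implant triggers."
    print(tag_abstract(example))
    print(is_uncovered("A study of side-channel leakage on accelerators."))
```

An abstract that matches no category is only a prompt for a human to look closer; keyword matching this crude would miss paraphrased threats and cannot by itself establish that a new category exists.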

Load-bearing premise

The survey's taxonomy and coverage are assumed to be representative of the full landscape of large model safety research without major omissions in a fast-moving field.

What would settle it

Identification of a major, previously undocumented category of safety threat to large models or agents that falls outside the proposed taxonomy would undermine the claim of comprehensive coverage.

Figures

Figures reproduced from arXiv: 2502.05206 by Baoyuan Wu, Bo Li, Chaowei Xiao, Cihang Xie, Cong Wang, Dacheng Tao, Hanxun Huang, Haonan Li, Hengyuan Xu, James Bailey, Jiaming Zhang, Jindong Gu, Jingfeng Zhang, Jun Sun, Masashi Sugiyama, Mingming Gong, Neil Gong, Ruofan Wang, Ruoxi Jia, Sarah Erfani, Shiqing Ma, Shirui Pan, Siheng Chen, Tianwei Zhang, Tianyu Pang, Tim Baldwin, Tongliang Liu, Xiangyu Zhang, Xiang Zheng, Xingjun Ma, Xin Wang, Xipeng Qiu, Xudong Han, Yang Bai, Yang Liu, Yang Zhang, Ye Sun, Yifan Ding, Yifeng Gao, Yige Li, Yiming Li, Yinpeng Dong, Yixu Wang, Yu-Gang Jiang, Yunhan Zhao, Yunhao Chen, Yutao Wu, Zuxuan Wu.

Figure 1. Left: The number of surveyed technical papers on attacks, defenses, and benchmarks/datasets. Middle: Distribution of surveyed technical papers by model type. Right: Distribution of surveyed technical papers by attack and defense type.
Figure 2. Left: The quarterly trend in the number of surveyed safety papers across different models. Middle: Proportional distribution of attack and defense studies associated with large models. Right: Annual trend in the number of surveyed safety papers on various attacks and defenses, ordered from most to least studied.
Figure 3. A road map of this survey.
Original abstract

The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents a systematic survey of safety issues in large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. It contributes (1) a taxonomy of safety threats including adversarial attacks, data poisoning, backdoor attacks, jailbreak/prompt injection, energy-latency attacks, data/model extraction, and agent-specific threats; (2) a review of corresponding defense strategies and commonly used datasets/benchmarks; and (3) discussion of open challenges such as comprehensive evaluations, scalable defenses, and sustainable data practices, while calling for community and international collaboration.

Significance. If the taxonomy and coverage prove representative, the survey would provide a structured reference that organizes a fast-moving literature and surfaces actionable open problems (e.g., scalable defenses and multi-agent threats). The explicit inclusion of agent-specific threats and the call for collective efforts are constructive organizational contributions typical of useful surveys in the field.

major comments (1)
  1. [Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on methodological transparency. We address the concern below and will revise the manuscript to strengthen the presentation of our survey process.

Point-by-point responses
  1. Referee: [Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.

    Authors: We agree that an explicit description of the literature collection process would allow readers to better evaluate the scope and representativeness of the taxonomy. In the revised manuscript we will add a dedicated 'Survey Methodology' subsection (placed after the introduction) that specifies: (1) primary sources (arXiv, Google Scholar, proceedings of NeurIPS/ICLR/CVPR/ACL/EMNLP/USENIX Security, and selected workshops); (2) keyword combinations used for each model family and threat category; (3) inclusion criteria (peer-reviewed or preprint works from 2020 onward that directly address safety threats to the six model/agent types listed in the abstract); (4) exclusion criteria (purely theoretical works without empirical safety analysis, non-English papers, and duplicates); and (5) the effective cutoff (literature indexed through December 2024). We will also note that the taxonomy reflects the dominant threats discussed in the surveyed literature rather than claiming exhaustive coverage of every emerging sub-area (e.g., multi-agent collusion or hardware-side extraction), and we will explicitly flag these as open directions in the challenges section. This addition directly addresses the load-bearing claim without altering the core contributions.

    Revision: yes
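
The inclusion and exclusion criteria listed in the simulated rebuttal can be read as a screening filter over candidate papers. The sketch below is one hedged way to express that, assuming a hypothetical `PaperRecord` type with made-up field names, title-based de-duplication, and publication date as a stand-in for "indexed through December 2024". Only the criteria themselves come from the rebuttal text; nothing here is the authors' actual procedure.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical names throughout: PaperRecord, its fields, and screen() are
# assumptions for illustration. The criteria (2020 onward, through December
# 2024, English, empirical, in-scope model families, no duplicates) follow
# the rebuttal text.
MODEL_FAMILIES = {"VFM", "LLM", "VLP", "VLM", "DM", "Agent"}
START = date(2020, 1, 1)
CUTOFF = date(2024, 12, 31)


@dataclass(frozen=True)
class PaperRecord:
    title: str
    published: date
    language: str              # e.g. "en"
    model_family: str          # one of MODEL_FAMILIES when in scope
    has_empirical_analysis: bool


def screen(records):
    """Apply the stated inclusion/exclusion criteria and return kept records."""
    kept = []
    seen_titles = set()
    for rec in records:
        key = rec.title.strip().lower()
        if key in seen_titles:
            continue  # exclusion: duplicates (matched by normalized title)
        if not (START <= rec.published <= CUTOFF):
            continue  # inclusion window: 2020 through December 2024
        if rec.language != "en":
            continue  # exclusion: non-English papers
        if rec.model_family not in MODEL_FAMILIES:
            continue  # inclusion: must target one of the six model/agent types
        if not rec.has_empirical_analysis:
            continue  # exclusion: purely theoretical works
        seen_titles.add(key)
        kept.append(rec)
    return kept
```

Writing the criteria down in this form makes the referee's point concrete: each condition is a choice that changes which papers, and therefore which threat categories, end up in the taxonomy.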

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or self-referential claims

full rationale

This is a survey paper whose central contributions are a taxonomy of safety threats drawn from existing literature, a review of defenses and benchmarks, and discussion of open challenges. No equations, fitted parameters, predictions, or derivation chains exist in the document. The taxonomy is presented as a synthesis of prior work rather than derived from any internal definition or self-citation that reduces to the paper's own inputs. The claim of comprehensiveness is an editorial judgment about coverage, not a mathematical or statistical reduction that can be shown equivalent to the inputs by construction. Therefore the paper is self-contained as a review and receives score 0 with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the central claim rests on the assumption of comprehensive literature coverage rather than new free parameters, axioms, or invented entities.



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

    cs.CY 2026-04 accept novelty 8.0

    This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...

  2. Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

    cs.CR 2026-05 conditional novelty 7.0

    ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

  3. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  4. Safety, Security, and Cognitive Risks in World Models

    cs.CR 2026-04 unverdicted novelty 6.0

    World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

  5. Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

    cs.CL 2025-11 unverdicted novelty 6.0

    EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.

  6. AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

    cs.AI 2025-09 unverdicted novelty 6.0

    AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.

  7. LeakyCLIP: Extracting Training Data from CLIP

    cs.CR 2025-08 conditional novelty 6.0

    LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.

  8. RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

    cs.CV 2025-03 unverdicted novelty 6.0

    RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA w...

  9. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 9 Pith papers · 14 internal anchors

  1. [1]

    Patch-fool: Are vision transformers always robust against adversarial perturbations?

    Y . Fu, S. Zhang, S. Wu, C. Wan, and Y . Lin, “Patch-fool: Are vision transformers always robust against adversarial perturbations?” in ICLR, 2022

  2. [2]

    Slowformer: Adversarial attack on compute and energy consumption of efficient vision transformers,

    K. Navaneet, S. A. Koohpayegani, E. Sleiman, and H. Pirsiavash, “Slowformer: Adversarial attack on compute and energy consumption of efficient vision transformers,” in CVPR, 2024

  3. [3]

    Pe-attack: On the universal positional embedding vulnerability in transformer-based models,

    S. Gao, T. Chen, M. He, R. Xu, H. Zhou, and J. Li, “Pe-attack: On the universal positional embedding vulnerability in transformer-based models,” IEEE Transactions on Information Forensics and Security , vol. 19, pp. 9359–9373, 2024

  4. [4]

    Give me your attention: Dot-product attention considered harmful for adversarial patch robustness,

    G. Lovisotto, N. Finnie, M. Munoz, C. K. Mummadi, and J. H. Metzen, “Give me your attention: Dot-product attention considered harmful for adversarial patch robustness,” in CVPR, 2022

  5. [5]

    Towards understanding and improving adversarial robustness of vision transformers,

    S. Jain and T. Dutta, “Towards understanding and improving adversarial robustness of vision transformers,” in CVPR, 2024

  6. [6]

    On improving adversarial transferability of vision transformers,

    M. Naseer, K. Ranasinghe, S. Khan, F. S. Khan, and F. Porikli, “On improving adversarial transferability of vision transformers,” arXiv preprint arXiv:2106.04169, 2021

  7. [7]

    Gen- erating transferable adversarial examples against vision transformers,

    Y . Wang, J. Wang, Z. Yin, R. Gong, J. Wang, A. Liu, and X. Liu, “Gen- erating transferable adversarial examples against vision transformers,” in ACM MM, 2022

  8. [8]

    Towards transferable adversarial attacks on vision transformers,

    Z. Wei, J. Chen, M. Goldblum, Z. Wu, T. Goldstein, and Y .-G. Jiang, “Towards transferable adversarial attacks on vision transformers,” in AAAI, 2022

  9. [9]

    Boosting adversarial transferability with learnable patch-wise masks,

    X. Wei and S. Zhao, “Boosting adversarial transferability with learnable patch-wise masks,” IEEE Transactions on Multimedia , vol. 26, pp. 3778–3787, 2023

  10. [10]

    Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,

    W. Ma, Y . Li, X. Jia, and W. Xu, “Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,” in ICCV, 2023

  11. [11]

    Transferable adversarial attacks on vision transformers with token gradient regularization,

    J. Zhang, Y . Huang, W. Wu, and M. R. Lyu, “Transferable adversarial attacks on vision transformers with token gradient regularization,” in CVPR, 2023

  12. [12]

    Improving the adversarial transferability of vision transformers with virtual dense connection,

    J. Zhang, Y . Huang, Z. Xu, W. Wu, and M. R. Lyu, “Improving the adversarial transferability of vision transformers with virtual dense connection,” in AAAI, 2024

  13. [13]

    Attacking transformers with feature diversity adversarial perturbation,

    C. Gao, H. Zhou, J. Yu, Y . Ye, J. Cai, J. Wang, and W. Yang, “Attacking transformers with feature diversity adversarial perturbation,” in AAAI, 2024

  14. [14]

    Decision-based black-box attack against vision transformers via patch-wise adversarial removal,

    Y . Shi, Y . Han, Y .-a. Tan, and X. Kuang, “Decision-based black-box attack against vision transformers via patch-wise adversarial removal,” NeurIPS, 2022

  15. [15]

    Improving transferable targeted adversarial attacks with model self-enhancement,

    H. Wu, G. Ou, W. Wu, and Z. Zheng, “Improving transferable targeted adversarial attacks with model self-enhancement,” in CVPR, 2024

  16. [16]

    Improving transferability of adversarial samples via critical region-oriented feature-level attack,

    Z. Li, M. Ren, F. Jiang, Q. Li, and Z. Sun, “Improving transferability of adversarial samples via critical region-oriented feature-level attack,” IEEE Transactions on Information Forensics and Security , vol. 19, p. 6650–6664, 2024

  17. [17]

    Adversarial token attacks on vision transformers,

    A. Joshi, G. Jagatap, and C. Hegde, “Adversarial token attacks on vision transformers,” arXiv preprint arXiv:2110.04337, 2021

  18. [18]

    Understanding and improving adversarial transferability of vision transformers and convolutional neural networks,

    Z. Chen, C. Xu, H. Lv, S. Liu, and Y . Ji, “Understanding and improving adversarial transferability of vision transformers and convolutional neural networks,” Information Sciences, vol. 648, p. 119474, 2023

  19. [19]

    Towards transferable adversarial attacks on image and video transformers,

    Z. Wei, J. Chen, M. Goldblum, Z. Wu, T. Goldstein, Y .-G. Jiang, and L. S. Davis, “Towards transferable adversarial attacks on image and video transformers,” IEEE Transactions on Image Processing , vol. 32, pp. 6346–6358, 2023

  20. [20]

    Towards efficient adversarial training on vision transformers,

    B. Wu, J. Gu, Z. Li, D. Cai, X. He, and W. Liu, “Towards efficient adversarial training on vision transformers,” in ECCV, 2022

  21. [21]

    Patch vestiges in the adversarial examples against vision trans- former can be leveraged for adversarial detection,

    J. Li, “Patch vestiges in the adversarial examples against vision trans- former can be leveraged for adversarial detection,” in AAAI Workshop, 2022

  22. [22]

    Vitguard: Attention-aware detection against adversarial examples for vision trans- former,

    S. Sun, K. Nwodo, S. Sugrim, A. Stavrou, and H. Wang, “Vitguard: Attention-aware detection against adversarial examples for vision trans- former,”arXiv preprint arXiv:2409.13828, 2024

  23. [23]

    Understanding and defending patched-based adversarial attacks for vision transformer,

    L. Liu, Y . Guo, Y . Zhang, and J. Yang, “Understanding and defending patched-based adversarial attacks for vision transformer,” in ICML, 2023

  24. [24]

    Diffusion models demand contrastive guidance for adversarial purifi- cation to advance,

    M. Bai, W. Huang, T. Li, A. Wang, J. Gao, C. F. Caiafa, and Q. Zhao, “Diffusion models demand contrastive guidance for adversarial purifi- cation to advance,” in ICML, 2024

  25. [25]

    Adbm: Adversarial diffusion bridge model for reliable adversarial purification,

    X. Li, W. Sun, H. Chen, Q. Li, Y . Liu, Y . He, J. Shi, and X. Hu, “Adbm: Adversarial diffusion bridge model for reliable adversarial purification,” arXiv preprint arXiv:2408.00315, 2024. 47

  26. [26]

    Instant adversarial purification with adversarial consistency distillation,

    C. T. Lei, H. M. Yam, Z. Guo, and C. P. Lau, “Instant adversarial purification with adversarial consistency distillation,” arXiv preprint arXiv:2408.17064, 2024

  27. [27]

    Are vision transformers robust to patch perturbations?

    J. Gu, V . Tresp, and Y . Qin, “Are vision transformers robust to patch perturbations?” in ECCV, 2022

  28. [28]

    When adversarial train- ing meets vision transformers: Recipes from training to architecture,

    Y . Mo, D. Wu, Y . Wang, Y . Guo, and Y . Wang, “When adversarial train- ing meets vision transformers: Recipes from training to architecture,” NeurIPS, 2022

  29. [29]

    Robustifying token attention for vision transformers,

    Y . Guo, D. Stutz, and B. Schiele, “Robustifying token attention for vision transformers,” in ICCV, 2023

  30. [30]

    Improving robustness of vision transformers by reducing sensitivity to patch corruptions,

    Y . Y . Guo, D. L. Stutz, and B. T. Schiele, “Improving robustness of vision transformers by reducing sensitivity to patch corruptions,” in CVPR, 2023

  31. [31]

    Improving interpretation faithfulness for vision transformers,

    L. Hu, Y . Liu, N. Liu, M. Huai, L. Sun, and D. Wang, “Improving interpretation faithfulness for vision transformers,” in Proc. Int. Conf. Mach. Learn., 2024

  32. [32]

    Random entangled tokens for adversarially robust vision transformer,

    H. Gong, M. Dong, S. Ma, S. Camtepe, S. Nepal, and C. Xu, “Random entangled tokens for adversarially robust vision transformer,” in CVPR, 2024

  33. [33]

    Diffusion models for adversarial purification,

    W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandkumar, “Diffusion models for adversarial purification,” in ICML, 2022

  34. [34]

    Purify++: Improving diffusion- purification with advanced diffusion models and control of random- ness,

    B. Zhang, W. Luo, and Z. Zhang, “Purify++: Improving diffusion- purification with advanced diffusion models and control of random- ness,” arXiv preprint arXiv:2310.18762, 2023

  35. [35]

    Diffilter: Defending against adversarial perturbations with diffusion filter,

    Y . Chen, X. Li, X. Wang, P. Hu, and D. Peng, “Diffilter: Defending against adversarial perturbations with diffusion filter,” IEEE Transac- tions on Information Forensics and Security , vol. 19, pp. 6779–6794, 2024

  36. [36]

    Mimicdiffusion: Purifying ad- versarial perturbation via mimicking clean diffusion model,

    K. Song, H. Lai, Y . Pan, and J. Yin, “Mimicdiffusion: Purifying ad- versarial perturbation via mimicking clean diffusion model,” in CVPR, 2024

  37. [37]

    Lightpure: Realtime adversarial image purification for mobile devices using diffusion models,

    H. Khalili, S. Park, V . Li, B. Bright, A. Payani, R. R. Kompella, and N. Sehatbakhsh, “Lightpure: Realtime adversarial image purification for mobile devices using diffusion models,” in ACM MobiCom, 2024

  38. [38]

    Lorid: Low-rank iterative diffusion for adversarial purifi- cation,

    G. Zollicoffer, M. Vu, B. Nebgen, J. Castorena, B. Alexandrov, and M. Bhattarai, “Lorid: Low-rank iterative diffusion for adversarial purifi- cation,” arXiv preprint arXiv:2409.08255, 2024

  39. [39]

    You are catching my attention: Are vision transformers bad learners under backdoor attacks?

    Z. Yuan, P. Zhou, K. Zou, and Y . Cheng, “You are catching my attention: Are vision transformers bad learners under backdoor attacks?” inCVPR, 2023

  40. [40]

    Trojvit: Trojan insertion in vision transformers,

    M. Zheng, Q. Lou, and L. Jiang, “Trojvit: Trojan insertion in vision transformers,” in CVPR, 2023

  41. [41]

    Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers,

    S. Yang, J. Bai, K. Gao, Y . Yang, Y . Li, and S.-T. Xia, “Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers,” in CVPR, 2024

  42. [42]

    Dbia: Data-free backdoor attack against transformer networks,

    P. Lv, H. Ma, J. Zhou, R. Liang, K. Chen, S. Zhang, and Y . Yang, “Dbia: Data-free backdoor attack against transformer networks,” in ICME, 2023

  43. [43]

    Multi-trigger backdoor attacks: More triggers, more threats,

    Y . Li, X. Ma, J. He, H. Huang, and Y .-G. Jiang, “Multi-trigger backdoor attacks: More triggers, more threats,” arXiv preprint arXiv:2401.15295, 2024

  44. [44]

    Defending backdoor attacks on vision transformer via patch processing,

    K. D. Doan, Y . Lao, P. Yang, and P. Li, “Defending backdoor attacks on vision transformer via patch processing,” in AAAI, 2023

  45. [45]

    A closer look at robustness of vision transformers to backdoor attacks,

    A. Subramanya, S. A. Koohpayegani, A. Saha, A. Tejankar, and H. Pir- siavash, “A closer look at robustness of vision transformers to backdoor attacks,” in WACV, 2024

  46. [46]

    Backdoor attacks on vision transformers,

    A. Subramanya, A. Saha, S. A. Koohpayegani, A. Tejankar, and H. Pirsiavash, “Backdoor attacks on vision transformers,”arXiv preprint arXiv:2206.08477, 2022

  47. [47]

    Practical region-level attack against segment anything models,

    Y . Shen, Z. Li, and G. Wang, “Practical region-level attack against segment anything models,” in CVPR, 2024

  48. [48]

    Segment (almost) nothing: Prompt-agnostic adversarial attacks on segmentation models,

    F. Croce and M. Hein, “Segment (almost) nothing: Prompt-agnostic adversarial attacks on segmentation models,” in SaTML, 2024

  49. [49]

    Attack-sam: Towards evaluating adversarial robustness of segment anything model,

    C. Zhang, C. Zhang, T. Kang, D. Kim, S.-H. Bae, and I. S. Kweon, “Attack-sam: Towards evaluating adversarial robustness of segment anything model,” arXiv preprint arXiv:2305.00866, 2023

  50. [50]

    Black-box targeted adversarial attack on segment anything (sam),

    S. Zheng and C. Zhang, “Black-box targeted adversarial attack on segment anything (sam),” arXiv preprint arXiv:2310.10010, 2023

  51. [51]

    Unsegment anything by simulating deformation,

    J. Lu, X. Yang, and X. Wang, “Unsegment anything by simulating deformation,” in CVPR, 2024

  52. [52]

    Transferable adversarial attacks on sam and its downstream models,

    S. Xia, W. Yang, Y . Yu, X. Lin, H. Ding, L. Duan, and X. Jiang, “Transferable adversarial attacks on sam and its downstream models,” in NeurIPS, 2024

  53. [53]

    Segment anything meets universal adversarial perturbation,

    D. Han, S. Zheng, and C. Zhang, “Segment anything meets universal adversarial perturbation,” arXiv preprint arXiv:2310.12431, 2023

  54. [54]

    Darksam: Fooling segment anything model to segment nothing,

    Z. Zhou, Y . Song, M. Li, S. Hu, X. Wang, L. Y . Zhang, D. Yao, and H. Jin, “Darksam: Fooling segment anything model to segment nothing,” in NeurIPS, 2024

  55. [55]

    Asam: Boosting segment anything model with adversarial tuning,

    B. Li, H. Xiao, and L. Tang, “Asam: Boosting segment anything model with adversarial tuning,” in CVPR, 2024

  56. [56]

    Badsam: Exploring security vulnerabilities of sam via backdoor attacks (student abstract),

    Z. Guan, M. Hu, Z. Zhou, J. Zhang, S. Li, and N. Liu, “Badsam: Exploring security vulnerabilities of sam via backdoor attacks (student abstract),” in AAAI, 2024

  57. [57]

    Unseg: One universal unlearnable example generator is enough against all image segmentation,

    Y . Sun, H. Zhang, T. Zhang, X. Ma, and Y .-G. Jiang, “Unseg: One universal unlearnable example generator is enough against all image segmentation,” in NeurIPS, 2024

  58. [58]

    Bad charac- ters: Imperceptible nlp attacks,

    N. Boucher, I. Shumailov, R. Anderson, and N. Papernot, “Bad charac- ters: Imperceptible nlp attacks,” in IEEE S&P, 2022

  59. [59]

    Is bert really robust? a strong baseline for natural language attack on text classification and entailment,

    D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is bert really robust? a strong baseline for natural language attack on text classification and entailment,” in AAAI, 2020

  60. [60]

    Bert-attack: Adversarial attack against bert using bert,

    L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu, “Bert-attack: Adversarial attack against bert using bert,” in EMNLP, 2020

  61. [61]

    Gradient-based adversarial attacks against text transformers,

    C. Guo, A. Sablayrolles, H. Jégou, and D. Kiela, “Gradient-based adversarial attacks against text transformers,” in EMNLP, 2021

  62. [62]

    Breaking bert: Understanding its vulnerabilities for named entity recognition through adversarial attack,

    A. Dirkson, S. Verberne, and W. Kraaij, “Breaking bert: Understanding its vulnerabilities for named entity recognition through adversarial attack,” arXiv preprint arXiv:2109.11308, 2021

  63. [63]

    Gradient-based word substitution for obstinate adversarial examples generation in language models,

    Y . Wang, P. Shi, and H. Zhang, “Gradient-based word substitution for obstinate adversarial examples generation in language models,” arXiv preprint arXiv:2307.12507, 2023

  64. [64]

    Expanding scope: Adapting english adver- sarial attacks to chinese,

    H. Liu, C. Cai, and Y . Qi, “Expanding scope: Adapting english adver- sarial attacks to chinese,” in TrustNLP, 2023

  65. [65]

    Adversarial demonstration attacks on large language models,

    J. Wang, Z. Liu, K. H. Park, Z. Jiang, Z. Zheng, Z. Wu, M. Chen, and C. Xiao, “Adversarial demonstration attacks on large language models,” arXiv preprint arXiv:2305.14950, 2023

  66. [66]

    Adversarial attacks on large language model-based system and mitigating strategies: A case study on chatgpt,

    B. Liu, B. Xiao, X. Jiang, S. Cen, X. He, and W. Dou, “Adversarial attacks on large language model-based system and mitigating strategies: A case study on chatgpt,” Security and Communication Networks, vol. 2023, p. 10, 2023

  67. [67]

    Adversarial attacks on tables with entity swap,

    A. Koleva, M. Ringsquandl, and V. Tresp, “Adversarial attacks on tables with entity swap,” arXiv preprint arXiv:2309.08650, 2023

  68. [68]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” arXiv preprint arXiv:2309.00614, 2023

  69. [69]

    Certifying llm safety against adversarial prompting,

    A. Kumar, C. Agarwal, S. Srinivas, S. Feizi, and H. Lakkaraju, “Certifying llm safety against adversarial prompting,” arXiv preprint arXiv:2309.02705, 2023

  70. [70]

    Improving alignment and robustness with circuit breakers,

    A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks, “Improving alignment and robustness with circuit breakers,” in NeurIPS, 2024

  71. [71]

    Low-resource languages jailbreak gpt-4,

    Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-resource languages jailbreak gpt-4,” in NeurIPS Workshop, 2023

  72. [72]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher,

    Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu, “Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher,” arXiv preprint arXiv:2308.06463, 2023

  73. [73]

    Jailbroken: How does llm safety training fail?

    A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?” in NeurIPS, 2024

  74. [74]

    A cross-language investigation into jailbreak attacks in large language models,

    J. Li, Y. Liu, C. Liu, L. Shi, X. Ren, Y. Zheng, Y. Liu, and Y. Xue, “A cross-language investigation into jailbreak attacks in large language models,” arXiv preprint arXiv:2401.16765, 2024

  75. [75]

    Easyjailbreak: A unified framework for jailbreaking large language models,

    W. Zhou, X. Wang, L. Xiong, H. Xia, Y. Gu, M. Chai, F. Zhu, C. Huang, S. Dou, Z. Xi et al., “Easyjailbreak: A unified framework for jailbreaking large language models,” arXiv preprint arXiv:2403.12171, 2024

  76. [76]

    Is the system message really important to jailbreaks in large language models?

    X. Zou, Y. Chen, and K. Li, “Is the system message really important to jailbreaks in large language models?” arXiv preprint arXiv:2402.14857, 2024

  77. [77]

    Tastle: Distract large language models for automatic jailbreak attack,

    Z. Xiao, Y. Yang, G. Chen, and Y. Chen, “Tastle: Distract large language models for automatic jailbreak attack,” in EMNLP, 2024

  78. [78]

    Structuralsleight: Automated jailbreak attacks on large language models utilizing uncommon text-encoded structure,

    B. Li, H. Xing, C. Huang, J. Qian, H. Xiao, L. Feng, and C. Tian, “Structuralsleight: Automated jailbreak attacks on large language models utilizing uncommon text-encoded structure,” arXiv preprint arXiv:2406.08754, 2024

  79. [79]

    Codechameleon: Personalized encryption framework for jailbreaking large language models,

    H. Lv, X. Wang, Y. Zhang, C. Huang, S. Dou, J. Ye, T. Gui, Q. Zhang, and X. Huang, “Codechameleon: Personalized encryption framework for jailbreaking large language models,” arXiv preprint arXiv:2402.16717, 2024

  80. [80]

    Play guessing game with llm: Indirect jailbreak attack with implicit clues,

    Z. Chang, M. Li, Y. Liu, J. Wang, Q. Wang, and Y. Liu, “Play guessing game with llm: Indirect jailbreak attack with implicit clues,” in ACL, 2024

Showing first 80 references.