Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

arXiv: 2502.05206 · v6 · submitted 2025-02-02 · cs.CR · cs.AI · cs.CL · cs.CV

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang

Pith reviewed 2026-05-23 04:37 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.CL · cs.CV

keywords large language models · vision-language models · adversarial attacks · jailbreak attacks · model safety · backdoor attacks · agent safety · diffusion models

The pith

Large models and agents face categorized safety threats from adversarial attacks to jailbreaks and agent-specific risks, with defenses and open challenges reviewed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes safety threats to vision foundation models, large language models, vision-language models, diffusion models, and agents into a taxonomy that includes adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. It reviews available defense strategies for each threat type and summarizes commonly used datasets and benchmarks. The work then identifies open challenges such as the need for comprehensive safety evaluations, scalable defense mechanisms, and sustainable data practices, while calling for collective research community efforts and international collaboration to build better defense systems.

Core claim

The paper presents a comprehensive taxonomy of safety threats to large models and agents, reviews defense strategies where available, summarizes datasets and benchmarks, and discusses open challenges including comprehensive safety evaluations, scalable defenses, and sustainable data practices, along with the necessity of collective efforts.

What carries the argument

The taxonomy of safety threats, covering adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and agent-specific threats.
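The taxonomy can be written down as a small data structure. The sketch below is only an illustration: the class names `Threat` and `ModelFamily`, the `taxonomy_grid` helper, and the choice of a full cross product are assumptions, not something the survey defines. The category and model-family labels are taken from the abstract.

```python
from enum import Enum


class Threat(Enum):
    """Threat categories named in the survey's taxonomy."""
    ADVERSARIAL = "adversarial attacks"
    DATA_POISONING = "data poisoning"
    BACKDOOR = "backdoor attacks"
    JAILBREAK_PROMPT_INJECTION = "jailbreak and prompt injection attacks"
    ENERGY_LATENCY = "energy-latency attacks"
    EXTRACTION = "data and model extraction attacks"
    AGENT_SPECIFIC = "agent-specific threats"


class ModelFamily(Enum):
    """Model families the abstract says the survey covers."""
    VFM = "Vision Foundation Models"
    LLM = "Large Language Models"
    VLP = "Vision-Language Pre-training models"
    VLM = "Vision-Language Models"
    DM = "Diffusion Models"
    AGENT = "large-model-powered Agents"


def taxonomy_grid():
    """Return one empty bucket per (family, threat) pair.

    Assumption: a full cross product. The survey need not cover every
    pair, so some buckets are expected to stay empty.
    """
    return {(family, threat): [] for family in ModelFamily for threat in Threat}


# Usage: file a (hypothetical) paper identifier under one cell.
grid = taxonomy_grid()
grid[(ModelFamily.LLM, Threat.JAILBREAK_PROMPT_INJECTION)].append("example-paper-id")
```

Keeping the two axes as separate enums makes it cheap to add a category later without touching the papers already filed.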

If this is right

Defense strategies can be systematically developed and matched to each threat category in the taxonomy.
Researchers gain a shared reference for selecting datasets and benchmarks in safety evaluations.
Open challenges highlight priorities for future work on scalable defenses and sustainable data practices.
Collective community efforts are positioned as necessary to create comprehensive defense platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a starting point for automated tools that scan new papers for emerging threats not yet listed.
Similar survey structures might be applied to safety in other rapidly scaling AI domains such as robotics or multimodal systems.
If agent-specific threats grow in importance, the taxonomy may need periodic updates to remain current.
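As a rough illustration of the first extension above, a scanner could tag each new abstract with the taxonomy categories it mentions and flag those that match none. This is a minimal sketch under stated assumptions: the keyword lists and the function names `tag_abstract` and `flag_unlisted` are hypothetical, and plain substring matching is a stand-in for whatever a real tool would use.

```python
# Hypothetical keyword lists; the survey does not define these.
KEYWORDS = {
    "adversarial attacks": ["adversarial example", "adversarial perturbation"],
    "data poisoning": ["poisoning"],
    "backdoor attacks": ["backdoor", "trojan"],
    "jailbreak and prompt injection attacks": ["jailbreak", "prompt injection"],
    "energy-latency attacks": ["energy", "latency"],
    "data and model extraction attacks": ["extraction", "membership inference"],
    "agent-specific threats": ["agent"],
}


def tag_abstract(text):
    """Return the taxonomy categories whose keywords appear in the text.

    Matching is case-insensitive substring search, so it can over-match
    (for example "agent" also matches "reagent").
    """
    lowered = text.lower()
    return [
        category
        for category, words in KEYWORDS.items()
        if any(word in lowered for word in words)
    ]


def flag_unlisted(abstracts):
    """Return abstracts that match no category: candidates for manual review."""
    return [abstract for abstract in abstracts if not tag_abstract(abstract)]


sample = [
    "A new jailbreak attack on LLMs",
    "We study multi-party collusion in model markets",
    "Backdoor triggers in ViTs",
]
unlisted = flag_unlisted(sample)
```

A flagged abstract is not evidence of a new threat category on its own; it only marks text the keyword lists failed to place.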

Load-bearing premise

The survey's taxonomy and coverage are assumed to be representative of the full landscape of large model safety research without major omissions in a fast-moving field.

What would settle it

Identification of a major, previously undocumented category of safety threat to large models or agents that falls outside the proposed taxonomy would undermine the claim of comprehensive coverage.

Figures

Figures reproduced from arXiv: 2502.05206 by Baoyuan Wu, Bo Li, Chaowei Xiao, Cihang Xie, Cong Wang, Dacheng Tao, Hanxun Huang, Haonan Li, Hengyuan Xu, James Bailey, Jiaming Zhang, Jindong Gu, Jingfeng Zhang, Jun Sun, Masashi Sugiyama, Mingming Gong, Neil Gong, Ruofan Wang, Ruoxi Jia, Sarah Erfani, Shiqing Ma, Shirui Pan, Siheng Chen, Tianwei Zhang, Tianyu Pang, Tim Baldwin, Tongliang Liu, Xiangyu Zhang, Xiang Zheng, Xingjun Ma, Xin Wang, Xipeng Qiu, Xudong Han, Yang Bai, Yang Liu, Yang Zhang, Ye Sun, Yifan Ding, Yifeng Gao, Yige Li, Yiming Li, Yinpeng Dong, Yixu Wang, Yu-Gang Jiang, Yunhan Zhao, Yunhao Chen, Yutao Wu, Zuxuan Wu.

**Figure 1.** Left: The number of surveyed technical papers on attacks, defenses, and benchmarks/datasets. Middle: Distribution of surveyed technical papers by model type. Right: Distribution of surveyed technical papers by attack and defense type.

**Figure 2.** Left: The quarterly trend in the number of surveyed safety papers across different models. Middle: Proportional distribution of attack and defense studies associated with large models. Right: Annual trend in the number of surveyed safety papers on various attacks and defenses, ordered from most to least studied.

**Figure 3.** A road map of this survey.

Original abstract

The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard literature survey that organizes known safety threats across large models and agents without new results or a verifiable search method.

This survey pulls together existing work on safety issues for vision foundation models, LLMs, VLMs, diffusion models, and agents. It gives a taxonomy of threats (adversarial, poisoning, backdoors, jailbreaks, extraction, energy attacks, and some agent-specific ones), reviews defenses where they exist, lists benchmarks, and notes open challenges like better evaluation and scalable protections. The listed categories line up with what has already appeared in the literature over the last few years. The main service is collecting that material in one place and flagging the need for community-wide efforts on evaluation and data practices. That organizational step can save time for readers who want an entry point.

The central weakness is the lack of any stated search protocol, date range, or inclusion criteria. Without that, it is difficult to judge whether the taxonomy is complete or whether fast-moving areas such as multi-agent collusion or hardware extraction vectors were missed. Surveys in this space live or die on coverage and summary accuracy, and both are hard to assess from the abstract alone.

The paper does not claim new derivations, experiments, or first-principles analysis, so its value stays at the level of a reference compilation. It is the sort of document that helps a new PhD student or a practitioner get oriented, but it will not change how active researchers in the area think about the problems. A serious editor could send it for review if the full text shows careful, balanced summaries and reasonable coverage up to a clear cutoff; otherwise it risks being another broad but shallow overview. I would not cite it as a primary source, but I might point a student to it for background.

Referee Report

1 major / 0 minor

Summary. The paper presents a systematic survey of safety issues in large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. It contributes (1) a taxonomy of safety threats including adversarial attacks, data poisoning, backdoor attacks, jailbreak/prompt injection, energy-latency attacks, data/model extraction, and agent-specific threats; (2) a review of corresponding defense strategies and commonly used datasets/benchmarks; and (3) discussion of open challenges such as comprehensive evaluations, scalable defenses, and sustainable data practices, while calling for community and international collaboration.

Significance. If the taxonomy and coverage prove representative, the survey would provide a structured reference that organizes a fast-moving literature and surfaces actionable open problems (e.g., scalable defenses and multi-agent threats). The explicit inclusion of agent-specific threats and the call for collective efforts are constructive organizational contributions typical of useful surveys in the field.

major comments (1)

[Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on methodological transparency. We address the concern below and will revise the manuscript to strengthen the presentation of our survey process.

Referee: [Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.

Authors: We agree that an explicit description of the literature collection process would allow readers to better evaluate the scope and representativeness of the taxonomy. In the revised manuscript we will add a dedicated 'Survey Methodology' subsection (placed after the introduction) that specifies: (1) primary sources (arXiv, Google Scholar, proceedings of NeurIPS/ICLR/CVPR/ACL/EMNLP/USENIX Security, and selected workshops); (2) keyword combinations used for each model family and threat category; (3) inclusion criteria (peer-reviewed or preprint works from 2020 onward that directly address safety threats to the six model/agent types listed in the abstract); (4) exclusion criteria (purely theoretical works without empirical safety analysis, non-English papers, and duplicates); and (5) the effective cutoff (literature indexed through December 2024). We will also note that the taxonomy reflects the dominant threats discussed in the surveyed literature rather than claiming exhaustive coverage of every emerging sub-area (e.g., multi-agent collusion or hardware-side extraction), and we will explicitly flag these as open directions in the challenges section. This addition directly addresses the load-bearing claim without altering the core contributions.

Revision: yes
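The inclusion and exclusion criteria listed in the simulated rebuttal can be read as a screening filter over candidate papers. The sketch below is one possible reading, not the authors' procedure: the `Paper` record, its field names, the `screen` function, and title-based duplicate detection are all assumptions. The date window comes from the rebuttal text ("2020 onward", "through December 2024").

```python
from dataclasses import dataclass
from datetime import date

# Dates taken from the rebuttal text: "2020 onward" and "through December 2024".
START = date(2020, 1, 1)
CUTOFF = date(2024, 12, 31)


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; none of these field names come from the paper."""
    title: str
    published: date
    language: str                   # e.g. "en"
    has_empirical_analysis: bool    # False for purely theoretical work
    addresses_covered_threat: bool  # targets one of the six model/agent types


def screen(papers):
    """Apply the stated inclusion and exclusion criteria, keeping input order."""
    seen_titles = set()
    kept = []
    for paper in papers:
        key = paper.title.strip().lower()
        if key in seen_titles:      # exclusion: duplicates (by normalized title)
            continue
        seen_titles.add(key)
        in_window = START <= paper.published <= CUTOFF
        if not (in_window and paper.addresses_covered_threat):
            continue                # fails the inclusion criteria
        if paper.language != "en" or not paper.has_empirical_analysis:
            continue                # exclusion: non-English or purely theoretical
        kept.append(paper)
    return kept


candidates = [
    Paper("A", date(2023, 5, 1), "en", True, True),
    Paper("a ", date(2023, 5, 1), "en", True, True),   # duplicate of "A"
    Paper("B", date(2019, 12, 31), "en", True, True),  # before 2020
    Paper("C", date(2025, 1, 1), "en", True, True),    # after the cutoff
    Paper("D", date(2022, 3, 1), "en", False, True),   # purely theoretical
    Paper("E", date(2022, 3, 1), "de", True, True),    # non-English
]
kept_titles = [p.title for p in screen(candidates)]
```

Writing the criteria as code makes the referee's point concrete: each rule is a checkable predicate, so a reader could in principle rerun the screen and compare coverage.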

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or self-referential claims

This is a survey paper whose central contributions are a taxonomy of safety threats drawn from existing literature, a review of defenses and benchmarks, and discussion of open challenges. No equations, fitted parameters, predictions, or derivation chains exist in the document. The taxonomy is presented as a synthesis of prior work rather than derived from any internal definition or self-citation that reduces to the paper's own inputs. The claim of comprehensiveness is an editorial judgment about coverage, not a mathematical or statistical reduction that can be shown equivalent to the inputs by construction. Therefore the paper is self-contained as a review and receives score 0 with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the central claim rests on the assumption of comprehensive literature coverage rather than new free parameters, axioms, or invented entities.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
cs.CY 2026-04 accept novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
cs.CR 2026-05 conditional novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

Safety, Security, and Cognitive Risks in World Models
cs.CR 2026-04 unverdicted novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
cs.CL 2025-11 unverdicted novelty 6.0

EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
cs.AI 2025-09 unverdicted novelty 6.0

AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.

LeakyCLIP: Extracting Training Data from CLIP
cs.CR 2025-08 conditional novelty 6.0

LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
cs.CV 2025-03 unverdicted novelty 6.0

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA w...

SoK: Robustness in Large Language Models against Jailbreak Attacks
cs.CR 2026-05 accept novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 9 Pith papers · 14 internal anchors

[1]

Patch-fool: Are vision transformers always robust against adversarial perturbations?

Y . Fu, S. Zhang, S. Wu, C. Wan, and Y . Lin, “Patch-fool: Are vision transformers always robust against adversarial perturbations?” in ICLR, 2022

work page 2022
[2]

Slowformer: Adversarial attack on compute and energy consumption of efficient vision transformers,

K. Navaneet, S. A. Koohpayegani, E. Sleiman, and H. Pirsiavash, “Slowformer: Adversarial attack on compute and energy consumption of efficient vision transformers,” in CVPR, 2024

work page 2024
[3]

Pe-attack: On the universal positional embedding vulnerability in transformer-based models,

S. Gao, T. Chen, M. He, R. Xu, H. Zhou, and J. Li, “Pe-attack: On the universal positional embedding vulnerability in transformer-based models,” IEEE Transactions on Information Forensics and Security , vol. 19, pp. 9359–9373, 2024

work page 2024
[4]

Give me your attention: Dot-product attention considered harmful for adversarial patch robustness,

G. Lovisotto, N. Finnie, M. Munoz, C. K. Mummadi, and J. H. Metzen, “Give me your attention: Dot-product attention considered harmful for adversarial patch robustness,” in CVPR, 2022

work page 2022
[5]

Towards understanding and improving adversarial robustness of vision transformers,

S. Jain and T. Dutta, “Towards understanding and improving adversarial robustness of vision transformers,” in CVPR, 2024

work page 2024
[6]

On improving adversarial transferability of vision transformers,

M. Naseer, K. Ranasinghe, S. Khan, F. S. Khan, and F. Porikli, “On improving adversarial transferability of vision transformers,” arXiv preprint arXiv:2106.04169, 2021

work page arXiv 2021
[7]

Gen- erating transferable adversarial examples against vision transformers,

Y . Wang, J. Wang, Z. Yin, R. Gong, J. Wang, A. Liu, and X. Liu, “Gen- erating transferable adversarial examples against vision transformers,” in ACM MM, 2022

work page 2022
[8]

Towards transferable adversarial attacks on vision transformers,

Z. Wei, J. Chen, M. Goldblum, Z. Wu, T. Goldstein, and Y .-G. Jiang, “Towards transferable adversarial attacks on vision transformers,” in AAAI, 2022

work page 2022
[9]

Boosting adversarial transferability with learnable patch-wise masks,

X. Wei and S. Zhao, “Boosting adversarial transferability with learnable patch-wise masks,” IEEE Transactions on Multimedia , vol. 26, pp. 3778–3787, 2023

work page 2023
[10]

Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,

W. Ma, Y . Li, X. Jia, and W. Xu, “Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,” in ICCV, 2023

work page 2023
[11]

Transferable adversarial attacks on vision transformers with token gradient regularization,

J. Zhang, Y . Huang, W. Wu, and M. R. Lyu, “Transferable adversarial attacks on vision transformers with token gradient regularization,” in CVPR, 2023

work page 2023
[12]

Improving the adversarial transferability of vision transformers with virtual dense connection,

J. Zhang, Y . Huang, Z. Xu, W. Wu, and M. R. Lyu, “Improving the adversarial transferability of vision transformers with virtual dense connection,” in AAAI, 2024

work page 2024
[13]

Attacking transformers with feature diversity adversarial perturbation,

C. Gao, H. Zhou, J. Yu, Y . Ye, J. Cai, J. Wang, and W. Yang, “Attacking transformers with feature diversity adversarial perturbation,” in AAAI, 2024

work page 2024
[14]

Decision-based black-box attack against vision transformers via patch-wise adversarial removal,

Y . Shi, Y . Han, Y .-a. Tan, and X. Kuang, “Decision-based black-box attack against vision transformers via patch-wise adversarial removal,” NeurIPS, 2022

work page 2022
[15]

Improving transferable targeted adversarial attacks with model self-enhancement,

H. Wu, G. Ou, W. Wu, and Z. Zheng, “Improving transferable targeted adversarial attacks with model self-enhancement,” in CVPR, 2024

work page 2024
[16]

Improving transferability of adversarial samples via critical region-oriented feature-level attack,

Z. Li, M. Ren, F. Jiang, Q. Li, and Z. Sun, “Improving transferability of adversarial samples via critical region-oriented feature-level attack,” IEEE Transactions on Information Forensics and Security , vol. 19, p. 6650–6664, 2024

work page 2024
[17]

Adversarial token attacks on vision transformers,

A. Joshi, G. Jagatap, and C. Hegde, “Adversarial token attacks on vision transformers,” arXiv preprint arXiv:2110.04337, 2021

work page arXiv 2021
[18]

Understanding and improving adversarial transferability of vision transformers and convolutional neural networks,

Z. Chen, C. Xu, H. Lv, S. Liu, and Y . Ji, “Understanding and improving adversarial transferability of vision transformers and convolutional neural networks,” Information Sciences, vol. 648, p. 119474, 2023

work page 2023
[19]

Towards transferable adversarial attacks on image and video transformers,

Z. Wei, J. Chen, M. Goldblum, Z. Wu, T. Goldstein, Y .-G. Jiang, and L. S. Davis, “Towards transferable adversarial attacks on image and video transformers,” IEEE Transactions on Image Processing , vol. 32, pp. 6346–6358, 2023

work page 2023
[20]

Towards efficient adversarial training on vision transformers,

B. Wu, J. Gu, Z. Li, D. Cai, X. He, and W. Liu, “Towards efficient adversarial training on vision transformers,” in ECCV, 2022

work page 2022
[21]

Patch vestiges in the adversarial examples against vision trans- former can be leveraged for adversarial detection,

J. Li, “Patch vestiges in the adversarial examples against vision trans- former can be leveraged for adversarial detection,” in AAAI Workshop, 2022

work page 2022
[22]

Vitguard: Attention-aware detection against adversarial examples for vision trans- former,

S. Sun, K. Nwodo, S. Sugrim, A. Stavrou, and H. Wang, “Vitguard: Attention-aware detection against adversarial examples for vision trans- former,”arXiv preprint arXiv:2409.13828, 2024

work page arXiv 2024
[23]

Understanding and defending patched-based adversarial attacks for vision transformer,

L. Liu, Y . Guo, Y . Zhang, and J. Yang, “Understanding and defending patched-based adversarial attacks for vision transformer,” in ICML, 2023

work page 2023
[24]

Diffusion models demand contrastive guidance for adversarial purifi- cation to advance,

M. Bai, W. Huang, T. Li, A. Wang, J. Gao, C. F. Caiafa, and Q. Zhao, “Diffusion models demand contrastive guidance for adversarial purifi- cation to advance,” in ICML, 2024

work page 2024
[25]

Adbm: Adversarial diffusion bridge model for reliable adversarial purification,

X. Li, W. Sun, H. Chen, Q. Li, Y . Liu, Y . He, J. Shi, and X. Hu, “Adbm: Adversarial diffusion bridge model for reliable adversarial purification,” arXiv preprint arXiv:2408.00315, 2024. 47

work page arXiv 2024
[26]

Instant adversarial purification with adversarial consistency distillation,

C. T. Lei, H. M. Yam, Z. Guo, and C. P. Lau, “Instant adversarial purification with adversarial consistency distillation,” arXiv preprint arXiv:2408.17064, 2024

work page arXiv 2024
[27]

Are vision transformers robust to patch perturbations?

J. Gu, V . Tresp, and Y . Qin, “Are vision transformers robust to patch perturbations?” in ECCV, 2022

work page 2022
[28]

When adversarial train- ing meets vision transformers: Recipes from training to architecture,

Y . Mo, D. Wu, Y . Wang, Y . Guo, and Y . Wang, “When adversarial train- ing meets vision transformers: Recipes from training to architecture,” NeurIPS, 2022

work page 2022
[29]

Robustifying token attention for vision transformers,

Y . Guo, D. Stutz, and B. Schiele, “Robustifying token attention for vision transformers,” in ICCV, 2023

work page 2023
[30]

Improving robustness of vision transformers by reducing sensitivity to patch corruptions,

Y . Y . Guo, D. L. Stutz, and B. T. Schiele, “Improving robustness of vision transformers by reducing sensitivity to patch corruptions,” in CVPR, 2023

work page 2023
[31]

Improving interpretation faithfulness for vision transformers,

L. Hu, Y . Liu, N. Liu, M. Huai, L. Sun, and D. Wang, “Improving interpretation faithfulness for vision transformers,” in Proc. Int. Conf. Mach. Learn., 2024

work page 2024
[32]

Random entangled tokens for adversarially robust vision transformer,

H. Gong, M. Dong, S. Ma, S. Camtepe, S. Nepal, and C. Xu, “Random entangled tokens for adversarially robust vision transformer,” in CVPR, 2024

work page 2024
[33]

Diffusion models for adversarial purification,

W. Nie, B. Guo, Y . Huang, C. Xiao, A. Vahdat, and A. Anandkumar, “Diffusion models for adversarial purification,” in ICML, 2022

work page 2022
[34]

Purify++: Improving diffusion- purification with advanced diffusion models and control of random- ness,

B. Zhang, W. Luo, and Z. Zhang, “Purify++: Improving diffusion- purification with advanced diffusion models and control of random- ness,” arXiv preprint arXiv:2310.18762, 2023

work page arXiv 2023
[35]

Diffilter: Defending against adversarial perturbations with diffusion filter,

Y . Chen, X. Li, X. Wang, P. Hu, and D. Peng, “Diffilter: Defending against adversarial perturbations with diffusion filter,” IEEE Transac- tions on Information Forensics and Security , vol. 19, pp. 6779–6794, 2024

work page 2024
[36]

Mimicdiffusion: Purifying ad- versarial perturbation via mimicking clean diffusion model,

K. Song, H. Lai, Y . Pan, and J. Yin, “Mimicdiffusion: Purifying ad- versarial perturbation via mimicking clean diffusion model,” in CVPR, 2024

work page 2024
[37]

Lightpure: Realtime adversarial image purification for mobile devices using diffusion models,

H. Khalili, S. Park, V . Li, B. Bright, A. Payani, R. R. Kompella, and N. Sehatbakhsh, “Lightpure: Realtime adversarial image purification for mobile devices using diffusion models,” in ACM MobiCom, 2024

work page 2024
[38]

Lorid: Low-rank iterative diffusion for adversarial purifi- cation,

G. Zollicoffer, M. Vu, B. Nebgen, J. Castorena, B. Alexandrov, and M. Bhattarai, “Lorid: Low-rank iterative diffusion for adversarial purifi- cation,” arXiv preprint arXiv:2409.08255, 2024

work page arXiv 2024
[39]

You are catching my attention: Are vision transformers bad learners under backdoor attacks?

Z. Yuan, P. Zhou, K. Zou, and Y . Cheng, “You are catching my attention: Are vision transformers bad learners under backdoor attacks?” inCVPR, 2023

work page 2023
[40]

Trojvit: Trojan insertion in vision transformers,

M. Zheng, Q. Lou, and L. Jiang, “Trojvit: Trojan insertion in vision transformers,” in CVPR, 2023

work page 2023
[41]

Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers,

S. Yang, J. Bai, K. Gao, Y . Yang, Y . Li, and S.-T. Xia, “Not all prompts are secure: A switchable backdoor attack against pre-trained vision transfomers,” in CVPR, 2024

work page 2024
[42]

Dbia: Data-free backdoor attack against transformer networks,

P. Lv, H. Ma, J. Zhou, R. Liang, K. Chen, S. Zhang, and Y . Yang, “Dbia: Data-free backdoor attack against transformer networks,” in ICME, 2023

work page 2023
[43]

Multi-trigger backdoor attacks: More triggers, more threats,

Y . Li, X. Ma, J. He, H. Huang, and Y .-G. Jiang, “Multi-trigger backdoor attacks: More triggers, more threats,” arXiv preprint arXiv:2401.15295, 2024

work page arXiv 2024
[44]

Defending backdoor attacks on vision transformer via patch processing,

K. D. Doan, Y . Lao, P. Yang, and P. Li, “Defending backdoor attacks on vision transformer via patch processing,” in AAAI, 2023

work page 2023
[45]

A closer look at robustness of vision transformers to backdoor attacks,

A. Subramanya, S. A. Koohpayegani, A. Saha, A. Tejankar, and H. Pir- siavash, “A closer look at robustness of vision transformers to backdoor attacks,” in WACV, 2024

work page 2024
[46]

Backdoor attacks on vision transformers,

A. Subramanya, A. Saha, S. A. Koohpayegani, A. Tejankar, and H. Pirsiavash, “Backdoor attacks on vision transformers,”arXiv preprint arXiv:2206.08477, 2022

work page arXiv 2022
[47]

Practical region-level attack against segment anything models,

Y . Shen, Z. Li, and G. Wang, “Practical region-level attack against segment anything models,” in CVPR, 2024

work page 2024
[48]

Segment (almost) nothing: Prompt-agnostic adversarial attacks on segmentation models,

F. Croce and M. Hein, “Segment (almost) nothing: Prompt-agnostic adversarial attacks on segmentation models,” in SaTML, 2024

work page 2024
[49]

Attack-sam: Towards evaluating adversarial robustness of segment anything model,

C. Zhang, C. Zhang, T. Kang, D. Kim, S.-H. Bae, and I. S. Kweon, “Attack-sam: Towards evaluating adversarial robustness of segment anything model,” arXiv preprint arXiv:2305.00866, 2023

work page arXiv 2023
[50]

Black-box targeted adversarial attack on segment anything (sam),

S. Zheng and C. Zhang, “Black-box targeted adversarial attack on segment anything (sam),” arXiv preprint arXiv:2310.10010, 2023

work page arXiv 2023
[51]

Unsegment anything by simulating deformation,

J. Lu, X. Yang, and X. Wang, “Unsegment anything by simulating deformation,” in CVPR, 2024

work page 2024
[52]

Transferable adversarial attacks on sam and its downstream models,

S. Xia, W. Yang, Y . Yu, X. Lin, H. Ding, L. Duan, and X. Jiang, “Transferable adversarial attacks on sam and its downstream models,” in NeurIPS, 2024

work page 2024
[53]

Segment anything meets universal adversarial perturbation,

D. Han, S. Zheng, and C. Zhang, “Segment anything meets universal adversarial perturbation,” arXiv preprint arXiv:2310.12431, 2023

[54] Z. Zhou, Y. Song, M. Li, S. Hu, X. Wang, L. Y. Zhang, D. Yao, and H. Jin, “Darksam: Fooling segment anything model to segment nothing,” in NeurIPS, 2024

[55] B. Li, H. Xiao, and L. Tang, “Asam: Boosting segment anything model with adversarial tuning,” in CVPR, 2024

[56] Z. Guan, M. Hu, Z. Zhou, J. Zhang, S. Li, and N. Liu, “Badsam: Exploring security vulnerabilities of sam via backdoor attacks (student abstract),” in AAAI, 2024

[57] Y. Sun, H. Zhang, T. Zhang, X. Ma, and Y.-G. Jiang, “Unseg: One universal unlearnable example generator is enough against all image segmentation,” in NeurIPS, 2024

[58] N. Boucher, I. Shumailov, R. Anderson, and N. Papernot, “Bad characters: Imperceptible nlp attacks,” in IEEE S&P, 2022

[59] D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is bert really robust? a strong baseline for natural language attack on text classification and entailment,” in AAAI, 2020

[60] L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu, “Bert-attack: Adversarial attack against bert using bert,” in EMNLP, 2020

[61] C. Guo, A. Sablayrolles, H. Jégou, and D. Kiela, “Gradient-based adversarial attacks against text transformers,” in EMNLP, 2021

[62] A. Dirkson, S. Verberne, and W. Kraaij, “Breaking bert: Understanding its vulnerabilities for named entity recognition through adversarial attack,” arXiv preprint arXiv:2109.11308, 2021

[63] Y. Wang, P. Shi, and H. Zhang, “Gradient-based word substitution for obstinate adversarial examples generation in language models,” arXiv preprint arXiv:2307.12507, 2023

[64] H. Liu, C. Cai, and Y. Qi, “Expanding scope: Adapting english adversarial attacks to chinese,” in TrustNLP, 2023

[65] J. Wang, Z. Liu, K. H. Park, Z. Jiang, Z. Zheng, Z. Wu, M. Chen, and C. Xiao, “Adversarial demonstration attacks on large language models,” arXiv preprint arXiv:2305.14950, 2023

[66] B. Liu, B. Xiao, X. Jiang, S. Cen, X. He, and W. Dou, “Adversarial attacks on large language model-based system and mitigating strategies: A case study on chatgpt,” Security and Communication Networks, vol. 2023, p. 10, 2023

[67] A. Koleva, M. Ringsquandl, and V. Tresp, “Adversarial attacks on tables with entity swap,” arXiv preprint arXiv:2309.08650, 2023

[68] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” arXiv preprint arXiv:2309.00614, 2023

[69] A. Kumar, C. Agarwal, S. Srinivas, S. Feizi, and H. Lakkaraju, “Certifying llm safety against adversarial prompting,” arXiv preprint arXiv:2309.02705, 2023

[70] A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks, “Improving alignment and robustness with circuit breakers,” in NeurIPS, 2024

[71] Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-resource languages jailbreak gpt-4,” in NeurIPS Workshop, 2023

[72] Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu, “Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher,” arXiv preprint arXiv:2308.06463, 2023

[73] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?” NeurIPS, 2024

[74] J. Li, Y. Liu, C. Liu, L. Shi, X. Ren, Y. Zheng, Y. Liu, and Y. Xue, “A cross-language investigation into jailbreak attacks in large language models,” arXiv preprint arXiv:2401.16765, 2024

[75] W. Zhou, X. Wang, L. Xiong, H. Xia, Y. Gu, M. Chai, F. Zhu, C. Huang, S. Dou, Z. Xi et al., “Easyjailbreak: A unified framework for jailbreaking large language models,” arXiv preprint arXiv:2403.12171, 2024

[76] X. Zou, Y. Chen, and K. Li, “Is the system message really important to jailbreaks in large language models?” arXiv preprint arXiv:2402.14857, 2024

[77] Z. Xiao, Y. Yang, G. Chen, and Y. Chen, “Tastle: Distract large language models for automatic jailbreak attack,” in EMNLP, 2024

[78] B. Li, H. Xing, C. Huang, J. Qian, H. Xiao, L. Feng, and C. Tian, “Structuralsleight: Automated jailbreak attacks on large language models utilizing uncommon text-encoded structure,” arXiv preprint arXiv:2406.08754, 2024

[79] H. Lv, X. Wang, Y. Zhang, C. Huang, S. Dou, J. Ye, T. Gui, Q. Zhang, and X. Huang, “Codechameleon: Personalized encryption framework for jailbreaking large language models,” arXiv preprint arXiv:2402.16717, 2024

[80] Z. Chang, M. Li, Y. Liu, J. Wang, Q. Wang, and Y. Liu, “Play guessing game with llm: Indirect jailbreak attack with implicit clues,” in ACL, 2024
