Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
Pith reviewed 2026-05-23 04:37 UTC · model grok-4.3
The pith
Large models and agents face categorized safety threats from adversarial attacks to jailbreaks and agent-specific risks, with defenses and open challenges reviewed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a comprehensive taxonomy of safety threats to large models and agents, reviews defense strategies where available, summarizes datasets and benchmarks, and discusses open challenges including comprehensive safety evaluations, scalable defenses, and sustainable data practices, along with the necessity of collective efforts.
What carries the argument
The taxonomy of safety threats, covering adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and agent-specific threats.
If this is right
- Defense strategies can be systematically developed and matched to each threat category in the taxonomy.
- Researchers gain a shared reference for selecting datasets and benchmarks in safety evaluations.
- Open challenges highlight priorities for future work on scalable defenses and sustainable data practices.
- Collective community efforts are positioned as necessary to create comprehensive defense platforms.
Where Pith is reading between the lines
- The taxonomy could serve as a starting point for automated tools that scan new papers for emerging threats not yet listed.
- Similar survey structures might be applied to safety in other rapidly scaling AI domains such as robotics or multimodal systems.
- If agent-specific threats grow in importance, the taxonomy may need periodic updates to remain current.
Load-bearing premise
The survey's taxonomy and coverage are assumed to be representative of the full landscape of large model safety research without major omissions in a fast-moving field.
What would settle it
Identification of a major, previously undocumented category of safety threat to large models or agents that falls outside the proposed taxonomy would undermine the claim of comprehensive coverage.
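The test described above can be sketched as a simple coverage check. This is a minimal illustration, not anything taken from the paper: the `Threat` enum mirrors the seven categories named in the abstract, while the `uncovered` helper and the example labels are hypothetical. Exact label matching stands in for what would in practice be a human judgment about whether a newly reported threat fits an existing category.

```python
from enum import Enum


class Threat(Enum):
    """The seven threat categories named in the survey's abstract."""
    ADVERSARIAL = "adversarial attack"
    DATA_POISONING = "data poisoning"
    BACKDOOR = "backdoor attack"
    JAILBREAK_PROMPT_INJECTION = "jailbreak and prompt injection"
    ENERGY_LATENCY = "energy-latency attack"
    EXTRACTION = "data and model extraction"
    AGENT_SPECIFIC = "agent-specific threat"


def uncovered(labels):
    """Return the normalized labels that match no taxonomy category.

    Hypothetical helper: matching is by exact (case-insensitive) label,
    which is an assumption made only to keep the sketch short.
    """
    known = {t.value for t in Threat}
    normalized = {label.strip().lower() for label in labels}
    return sorted(normalized - known)


# Illustrative labels only; the second one is deliberately made up.
reported = ["Backdoor attack", "some hypothetical new threat"]
print(uncovered(reported))
```

A non-empty result would only flag candidates for review; whether a candidate counts as a major, genuinely new category is the editorial question the survey's claim turns on.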
Original abstract
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic survey of safety issues in large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. It contributes (1) a taxonomy of safety threats including adversarial attacks, data poisoning, backdoor attacks, jailbreak/prompt injection, energy-latency attacks, data/model extraction, and agent-specific threats; (2) a review of corresponding defense strategies and commonly used datasets/benchmarks; and (3) discussion of open challenges such as comprehensive evaluations, scalable defenses, and sustainable data practices, while calling for community and international collaboration.
Significance. If the taxonomy and coverage prove representative, the survey would provide a structured reference that organizes a fast-moving literature and surfaces actionable open problems (e.g., scalable defenses and multi-agent threats). The explicit inclusion of agent-specific threats and the call for collective efforts are constructive organizational contributions typical of useful surveys in the field.
Major comments (1)
- [Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on methodological transparency. We address the concern below and will revise the manuscript to strengthen the presentation of our survey process.
Point-by-point responses
-
Referee: [Introduction / abstract, and any methods subsection] The central claim of a 'comprehensive taxonomy' and 'systematic review' is load-bearing for the paper's contribution, yet no literature-search protocol, database list, keyword set, inclusion/exclusion criteria, or temporal cutoff is stated. In a field that publishes hundreds of safety papers annually, the absence of such details prevents assessment of whether categories (e.g., multi-agent collusion or hardware-side extraction) are exhaustively covered or whether the taxonomy is representative.
Authors: We agree that an explicit description of the literature collection process would allow readers to better evaluate the scope and representativeness of the taxonomy. In the revised manuscript we will add a dedicated 'Survey Methodology' subsection (placed after the introduction) that specifies: (1) primary sources (arXiv, Google Scholar, proceedings of NeurIPS/ICLR/CVPR/ACL/EMNLP/USENIX Security, and selected workshops); (2) keyword combinations used for each model family and threat category; (3) inclusion criteria (peer-reviewed or preprint works from 2020 onward that directly address safety threats to the six model/agent types listed in the abstract); (4) exclusion criteria (purely theoretical works without empirical safety analysis, non-English papers, and duplicates); and (5) the effective cutoff (literature indexed through December 2024). We will also note that the taxonomy reflects the dominant threats discussed in the surveyed literature rather than claiming exhaustive coverage of every emerging sub-area (e.g., multi-agent collusion or hardware-side extraction), and we will explicitly flag these as open directions in the challenges section. This addition directly addresses the load-bearing claim without altering the core contributions.
Revision: yes
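The inclusion and exclusion criteria proposed in the rebuttal can be read as a screening filter. The sketch below is one possible encoding under stated assumptions: the `Paper` record, its field names, the family codes in `COVERED_FAMILIES`, and the `screen` function are all hypothetical, and title-based deduplication is a simplification. Only the thresholds (2020 onward, indexed through December 2024, English only, empirical safety analysis required, six model/agent types) come from the rebuttal text.

```python
from dataclasses import dataclass

# Short codes for the six model/agent types listed in the abstract
# (the codes themselves are an assumption of this sketch).
COVERED_FAMILIES = frozenset({"VFM", "LLM", "VLP", "VLM", "DM", "Agent"})


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one candidate paper."""
    title: str
    year: int
    language: str
    has_empirical_safety_analysis: bool
    model_families: frozenset


def screen(papers):
    """Apply the stated inclusion/exclusion criteria in order."""
    seen_titles = set()
    kept = []
    for paper in papers:
        key = paper.title.strip().lower()
        if key in seen_titles:
            continue  # exclusion: duplicates (matched by title here)
        seen_titles.add(key)
        if not 2020 <= paper.year <= 2024:
            continue  # inclusion: 2020 onward, cutoff December 2024
        if paper.language != "en":
            continue  # exclusion: non-English papers
        if not paper.has_empirical_safety_analysis:
            continue  # exclusion: purely theoretical works
        if not paper.model_families & COVERED_FAMILIES:
            continue  # inclusion: must address a covered model/agent type
        kept.append(paper)
    return kept


# Made-up candidates for illustration; not real papers.
candidates = [
    Paper("Example A", 2023, "en", True, frozenset({"LLM"})),
    Paper("example a", 2023, "en", True, frozenset({"LLM"})),
    Paper("Example B", 2019, "en", True, frozenset({"VLM"})),
    Paper("Example C", 2024, "en", False, frozenset({"DM"})),
]
print([p.title for p in screen(candidates)])
```

Writing the criteria down in this form makes the referee's point concrete: each rule is checkable, so a reader could audit which papers were in scope.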
Circularity Check
No circularity: literature survey with no derivations or self-referential claims
Full rationale
This is a survey paper whose central contributions are a taxonomy of safety threats drawn from existing literature, a review of defenses and benchmarks, and discussion of open challenges. No equations, fitted parameters, predictions, or derivation chains exist in the document. The taxonomy is presented as a synthesis of prior work rather than derived from any internal definition or self-citation that reduces to the paper's own inputs. The claim of comprehensiveness is an editorial judgment about coverage, not a mathematical or statistical reduction that can be shown equivalent to the inputs by construction. Therefore the paper is self-contained as a review and receives score 0 with no circular steps.
Forward citations
Cited by 9 Pith papers
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
-
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Safety, Security, and Cognitive Risks in World Models
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
-
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
-
AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.
-
LeakyCLIP: Extracting Training Data from CLIP
LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.
-
RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA w...
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.