pith. machine review for the scientific record. sign in

arxiv: 2307.15043 · v2 · submitted 2023-07-27 · 💻 cs.CL · cs.AI· cs.CR· cs.LG

Recognition: unknown

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, J. Zico Kolter, Matt Fredrikson, Milad Nasr, Nicholas Carlini, Zifan Wang

classification 💻 cs.CL cs.AIcs.CRcs.LG
keywords modelsobjectionableadversarialcontentlanguagealignedapproachattack
0
0 comments X
read the original abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

    cs.LG 2026-05 unverdicted novelty 8.0

    A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

  2. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  3. Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

    cs.CR 2026-05 unverdicted novelty 8.0

    JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

  4. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

    cs.CR 2026-05 conditional novelty 8.0

    LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...

  5. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  6. Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

    cs.CR 2026-04 conditional novelty 8.0

    Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

  7. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  8. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  9. Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

    cs.CR 2026-04 unverdicted novelty 8.0

    Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

  10. The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

    cs.CR 2026-04 unverdicted novelty 8.0 full

    No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

  11. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    cs.CR 2026-04 unverdicted novelty 8.0

    DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

  12. Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

    cs.CR 2026-04 accept novelty 8.0

    Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

  13. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  14. LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

    cs.CL 2026-05 conditional novelty 7.0

    LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.

  15. DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-depend...

  16. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

    cs.AI 2026-05 conditional novelty 7.0

    BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

  17. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  18. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  19. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  20. BadDLM: Backdooring Diffusion Language Models with Diverse Targets

    cs.CR 2026-05 unverdicted novelty 7.0

    BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

  21. Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

    cs.CR 2026-05 accept novelty 7.0

    Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.

  22. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  23. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

    cs.AI 2026-05 unverdicted novelty 7.0

    LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

  24. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

  25. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

  26. Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

    cs.CR 2026-05 conditional novelty 7.0

    A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.

  27. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  28. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  29. Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMs suppress factual corrections in task contexts despite internal knowledge of errors, with two training-free interventions shown to increase correction rates substantially.

  30. Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses

    cs.CR 2026-05 unverdicted novelty 7.0

    SADBench is a new benchmark that systematically tests steganography attacks with harmful image and text payloads against steganalysis defenses, revealing stable attack methods, near-perfect in-domain detection, transf...

  31. On the Hardness of Junking LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

  32. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  33. Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

  34. Self-Mined Hardness for Safety Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 7.0

    Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...

  35. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  36. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  37. SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

    cs.CR 2026-05 unverdicted novelty 7.0

    SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...

  38. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  39. MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

    cs.CR 2026-04 unverdicted novelty 7.0

    MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...

  40. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

    cs.AI 2026-04 unverdicted novelty 7.0

    PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...

  41. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  42. AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.

  43. Jailbreaking Frontier Foundation Models Through Intention Deception

    cs.CR 2026-04 unverdicted novelty 7.0

    A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.

  44. Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

    cs.AI 2026-04 unverdicted novelty 7.0

    A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and li...

  45. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  46. STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

    cs.CL 2026-04 unverdicted novelty 7.0

    STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...

  47. Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...

  48. Duality for the Adversarial Total Variation

    math.AP 2026-04 unverdicted novelty 7.0

    Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.

  49. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  50. SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...

  51. GuardPhish: Securing Open-Source LLMs from Phishing Abuse

    cs.CR 2026-04 unverdicted novelty 7.0

    Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

  52. HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking

    cs.CR 2026-04 unverdicted novelty 7.0

    HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.

  53. When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

    cs.CL 2026-04 unverdicted novelty 7.0

    Forced-choice MCQs with only unsafe options bypass LLM safety refusals that work on equivalent open-ended prompts, with violation rates rising sharply under intermediate constraints and near saturation for model-gener...

  54. Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

    cs.CR 2026-04 unverdicted novelty 7.0

    R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.

  55. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  56. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  57. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  58. Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

  59. SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

    cs.CR 2026-04 unverdicted novelty 7.0

    SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.

  60. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.