hub

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao · 2023 · cs.CL · arXiv 2310.04451

33 Pith papers cite this work. Polarity classification is still indexing.

33 Pith papers citing it

open full Pith review browse 33 citing papers arXiv PDF

abstract

The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

claims ledger

abstract The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of

co-cited works

representative citing papers

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

cs.CR · 2026-05-09 · accept · novelty 7.0

Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

cs.CR · 2026-05-06 · conditional · novelty 7.0

Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Jailbroken Frontier Models Retain Their Capabilities

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

cs.CR · 2026-04-08 · unverdicted · novelty 7.0

SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.

Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

cs.AI · 2026-04-02 · unverdicted · novelty 7.0

PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

cs.CR · 2026-04-01 · unverdicted · novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Leveraging RAG for Training-Free Alignment of LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

cs.AI · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection rates from over 95% to under 5.47% while preserving benign performance.

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

cs.CR · 2026-05-02 · unverdicted · novelty 6.0

Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.

A Theoretical Game of Attacks via Compositional Skills

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stronger empirical performance.

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

citing papers explorer

Showing 8 of 8 citing papers after filters.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 34 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Jailbroken Frontier Models Retain Their Capabilities cs.LG · 2026-04-30 · unverdicted · none · ref 24 · internal anchor
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 153 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Leveraging RAG for Training-Free Alignment of LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 51 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks cs.LG · 2023-10-05 · accept · none · ref 19 · internal anchor
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs cs.LG · 2026-04-17 · unverdicted · none · ref 33 · internal anchor
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer