hub Canonical reference

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao · 2023 · cs.CL · arXiv 2310.04451

Canonical reference. 71% of citing Pith papers cite this work as background.

77 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 77 citing papers arXiv PDF

abstract

The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 2 baseline 1

citation-polarity summary

background 15 support 2 use method 2 baseline 1 unclear 1

claims ledger

abstract The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of

co-cited works

representative citing papers

Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

cs.CR · 2026-06-28 · unverdicted · novelty 7.0

Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

cs.CR · 2026-05-09 · accept · novelty 7.0

Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

cs.CR · 2026-05-06 · conditional · novelty 7.0

TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

cs.CR · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Jailbroken Frontier Models Retain Their Capabilities

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

cs.AI · 2026-04-02 · unverdicted · novelty 7.0

PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

cs.CR · 2024-04-02 · conditional · novelty 7.0

Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

cs.CR · 2026-06-25 · unverdicted · novelty 6.0

A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

cs.CR · 2026-06-21 · unverdicted · novelty 6.0

Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.

PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

cs.CR · 2026-05-30 · conditional · novelty 6.0

Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.

TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

TRACE is a task-aware adaptive self-evolving jailbreaking framework that achieves up to 100% bypass rates on LLM agents via subtask decomposition and scenario evolution.

citing papers explorer

Showing 50 of 77 citing papers.

Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense cs.CR · 2026-06-28 · unverdicted · none · ref 7 · internal anchor
Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 19 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems cs.CR · 2026-05-12 · unverdicted · none · ref 35 · internal anchor
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success cs.CR · 2026-05-09 · accept · none · ref 4 · internal anchor
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 39 · 2 links · internal anchor
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization cs.CR · 2026-05-06 · conditional · none · ref 18 · internal anchor
TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming cs.CL · 2026-05-04 · unverdicted · none · ref 13 · internal anchor
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection cs.CR · 2026-05-02 · unverdicted · none · ref 3 · 2 links · internal anchor
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 34 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Jailbroken Frontier Models Retain Their Capabilities cs.LG · 2026-04-30 · unverdicted · none · ref 24 · internal anchor
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 59 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 12 · internal anchor
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming cs.CL · 2026-04-21 · unverdicted · none · ref 61 · internal anchor
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 11 · internal anchor
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements cs.AI · 2026-04-02 · unverdicted · none · ref 27 · internal anchor
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 153 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack cs.CR · 2024-04-02 · conditional · none · ref 21 · internal anchor
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
Addressing Over-Refusal in LLMs with Competing Rewards cs.LG · 2026-06-30 · unverdicted · none · ref 32 · internal anchor
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models cs.CR · 2026-06-25 · unverdicted · none · ref 38 · internal anchor
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions cs.AI · 2026-06-24 · unverdicted · none · ref 19 · internal anchor
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs cs.CR · 2026-06-21 · unverdicted · none · ref 28 · internal anchor
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations cs.CR · 2026-06-02 · unverdicted · none · ref 13 · internal anchor
PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety cs.CR · 2026-05-30 · conditional · none · ref 4 · internal anchor
Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.
TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking cs.CR · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
TRACE is a task-aware adaptive self-evolving jailbreaking framework that achieves up to 100% bypass rates on LLM agents via subtask decomposition and scenario evolution.
Automatically Attacking Software Reverse Engineering AI Agents cs.CR · 2026-05-28 · unverdicted · none · ref 10 · internal anchor
Genetic algorithm prompt generation enables prompt injection into binaries via string assignments to fool LLM-powered decompilers and disassemblers.
Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection cs.LG · 2026-05-27 · unverdicted · none · ref 26 · internal anchor
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry cs.CR · 2026-05-24 · unverdicted · none · ref 20 · internal anchor
RouteScan identifies malicious prompts in MoE LLMs using GPU expert routing telemetry as a privacy-preserving fingerprint, achieving AUROC above 0.93 on unseen harmful domains.
Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling cs.CR · 2026-05-23 · unverdicted · none · ref 33 · internal anchor
Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation cs.CR · 2026-05-23 · unverdicted · none · ref 1 · internal anchor
A bi-level adversarial training framework simulates diverse jailbroken activations through unsupervised extrapolation and trains a potential-induced steering field to achieve zero-shot defense with attack success rates mostly below 5% across tested LLMs and jailbreak families.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling cs.CR · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.
CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving cs.CV · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures cs.CL · 2026-05-15 · unverdicted · none · ref 29 · internal anchor
PQR framework generates diverse realistic queries to elicit QA agent failures, uncovering 23-78% more unhelpful responses than prior methods in e-commerce agent tests.
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models cs.CR · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
Leveraging RAG for Training-Free Alignment of LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks cs.CR · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 8 · 2 links · internal anchor
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection cs.CR · 2026-05-05 · unverdicted · none · ref 119 · internal anchor
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment cs.AI · 2026-05-03 · unverdicted · none · ref 16 · internal anchor
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 23 · 2 links · internal anchor
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
A Theoretical Game of Attacks via Compositional Skills cs.CL · 2026-05-01 · unverdicted · none · ref 10 · internal anchor
A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stronger empirical performance.
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption cs.CR · 2026-04-30 · unverdicted · none · ref 26 · internal anchor
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking cs.CV · 2026-04-11 · unverdicted · none · ref 5 · internal anchor
Attention-Guided Visual Jailbreaking blinds LVLMs to safety instructions by suppressing attention to alignment prefixes and anchoring generation on adversarial image features, reaching 94.4% attack success rate on Qwen-VL.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 51 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense cs.CR · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks cs.CR · 2026-04-05 · unverdicted · none · ref 15 · internal anchor
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models cs.RO · 2026-03-26 · unverdicted · none · ref 18 · internal anchor
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
BarrierSteer: LLM Safety via Learning Barrier Steering cs.LG · 2026-02-23 · unverdicted · none · ref 13 · internal anchor
BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.
Diversifying Toxicity Search in Large Language Models Through Speciation cs.NE · 2026-01-28 · unverdicted · none · ref 16 · internal anchor
ToxSearch-S applies unsupervised speciation to evolutionary prompt search, maintaining capacity-limited species with exemplar leaders and species-aware selection to achieve higher peak toxicity and broader semantic coverage than standard methods.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer