hub Canonical reference

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song · 2024 · cs.CR · arXiv 2407.04295

Canonical reference. 89% of citing Pith papers cite this work as background.

41 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 41 citing papers arXiv PDF

abstract

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 other 1

citation-polarity summary

background 8 unclear 1

representative citing papers

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs

cs.CR · 2025-11-27 · conditional · novelty 8.0

CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

cs.LG · 2026-05-19 · conditional · novelty 7.0

CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

cs.CR · 2026-05-01 · unverdicted · novelty 7.0

SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

cs.CR · 2025-10-09 · unverdicted · novelty 7.0

CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

cs.CR · 2026-06-25 · unverdicted · novelty 6.0

A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

cs.LG · 2026-05-14 · conditional · novelty 6.0

On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

Training a General Purpose Automated Red Teaming Model

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.

How Adversarial Environments Mislead Agentic AI?

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

Towards Understanding the Robustness of Sparse Autoencoders

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

cs.CR · 2026-04-14 · unverdicted · novelty 6.0

TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

Compression as an Adversarial Amplifier Through Decision Space Reduction

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

cs.CR · 2026-03-23 · unverdicted · novelty 6.0

Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.

From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

cs.MA · 2026-03-04 · unverdicted · novelty 6.0

A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.

citing papers explorer

Showing 41 of 41 citing papers.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs cs.CR · 2026-05-30 · unverdicted · none · ref 40 · internal anchor
Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.
CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs cs.CR · 2025-11-27 · conditional · none · ref 32 · internal anchor
CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes cs.LG · 2026-05-19 · conditional · none · ref 2 · internal anchor
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 53 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking cs.CR · 2026-05-01 · unverdicted · none · ref 46 · internal anchor
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 79 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 25 · internal anchor
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models cs.CR · 2025-10-09 · unverdicted · none · ref 44 · internal anchor
CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.
Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models cs.CR · 2026-06-25 · unverdicted · none · ref 76 · internal anchor
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection cs.AI · 2026-05-22 · unverdicted · none · ref 20 · internal anchor
MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 69 · internal anchor
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 34 · internal anchor
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation cs.LG · 2026-05-14 · conditional · none · ref 3 · internal anchor
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks cs.CR · 2026-05-10 · unverdicted · none · ref 41 · internal anchor
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 35 · internal anchor
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
Training a General Purpose Automated Red Teaming Model cs.CR · 2026-04-24 · unverdicted · none · ref 4 · internal anchor
A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 28 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Towards Understanding the Robustness of Sparse Autoencoders cs.LG · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
LLM Safety From Within: Detecting Harmful Content with Internal Representations cs.AI · 2026-04-20 · unverdicted · none · ref 14 · internal anchor
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs cs.CR · 2026-04-14 · unverdicted · none · ref 58 · internal anchor
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Compression as an Adversarial Amplifier Through Decision Space Reduction cs.CV · 2026-04-08 · unverdicted · none · ref 48 · internal anchor
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models cs.CR · 2026-03-23 · unverdicted · none · ref 9 · internal anchor
Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration cs.MA · 2026-03-04 · unverdicted · none · ref 54 · internal anchor
A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal cs.SE · 2025-08-15 · unverdicted · none · ref 12 · internal anchor
ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem cs.CR · 2025-06-17 · unverdicted · none · ref 1 · internal anchor
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models cs.CR · 2026-05-26 · unverdicted · none · ref 43 · internal anchor
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.
Multilingual jailbreaking of LLMs using low-resource languages cs.CL · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.
Insider Attacks in Multi-Agent LLM Consensus Systems cs.MA · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 90 · internal anchor
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
AgentDID: Trustless Identity Authentication for AI Agents cs.CR · 2026-04-28 · unverdicted · none · ref 12 · internal anchor
AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models cs.CL · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.
Jailbreaking Large Language Models with Morality Attacks cs.CL · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs cs.CR · 2025-11-04 · unverdicted · none · ref 53 · internal anchor
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization cs.LG · 2026-06-24 · unverdicted · none · ref 31 · internal anchor
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference cs.CR · 2026-04-14 · unverdicted · none · ref 16 · internal anchor
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
AI Security Research Should Better Incentivize Defense Research cs.CR · 2026-05-22 · unverdicted · none · ref 42 · internal anchor
AI security research shows biased attack-to-defense ratios, with attacks evaluated under favorable conditions and defenses under stricter standards, resulting in a call to better incentivize defense work.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 171 · internal anchor
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures cs.CL · 2026-05-15 · unreviewed · ref 6 · internal anchor
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unreviewed · ref 16 · internal anchor
SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents cs.CR · 2026-02-15 · unreviewed · ref 24 · internal anchor

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer