hub

Jailbreak attacks and defenses against large language models: A survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li · 2024 · arXiv 2407.04295

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

cs.CR · 2026-05-01 · unverdicted · novelty 7.0

SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

Training a General Purpose Automated Red Teaming Model

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.

How Adversarial Environments Mislead Agentic AI?

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

Towards Understanding the Robustness of Sparse Autoencoders

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

cs.CR · 2026-04-14 · unverdicted · novelty 6.0

TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

Compression as an Adversarial Amplifier Through Decision Space Reduction

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.

Insider Attacks in Multi-Agent LLM Consensus Systems

cs.MA · 2026-05-08 · unverdicted · novelty 5.0

A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

AgentDID: Trustless Identity Authentication for AI Agents

cs.CR · 2026-04-28 · unverdicted · novelty 5.0

AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.

Jailbreaking Large Language Models with Morality Attacks

cs.CL · 2026-04-18 · unverdicted · novelty 5.0

Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

cs.CR · 2026-04-14 · unverdicted · novelty 4.0

A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

citing papers explorer

Showing 10 of 10 citing papers after filters.

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking cs.CR · 2026-05-01 · unverdicted · none · ref 46
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 25
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks cs.CR · 2026-05-10 · unverdicted · none · ref 41
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 35
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
Training a General Purpose Automated Red Teaming Model cs.CR · 2026-04-24 · unverdicted · none · ref 4
A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs cs.CR · 2026-04-14 · unverdicted · none · ref 58
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 2
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 90
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
AgentDID: Trustless Identity Authentication for AI Agents cs.CR · 2026-04-28 · unverdicted · none · ref 12
AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference cs.CR · 2026-04-14 · unverdicted · none · ref 16
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Jailbreak attacks and defenses against large language models: A survey

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer