citation dossier
Jailbreak attacks and defenses against large language models: A survey
why this work matters in Pith
Pith has found this work cited in 19 reviewed papers. Its strongest current cluster is cs.CR (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
years
2026: 19 papers
representative citing papers
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.
Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter; a minimal sketch of this sampling check follows this list.
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
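The multi-generation sampling result above (An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models) reduces to a simple check: draw several completions per prompt and count the prompt as eliciting a jailbreak if any completion is judged harmful. The sketch below is a minimal, hypothetical illustration of that idea only, not the paper's code; generate and is_harmful are assumed stand-ins for an LLM sampling call and a safety judge.

```python
# Minimal, hypothetical sketch of multi-generation jailbreak detection.
# `generate` and `is_harmful` are stand-ins, not any specific model or judge API.
import random

def generate(prompt: str) -> str:
    # Stand-in for sampling one completion from an LLM at nonzero temperature.
    return random.choice(["refusal", "harmful completion", "benign answer"])

def is_harmful(completion: str) -> bool:
    # Stand-in for a safety judge or guard model scoring the completion.
    return completion == "harmful completion"

def elicits_jailbreak(prompt: str, k: int = 8) -> bool:
    # A prompt counts as eliciting a jailbreak if ANY of k sampled completions
    # is judged harmful; k = 1 reduces to the usual single-generation check.
    return any(is_harmful(generate(prompt)) for _ in range(k))

if __name__ == "__main__":
    random.seed(0)
    prompt = "example adversarial prompt"
    print("single generation:", elicits_jailbreak(prompt, k=1))
    print("eight generations:", elicits_jailbreak(prompt, k=8))
```

Raising k from one to a moderate value is where the paper reports the largest gains, with diminishing returns beyond that.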
citing papers explorer
- OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
- SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
- Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
- MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Training a General Purpose Automated Red Teaming Model
A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.
- How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
- Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
- LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters; a minimal probe-and-weight sketch follows at the end of this page.
- TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
- The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
- Compression as an Adversarial Amplifier Through Decision Space Reduction
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.
- Insider Attacks in Multi-Agent LLM Consensus Systems
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
- SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
- AgentDID: Trustless Identity Authentication for AI Agents
AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.
- An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.
- Jailbreaking Large Language Models with Morality Attacks
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
- Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
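The SIREN entry above describes harm detection from internal representations: linear probes over per-layer activations whose scores are then combined with adaptive weights. The sketch below is an illustrative reconstruction under that one-sentence summary, not SIREN's implementation; the activations are random stubs, and accuracy-derived weights stand in for the paper's adaptive weighting.

```python
# Illustrative reconstruction only: per-layer linear probes on internal
# activations, combined with accuracy-derived weights as a stand-in for
# adaptive weighting. Activations are random stubs so the sketch runs
# without loading any model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_layers, n_prompts, hidden_dim = 4, 200, 64

# Stand-in activations, shape (n_layers, n_prompts, hidden_dim); in practice
# these would be the LLM's internal layer activations for each prompt.
acts = rng.normal(size=(n_layers, n_prompts, hidden_dim))
labels = rng.integers(0, 2, size=n_prompts)   # 1 = harmful prompt, 0 = benign
acts[:, labels == 1, :] += 0.5                # synthetic separation for the demo

probes, accs = [], []
for layer in range(n_layers):
    probe = LogisticRegression(max_iter=1000).fit(acts[layer], labels)
    probes.append(probe)
    accs.append(probe.score(acts[layer], labels))   # per-layer probe accuracy

# Normalize per-layer accuracies into weights and average the probe
# probabilities into one harm score per prompt.
weights = np.array(accs) / np.sum(accs)
probs = np.stack([p.predict_proba(acts[i])[:, 1] for i, p in enumerate(probes)])
harm_score = probs.T @ weights                 # shape (n_prompts,)
print("prompts flagged as harmful:", int((harm_score > 0.5).sum()), "of", n_prompts)
```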