A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
Hierarchical attention model for multi-turn jailbreak detection reports 0.9394 F1 on 14,038 conversations, beating Claude Opus 4.7 by 0.07 F1 while halving false positives.
AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.
citing papers explorer
-
Jailbreaking Frontier Foundation Models Through Intention Deception
A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
-
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
-
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
-
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
-
Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations
Hierarchical attention model for multi-turn jailbreak detection reports 0.9394 F1 on 14,038 conversations, beating Claude Opus 4.7 by 0.07 F1 while halving false positives.
-
Activation-Guided Local Editing for Jailbreaking Attacks
AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.
- SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits