Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues

· 2024 · arXiv 2410.10700

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

Jailbreaking Frontier Foundation Models Through Intention Deception

cs.CR · 2026-04-27 · unverdicted · novelty 7.0

A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

cs.CL · 2025-11-16 · unverdicted · novelty 6.0

EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.

Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

cs.CL · 2026-06-19 · unverdicted · novelty 5.0

Hierarchical attention model for multi-turn jailbreak detection reports 0.9394 F1 on 14,038 conversations, beating Claude Opus 4.7 by 0.07 F1 while halving false positives.

Activation-Guided Local Editing for Jailbreaking Attacks

cs.CR · 2025-08-01 · unverdicted · novelty 5.0

AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

cs.CR · 2026-04-01

citing papers explorer

Showing 9 of 9 citing papers.

Jailbreaking Frontier Foundation Models Through Intention Deception cs.CR · 2026-04-27 · unverdicted · none · ref 15
A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models cs.AI · 2026-06-05 · unverdicted · none · ref 13
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment cs.AI · 2026-06-01 · unverdicted · none · ref 52
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue cs.CL · 2026-05-07 · unverdicted · none · ref 24 · 2 links
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense cs.CR · 2026-04-09 · unverdicted · none · ref 26
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs cs.CL · 2025-11-16 · unverdicted · none · ref 40
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations cs.CL · 2026-06-19 · unverdicted · none · ref 39
Hierarchical attention model for multi-turn jailbreak detection reports 0.9394 F1 on 14,038 conversations, beating Claude Opus 4.7 by 0.07 F1 while halving false positives.
Activation-Guided Local Editing for Jailbreaking Attacks cs.CR · 2025-08-01 · unverdicted · none · ref 5
AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.
SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits cs.CR · 2026-04-01 · unreviewed · ref 18

Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer