Derail yourself: Multi-turn LLM jailbreak attack through self-discovered clues

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

years

2026: 7 · 2025: 2

representative citing papers

Activation-Guided Local Editing for Jailbreaking Attacks

cs.CR · 2025-08-01 · unverdicted · novelty 5.0

AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.

citing papers explorer

Showing 9 of 9 citing papers.