PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
Toxicity in chatgpt: Analyzing persona-assigned language models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
citing papers explorer
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
- Benchmarking LLMs for Community Governance Simulation with Life-history Narratives