XL-SafetyBench is a new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pattern indicating generation failure rather than alignment.
A strongreject for empty jailbreaks
2 Pith papers cite this work. Polarity classification is still indexing.
years
2026 2verdicts
UNVERDICTED 2representative citing papers
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
citing papers explorer
-
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
XL-SafetyBench is a new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pattern indicating generation failure rather than alignment.
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.