Self-RedTeam: Online self-play reinforcement learning for safer LLMs
3 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers (3, all from 2026, currently unverdicted):
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x gains in parameter efficiency and stronger safety on Qwen2.5 models.
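A minimal sketch of what a bipolicy setup on a shared frozen base could look like, using Hugging Face PEFT's multi-adapter support. The adapter names, rank, and target modules below are assumptions for illustration, not details taken from the paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
for p in base.parameters():
    p.requires_grad = False  # the shared base stays frozen

# illustrative LoRA hyperparameters, not the paper's
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# one low-rank adapter per role on top of the same frozen weights
model = get_peft_model(base, lora_cfg, adapter_name="attacker")
model.add_adapter("defender", lora_cfg)

# during self-play, switch roles by activating the matching adapter;
# only the active adapter's low-rank weights receive gradients
model.set_adapter("attacker")   # e.g. generate adversarial prompts
model.set_adapter("defender")   # e.g. generate safe responses
```

Since only the two low-rank adapters train while the Qwen2.5 base stays frozen, the trainable-parameter count is a small fraction of full fine-tuning, which is presumably the source of the parameter-efficiency claim.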
-
Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents
ClawdGo uses a self-play training loop with weakest-first scheduling and cross-session memory to raise AI agents' security awareness scores from 80.9 to 96.9 across 12 taxonomy dimensions.
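The weakest-first idea reduces to a simple selection rule: each round, train on whichever taxonomy dimension currently scores lowest. A minimal sketch, with hypothetical dimension names and a stand-in `train_round` callback, since the poster abstract states the idea but not the implementation:

```python
def weakest_first(scores: dict[str, float], rounds: int, train_round):
    """Repeatedly target the lowest-scoring taxonomy dimension.

    `train_round(dim)` stands in for one self-play training round on
    `dim` and is assumed to return the re-measured score afterward.
    """
    for _ in range(rounds):
        dim = min(scores, key=scores.get)  # weakest dimension first
        scores[dim] = train_round(dim)
    return scores

# hypothetical dimensions and starting scores, not from the poster
scores = {"prompt_injection": 80.9, "tool_misuse": 85.1, "data_leakage": 92.3}
```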
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL), which structurally decouples safety from persona context using a unilateral KL-divergence constraint.
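A sketch of what a unilateral KL term could look like, assuming "unilateral" means the divergence penalty is applied only to the defense policy against a safety-anchored reference, with no symmetric term on the attack side; the function and variable names are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def unilateral_kl(defense_logits: torch.Tensor,
                  ref_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KL(defense || reference), averaged over the batch.

    Only the defense policy is regularized toward the reference;
    the attack policy carries no KL term (the "unilateral" part).
    """
    logp = F.log_softmax(defense_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).detach()  # frozen reference
    return (logp.exp() * (logp - ref_logp)).sum(dim=-1).mean()

# hypothetical total defense objective:
#   defense_loss = task_loss + beta * unilateral_kl(defense_logits, ref_logits)
```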