Chasing moving targets with online self-play reinforcement learning for safer language models

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques · 2025 · arXiv 2506.07468

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 2

citation-polarity summary

baseline 2

representative citing papers

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

cs.CR · 2026-04-27 · unverdicted · novelty 7.0

ClawdGo uses a self-play training loop with weakest-first scheduling and cross-session memory to raise AI agents' security awareness scores from 80.9 to 96.9 across 12 taxonomy dimensions.

Learning in Structured Stackelberg Games

cs.GT · 2025-04-11 · unverdicted · novelty 7.0

Introduces structured Stackelberg games and the Stackelberg-Littlestone dimension to characterize the leader's optimal regret and sample complexity when context predicts follower type.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

Equation-to-Behavior Prompting lets large LLMs match cognitive models like Bayesian updating in persuasion games; RL training cuts small-model belief error by 26.5% and improves diverse training outcomes by 2.5-12%.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

citing papers explorer

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

cs.CR · 2026-04-27 · unverdicted · none · ref 11

ClawdGo uses a self-play training loop with weakest-first scheduling and cross-session memory to raise AI agents' security awareness scores from 80.9 to 96.9 across 12 taxonomy dimensions.
