The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 00:48 UTC · model grok-4.3
The pith
Self-play safety training collapses to self-consistency when attacker and defender share and update the same base model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. Anchored Bipolicy Self-Play trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. Relative to full fine-tuning this yields up to 100x greater parameter efficiency, and relative to standard self-play it yields consistent safety improvements without loss of reasoning ability.
What carries the argument
Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on a frozen base model to enforce role separation and prevent self-consistency collapse.
Load-bearing premise
Distinct role-specific LoRA adapters on a frozen base model will maintain stable optimisation and genuine adversarial pressure without the roles collapsing back into self-consistency during training.
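A minimal sketch of what this premise looks like mechanically, using a toy linear layer in PyTorch: one frozen base weight plus an independent low-rank (LoRA-style) delta per role. The class name, role labels, rank, and scaling are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class AnchoredBipolicyLinear(nn.Module):
    """One frozen base layer shared by both roles; each role owns its own LoRA delta."""
    def __init__(self, d_in, d_out, rank=8, alpha=16.0, roles=("attacker", "defender")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # the frozen anchor
        self.base.bias.requires_grad_(False)
        self.scale = alpha / rank
        # Independent low-rank factors (A, B) per role; B starts at zero so each
        # role initially reproduces the base model exactly.
        self.lora_A = nn.ParameterDict(
            {r: nn.Parameter(torch.randn(rank, d_in) * 0.01) for r in roles})
        self.lora_B = nn.ParameterDict(
            {r: nn.Parameter(torch.zeros(d_out, rank)) for r in roles})

    def forward(self, x, role):
        delta = (x @ self.lora_A[role].T) @ self.lora_B[role].T
        return self.base(x) + self.scale * delta

layer = AnchoredBipolicyLinear(16, 16)
x = torch.randn(4, 16)
y_att = layer(x, role="attacker")
y_def = layer(x, role="defender")

Because a gradient step for one role touches only that role's (A, B) pair, optimising the attacker cannot drag the defender's parameters toward it, which is the structural separation the premise relies on.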
What would settle it
A training run of the anchored bipolicy method in which the attacker's jailbreak success rate against the defender falls to baseline levels while safety benchmark scores show no gain over standard self-play.
Abstract
Self-play red-teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by using the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of reachable Nash equilibria corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. Relative to standard self-play, we show up to 100x greater parameter efficiency than fine-tuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.
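For readers who want the game written out, here is a hedged formalisation of the zero-sum setup the abstract describes, with notation chosen to match the r(y_A, y_ref) convention in the appendix excerpt below; the paper's exact objective may differ:

\max_{\pi_A} \; \min_{\pi_D} \; \mathbb{E}_{y_A \sim \pi_A,\; y_D \sim \pi_D(\cdot \mid y_A)} \big[ r(y_A, y_D) \big]

where r(y_A, y_D) > 0 exactly when the defender's response y_D to attack y_A constitutes a jailbreak. At a Nash equilibrium neither role can improve by deviating unilaterally, which is the sense in which the defender is "guaranteed to respond safely within the settings of the game".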
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard self-play red-teaming collapses to self-consistency when attacker and defender share and update the same base model, so that attacks cease to exert adversarial pressure. It proposes Anchored Bipolicy Self-Play, which freezes the base model and trains distinct role-specific LoRA adapters to preserve separation and zero-sum dynamics. Experiments on Qwen2.5-3B/7B/14B-IT models report up to 100x parameter efficiency over full fine-tuning, consistent safety gains on standard benchmarks, and superior cross-play performance versus self-play baselines.
Significance. If the LoRA separation demonstrably maintains genuine adversarial pressure rather than collapsing, the method offers a practical route to more robust safety training with minimal parameter overhead and no loss of reasoning capability. The cross-play results and efficiency claims, if substantiated, would strengthen self-play as a scalable safety technique.
major comments (3)
- [Experiments] Experiments section: The manuscript reports safety improvements and cross-play superiority but provides no direct measurements (policy divergence, attack-success trajectories over epochs, or gradient orthogonality between the two LoRAs) to confirm that the attacker and defender adapters remain functionally distinct and that the zero-sum objective continues to drive discovery of vulnerabilities rather than eroding into self-consistency.
- [Evaluation] Evaluation on Qwen2.5 models: Safety gains are stated as 'consistent' across benchmarks, yet no statistical significance tests, number of independent runs, or controls for whether self-consistency was actually broken (e.g., defender refusal rates under the learned attacker) are reported; this leaves open whether gains arise from role separation or simply from added capacity/regularization.
- [Introduction/Theory] Theoretical diagnosis of collapse: The claim that shared base-model updates force collapse to self-consistency is presented as a fundamental limitation, but the manuscript supplies neither a formal derivation nor an empirical diagnostic (e.g., cosine similarity of role-specific updates) showing why this occurs specifically under the zero-sum objective.
minor comments (2)
- [Method] Notation for the bipolicy objective and LoRA anchoring is introduced without an explicit equation or diagram clarifying how the frozen base interacts with the two adapters during the zero-sum update (an illustrative equation follows this list).
- [Experiments] The abstract states 'up to 100x greater parameter efficiency' but the main text does not tabulate exact parameter counts or compare against the precise self-play baseline configuration.
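For concreteness, the equation the referee asks for would plausibly be the standard LoRA decomposition applied per role; the following is a hedged guess at that formulation, not an equation taken from the paper:

W_{\text{role}} = W_0 + \frac{\alpha}{r} \, B_{\text{role}} A_{\text{role}}, \qquad \text{role} \in \{\text{attacker}, \text{defender}\}, \quad W_0 \text{ frozen},

where only the low-rank factors A_role ∈ R^{r×d_in} and B_role ∈ R^{d_out×r} receive gradients from the zero-sum update, so the two roles share the anchor W_0 but no trainable parameters.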
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening empirical validation of the self-consistency claims and the separation benefits of Anchored Bipolicy Self-Play. We address each major comment below and outline revisions to improve the manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section: The manuscript reports safety improvements and cross-play superiority but provides no direct measurements (policy divergence, attack-success trajectories over epochs, or gradient orthogonality between the two LoRAs) to confirm that the attacker and defender adapters remain functionally distinct and that the zero-sum objective continues to drive discovery of vulnerabilities rather than eroding into self-consistency.
Authors: We agree that direct measurements would better substantiate that role separation preserves adversarial pressure. In the revised manuscript, we will add attack-success trajectories over training epochs, policy divergence metrics (KL divergence between attacker and defender outputs on held-out prompts), and cosine similarity of gradients between the two LoRAs to demonstrate they remain functionally distinct and that the zero-sum objective continues to drive vulnerability discovery (a sketch of these diagnostics appears after the responses). revision: yes
- Referee: [Evaluation] Evaluation on Qwen2.5 models: Safety gains are stated as 'consistent' across benchmarks, yet no statistical significance tests, number of independent runs, or controls for whether self-consistency was actually broken (e.g., defender refusal rates under the learned attacker) are reported; this leaves open whether gains arise from role separation or simply from added capacity/regularization.
Authors: We acknowledge that the current evaluation lacks statistical rigor and explicit controls. We will revise the section to report results over 5 independent runs with standard deviations, include paired t-tests for significance on safety benchmarks, and add controls measuring defender refusal rates against the learned attacker (versus baseline attackers) to confirm self-consistency is broken and isolate the contribution of role separation from capacity or regularization effects. revision: yes
- Referee: [Introduction/Theory] Theoretical diagnosis of collapse: The claim that shared base-model updates force collapse to self-consistency is presented as a fundamental limitation, but the manuscript supplies neither a formal derivation nor an empirical diagnostic (e.g., cosine similarity of role-specific updates) showing why this occurs specifically under the zero-sum objective.
Authors: The manuscript supports the diagnosis via analysis of reachable Nash equilibria under parameter sharing together with observed training dynamics. We will add an empirical diagnostic (cosine similarity of role-specific updates) in the shared base-model case to illustrate the collapse. A complete formal derivation of the dynamics, however, is beyond the current scope and would require substantial additional theoretical work. revision: partial
- Left open: a formal derivation of the collapse to self-consistency when attacker and defender share and update the same base model under the zero-sum objective.
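A minimal sketch of the separation diagnostics promised in the responses above, assuming Hugging Face-style causal language models for the two roles. Here attacker, defender, and input_ids are placeholders for the trained role policies and a held-out prompt batch; none of this is the paper's published code.

import torch
import torch.nn.functional as F

def policy_kl(attacker, defender, input_ids):
    """Mean KL(attacker || defender) over next-token distributions on a prompt batch."""
    with torch.no_grad():
        logp_att = F.log_softmax(attacker(input_ids).logits, dim=-1)
        logp_def = F.log_softmax(defender(input_ids).logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so this is the divergence of the defender from the attacker per token.
    return F.kl_div(logp_def, logp_att, log_target=True, reduction="batchmean")

def grad_cosine(attacker_lora_params, defender_lora_params):
    """Cosine similarity between the two adapters' flattened gradient vectors
    (shapes must match, which holds when both LoRAs use the same rank)."""
    ga = torch.cat([p.grad.flatten() for p in attacker_lora_params if p.grad is not None])
    gb = torch.cat([p.grad.flatten() for p in defender_lora_params if p.grad is not None])
    return F.cosine_similarity(ga, gb, dim=0)

Tracked over training epochs alongside attack-success rate, a KL that stays bounded away from zero and a gradient cosine well below one would be direct evidence that the roles remain functionally distinct.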
Circularity Check
No circularity: theoretical collapse claim and LoRA remedy rest on independent analysis and external benchmarks
Rationale
The paper derives the self-consistency collapse from the shared-parameter update rule in the zero-sum game (distinct from the target safety metric), then introduces role-specific LoRAs on a frozen base as an architectural fix. Success is measured on external safety benchmarks and cross-play experiments rather than by re-fitting the same quantities used to define the collapse. No self-citation chain, fitted-input-as-prediction, or self-definitional reduction appears in the derivation; the central claim remains falsifiable against held-out data and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA adapter rank and alpha
axioms (1)
- [domain assumption] Nash equilibrium in the zero-sum attacker-defender game corresponds to safe behavior within the game settings
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · match unclear · "Anchored Bipolicy Self-Play ... trains distinct role-specific LoRA adapters on top of a frozen base model"
Reference graph
Works this paper leans on
- [1]
- [2] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...
- [4] B. Chen, T. Zhu, J. Han, L. Li, G. Li, and X. Dai. Incentivizing truthful language models via peer elicitation games. In Advances in Neural Information Processing Systems, 2025.
- [5] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936. Associati...
- [6] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- [7] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [8] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
- [9] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024.
- [10] C. S. de Witt. Open challenges in multi-agent security: Towards secure systems of interacting AI agents. arXiv preprint arXiv:2505.02077, 2025.
- [11] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115, 2023.
- [12] S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.
- [13] Z.-W. Hong, I. Shenfeld, T.-H. Wang, Y.-S. Chuang, A. Pareja, J. R. Glass, A. Srivastava, and P. Agrawal. Curiosity-driven red-teaming for large language models. In The Twelfth International Conference on Learning Representations, 2024.
- [14] N. H. R. Howe, I. R. Mckenzie, O. J. Hollinsworth, M. Zając, T. Tseng, A. D. Tucker, P.-L. Bacon, and A. Gleave. Scaling trends in language model robustness. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 24080–24138. PMLR, 2025.
- [15] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [16] J. Hu, J. K. Liu, H. Xu, and W. Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262, 2025.
- [17] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- [18]
- [19] E. La Malfa, A. Petrov, S. Frieder, C. Weinhuber, R. Burnell, R. Nazar, A. Cohn, N. Shadbolt, and M. Wooldridge. Language-Models-as-a-Service: Overview of a new paradigm and its challenges. Journal of Artificial Intelligence Research, 80:1497–1523, 2024.
- [20] M. Li, W. M. Si, M. Backes, Y. Zhang, and Y. Wang. SaLoRA: Safety-alignment preserved low-rank adaptation. In The Thirteenth International Conference on Learning Representations, 2025.
- [21]
- [22]
- [23] A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems, volume 37, pages 61065–61105, 2024.
- [24]
- [25] G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Learning for Dynamics and Control, pages 256–266. PMLR, 2020.
- [26] Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and...
- [27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- [28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [29] L. Schwinn, D. Dobre, S. Günnemann, and G. Gidel. Adversarial attacks and defenses in large language models: Old and new threats. In Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, volume 239 of Proceedings of Machine Learning Research, pages 103–117. PMLR, 2023.
- [30] A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158. Association for Computational...
- [31] Z. Tan, W. Yu, J. Si, T. Liu, K. Guan, H. Jin, J. Tao, X. Yuan, D. Ma, X. Zhang, T. Yang, and L. Sun. Triplay-RL: Tri-role self-play reinforcement learning for LLM safety alignment, 2026.
- [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 6000–6010, 2017.
- [33]
- [34] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
- [35] S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, and L. Schwinn. Efficient adversarial training in LLMs with continuous attacks. In Advances in Neural Information Processing Systems, volume 37, pages 1502–1530, 2024.
- [36]
- [37] Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek. A comprehensive study of jailbreak attack versus defense for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7432–7449, 2024.
- [38] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800. Association for Computational Linguistics, 2019.
- [39]
- [40] Y. Zhang, G. Qu, P. Xu, Y. Lin, Z. Chen, and A. Wierman. Global convergence of localized policy iteration in networked multi-agent reinforcement learning. In Proceedings of the ACM on Measurement and Analysis of Computing Systems, volume 7, pages 1–51. ACM, 2023.
- [41] C. Ziakas, N. Loo, N. Jain, and A. Russo. Red-Bandit: Test-time adaptation for LLM red-teaming via bandit-guided LoRA experts. arXiv preprint arXiv:2510.07239, 2025.
- [42]
Appendix A: More details on Bound 1 (an always-refuse equilibrium)
Recall our assumption that the defender can always issue a refusal response y_ref ∈ Y_D such that r(y_A, y_ref) = 0 for all y_A...
Therefore, suppose the attacker's policy space consists of probability distributions in a localised neighbourhood around a reference policy π̄_A, representing baseline adversarial behaviour already optimised to challenge the defender [27]. In particular, we assume that π̄_A concentrates on prompts for which safe responses yield no positive reward, so that th...
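A hedged reconstruction of why the always-refuse strategy named above is an equilibrium, using the excerpt's assumption r(y_A, y_ref) = 0 together with the additional (assumed here, not stated in the excerpt) condition that the attacker's reward is non-negative:

\pi_D(y_{\mathrm{ref}} \mid y_A) = 1 \;\; \forall y_A \quad \Longrightarrow \quad \mathbb{E}_{y_A \sim \pi_A} \big[ r(y_A, y_{\mathrm{ref}}) \big] = 0 \quad \text{for every } \pi_A,

so every attacker policy is a best response to the always-refuse defender, and since r ≥ 0 the defender is already at the game's minimum value; the pair (any π_A, always-refuse) is therefore a Nash equilibrium, which is exactly the trivial equilibrium the abstract flags as limiting practical applicability.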
discussion (0)