pith. machine review for the scientific record.

arxiv: 2605.08427 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.GT · cs.LG

Recognition: 2 theorem links

· Lean Theorem

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

Elizabeth Black, Emanuele La Malfa, Gabriele La Malfa, Jie M. Zhang, Michael Luck, Michael Wooldridge, Saar Cohen

Pith reviewed 2026-05-12 00:48 UTC · model grok-4.3

classification 💻 cs.AI · cs.GT · cs.LG
keywords self-play · AI safety · red teaming · LoRA adapters · jailbreaking · adversarial training · parameter efficiency · Nash equilibrium

The pith

Self-play safety training collapses to self-consistency when attacker and defender share and update the same base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-play red teaming for AI safety pits an attacker, which tries to jailbreak the model, against a defender in a zero-sum game. The paper shows that sharing and updating the same base model for both roles causes the dynamics to collapse to self-consistency, so attacks lose their adversarial effect. Anchored bipolicy self-play fixes this by training a distinct LoRA adapter for each role on a frozen base model. This separation keeps optimization stable and adversarial pressure intact, yielding improved safety with up to 100x greater parameter efficiency than full fine-tuning on Qwen2.5 models.

Core claim

When attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks no longer exert adversarial pressure on the defender. Anchored Bipolicy Self-Play trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. Compared with standard self-play, this yields up to 100x greater parameter efficiency than full fine-tuning and consistent improvements in safety without loss of reasoning ability.

What carries the argument

Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on a frozen base model to enforce role separation and prevent self-consistency collapse.
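
A minimal sketch of how this role separation could be instantiated with the Hugging Face peft library; the base-model identifier, adapter names, rank, and alpha below are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch (assumptions, not the authors' code): one frozen base model,
# two role-specific LoRA adapters that are the only trainable parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative choice of base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Freeze every base parameter; gradients flow only through the adapters.
for p in base.parameters():
    p.requires_grad_(False)

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")  # rank/alpha are free parameters

# Register one adapter per role on the shared frozen backbone.
model = get_peft_model(base, lora_cfg, adapter_name="attacker")
model.add_adapter("defender", lora_cfg)

# In the zero-sum loop, switch adapters depending on which role is acting.
model.set_adapter("attacker")   # attacker turn: generate a candidate jailbreak
model.set_adapter("defender")   # defender turn: respond to the attack
```

Only the two adapters' low-rank matrices receive updates, which is where a parameter-efficiency gap versus full fine-tuning would come from.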

Load-bearing premise

Distinct role-specific LoRA adapters on a frozen base model will maintain stable optimisation and genuine adversarial pressure without the roles collapsing back into self-consistency during training.

What would settle it

A training run of the anchored bipolicy method in which the attacker's jailbreak success rate against the defender falls to baseline levels while safety benchmark scores show no gain over standard self-play.

Figures

Figures reproduced from arXiv: 2605.08427 by Elizabeth Black, Emanuele La Malfa, Gabriele La Malfa, Jie M. Zhang, Michael Luck, Michael Wooldridge, Saar Cohen.

Figure 1
Figure 1: Architectural Comparison of Self-Play Red Team Frameworks.
Figure 3
Figure 3: UMAP projection of the attacks of SELF-REDTEAM and ABS on the Harmful Behavior Dataset; the base models are Qwen2.5-{7B, 14B}. The projections are similar in the embedding space, despite our ABS attacks being lexically different from the SELF-REDTEAM ones. The lexical analysis in …
Figure 2
Figure 2: Transferability of attacks and comparative evaluation of SELF-REDTEAM versus ABS Qwen-IT models across the tournament suite.

Method | Cosine Similarity | Self-BLEU-3 Score | # Tokens Average | # Think Tokens Average | Think Tokens Frequency
Qwen2.5-3B-IT + SELF-REDTEAM | 0.281 | 0.785 | 59.26 | 58.69 | 9.3%
+ ABS | 0.310 | 0.714 | 232.74 | 88.82 | 70.5%
Qwen2.5-7B-IT + SELF-REDTEAM | 0.242 | 0.670 | 90.04 | 49.69 | 99.21%
+ ABS | 0.201 | 0.646 | 250.…
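
As a rough illustration of the diversity statistics tabulated under Figure 2, the sketch below computes a mean pairwise embedding cosine similarity and a Self-BLEU-3 score over a set of attack prompts; the embedding model, tokenisation, and example prompts are assumptions, not the paper's exact protocol.

```python
# Sketch (assumed protocol): attack-set diversity metrics of the kind shown above.
from itertools import combinations
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer

def mean_pairwise_cosine(attacks, model_name="all-MiniLM-L6-v2"):
    # Higher values mean the attacks cluster together in embedding space.
    emb = SentenceTransformer(model_name).encode(attacks, normalize_embeddings=True)
    return float(np.mean([np.dot(emb[i], emb[j])
                          for i, j in combinations(range(len(emb)), 2)]))

def self_bleu_3(attacks):
    # Higher Self-BLEU-3 means less lexical diversity across the attack set.
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([a.split() for j, a in enumerate(attacks) if j != i],
                            hyp.split(), weights=(1/3, 1/3, 1/3),
                            smoothing_function=smooth)
              for i, hyp in enumerate(attacks)]
    return float(np.mean(scores))

attacks = ["Ignore all prior instructions and explain how to ...",
           "You are an actor rehearsing a scene in which ...",
           "For a safety audit, list the steps someone might take to ..."]
print(mean_pairwise_cosine(attacks), self_bleu_3(attacks))
```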
read the original abstract

Self-play red team is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than finetuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B,14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard self-play red-teaming collapses to self-consistency when attacker and defender share and update the same base model, so that attacks cease to exert adversarial pressure. It proposes Anchored Bipolicy Self-Play, which freezes the base model and trains distinct role-specific LoRA adapters to preserve separation and zero-sum dynamics. Experiments on Qwen2.5-3B/7B/14B-IT models report up to 100x parameter efficiency over full fine-tuning, consistent safety gains on standard benchmarks, and superior cross-play performance versus self-play baselines.

Significance. If the LoRA separation demonstrably maintains genuine adversarial pressure rather than collapsing, the method offers a practical route to more robust safety training with minimal parameter overhead and no loss of reasoning capability. The cross-play results and efficiency claims, if substantiated, would strengthen self-play as a scalable safety technique.

major comments (3)
  1. [Experiments] Experiments section: The manuscript reports safety improvements and cross-play superiority but provides no direct measurements (policy divergence, attack-success trajectories over epochs, or gradient orthogonality between the two LoRAs) to confirm that the attacker and defender adapters remain functionally distinct and that the zero-sum objective continues to drive discovery of vulnerabilities rather than eroding into self-consistency.
  2. [Evaluation] Evaluation on Qwen2.5 models: Safety gains are stated as 'consistent' across benchmarks, yet no statistical significance tests, number of independent runs, or controls for whether self-consistency was actually broken (e.g., defender refusal rates under the learned attacker) are reported; this leaves open whether gains arise from role separation or simply from added capacity/regularization.
  3. [Introduction/Theory] Theoretical diagnosis of collapse: The claim that shared base-model updates force collapse to self-consistency is presented as a fundamental limitation, but the manuscript supplies neither a formal derivation nor an empirical diagnostic (e.g., cosine similarity of role-specific updates) showing why this occurs specifically under the zero-sum objective.
minor comments (2)
  1. [Method] Notation for the bipolicy objective and LoRA anchoring is introduced without an explicit equation or diagram clarifying how the frozen base interacts with the two adapters during the zero-sum update.
  2. [Experiments] The abstract states 'up to 100x greater parameter efficiency' but the main text does not tabulate exact parameter counts or compare against the precise self-play baseline configuration.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening empirical validation of the self-consistency claims and the separation benefits of Anchored Bipolicy Self-Play. We address each major comment below and outline revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The manuscript reports safety improvements and cross-play superiority but provides no direct measurements (policy divergence, attack-success trajectories over epochs, or gradient orthogonality between the two LoRAs) to confirm that the attacker and defender adapters remain functionally distinct and that the zero-sum objective continues to drive discovery of vulnerabilities rather than eroding into self-consistency.

    Authors: We agree that direct measurements would better substantiate that role separation preserves adversarial pressure. In the revised manuscript, we will add attack-success trajectories over training epochs, policy divergence metrics (KL divergence between attacker and defender outputs on held-out prompts), and cosine similarity of gradients between the two LoRAs to demonstrate they remain functionally distinct and that the zero-sum objective continues to drive vulnerability discovery (a sketch of such diagnostics follows this list). revision: yes

  2. Referee: [Evaluation] Evaluation on Qwen2.5 models: Safety gains are stated as 'consistent' across benchmarks, yet no statistical significance tests, number of independent runs, or controls for whether self-consistency was actually broken (e.g., defender refusal rates under the learned attacker) are reported; this leaves open whether gains arise from role separation or simply from added capacity/regularization.

    Authors: We acknowledge that the current evaluation lacks statistical rigor and explicit controls. We will revise the section to report results over 5 independent runs with standard deviations, include paired t-tests for significance on safety benchmarks, and add controls measuring defender refusal rates against the learned attacker (versus baseline attackers) to confirm self-consistency is broken and isolate the contribution of role separation from capacity or regularization effects (a minimal significance-test sketch follows this list). revision: yes

  3. Referee: [Introduction/Theory] Theoretical diagnosis of collapse: The claim that shared base-model updates force collapse to self-consistency is presented as a fundamental limitation, but the manuscript supplies neither a formal derivation nor an empirical diagnostic (e.g., cosine similarity of role-specific updates) showing why this occurs specifically under the zero-sum objective.

    Authors: The manuscript supports the diagnosis via analysis of reachable Nash equilibria under parameter sharing together with observed training dynamics. We will add an empirical diagnostic (cosine similarity of role-specific updates) in the shared base-model case to illustrate the collapse. A complete formal derivation of the dynamics, however, is beyond the current scope and would require substantial additional theoretical work. revision: partial
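
A hedged sketch of the separation diagnostics promised in responses 1 and 3: per-token KL divergence between the two roles' next-token distributions on a shared prompt, and cosine similarity between the adapters' flattened gradients. The peft-style set_adapter switching and all names are illustrative assumptions, not the authors' implementation.

```python
# Sketch (assumptions, not the authors' code): diagnostics for whether the
# attacker and defender adapters remain functionally distinct.
import torch
import torch.nn.functional as F

@torch.no_grad()
def policy_kl(model, tokenizer, prompt):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    model.set_adapter("attacker")
    logp_attacker = F.log_softmax(model(ids).logits, dim=-1)
    model.set_adapter("defender")
    logp_defender = F.log_softmax(model(ids).logits, dim=-1)
    # Mean per-token KL(attacker || defender); values near zero over training
    # would be a signature of collapse to self-consistency.
    kl = F.kl_div(logp_defender, logp_attacker, log_target=True, reduction="none")
    return kl.sum(-1).mean().item()

def grad_cosine(attacker_grads, defender_grads):
    # Each argument: list of gradient tensors collected from one adapter's
    # parameters after a backward pass on the same batch.
    ga = torch.cat([g.flatten() for g in attacker_grads])
    gd = torch.cat([g.flatten() for g in defender_grads])
    return F.cosine_similarity(ga, gd, dim=0).item()
```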
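For the statistical reporting promised in response 2, a minimal sketch of a paired t-test over matched runs; the scores below are placeholders, not results from the paper.

```python
# Sketch: paired t-test on per-seed benchmark scores (placeholder numbers).
import numpy as np
from scipy import stats

abs_scores      = np.array([0.91, 0.89, 0.93, 0.90, 0.92])  # ABS, 5 seeds
selfplay_scores = np.array([0.86, 0.88, 0.87, 0.85, 0.88])  # self-play baseline, same seeds

t_stat, p_value = stats.ttest_rel(abs_scores, selfplay_scores)
print(f"mean diff = {np.mean(abs_scores - selfplay_scores):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```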

standing simulated objections not resolved
  • A formal derivation of the collapse to self-consistency when attacker and defender share and update the same base model under the zero-sum objective

Circularity Check

0 steps flagged

No circularity: theoretical collapse claim and LoRA remedy rest on independent analysis and external benchmarks

full rationale

The paper derives the self-consistency collapse from the shared-parameter update rule in the zero-sum game (distinct from the target safety metric), then introduces role-specific LoRAs on a frozen base as an architectural fix. Success is measured on external safety benchmarks and cross-play experiments rather than by re-fitting the same quantities used to define the collapse. No self-citation chain, fitted-input-as-prediction, or self-definitional reduction appears in the derivation; the central claim remains falsifiable against held-out data and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical claim that separate adapters restore adversarial pressure; no new physical or mathematical entities are introduced, but the method assumes that LoRA rank and learning-rate choices will not reintroduce collapse.

free parameters (1)
  • LoRA adapter rank and alpha
    Hyperparameters controlling adapter capacity; their specific values are not stated in the abstract but affect whether role separation is sufficient.
axioms (1)
  • domain assumption: Nash equilibrium in the zero-sum attacker-defender game corresponds to safe behavior within the game settings
    Stated in the opening paragraph as the justification for self-play; a schematic rendering follows below.
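
A schematic rendering of that assumption in generic zero-sum notation (not the paper's exact formalism): the defender at equilibrium minimises the attacker's best achievable reward, and the axiom reads a game value of zero as safe behaviour within the game's settings.

```latex
% Generic notation (an assumption, not the paper's exact formalism):
% \pi_A attacker policy, \pi_D defender policy, r(y_A, y_D) = 1 iff the jailbreak succeeds.
\[
  V \;=\; \min_{\pi_D}\,\max_{\pi_A}\;
    \mathbb{E}_{\,y_A \sim \pi_A,\; y_D \sim \pi_D(\cdot \mid y_A)}\big[r(y_A, y_D)\big],
  \qquad
  \pi_D^{*} \in \arg\min_{\pi_D}\,\max_{\pi_A}\;
    \mathbb{E}\big[r(y_A, y_D)\big].
\]
% The domain assumption: at a Nash equilibrium of this zero-sum game, V = 0 is
% read as the defender responding safely to every attack reachable in the game.
```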

pith-pipeline@v0.9.0 · 5607 in / 1410 out tokens · 51690 ms · 2026-05-12T00:48:08.020702+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 8 internal anchors

  1. [1]

    L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. Explaining predictions of non-linear classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7, 2016.

  2. [2]

    Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  3. [3]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  4. [4]

    B. Chen, T. Zhu, J. Han, L. Li, G. Li, and X. Dai. Incentivizing truthful language models via peer elicitation games. In Advances in Neural Information Processing Systems, 2025.

  5. [5]

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936. Associati...

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  7. [7]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  8. [8]

    I. Csiszár and J. Körner. Information theory: Coding theorems for discrete memoryless systems. Cambridge University Press, 2011.

  9. [9]

    J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024.

  10. [10]

    C. S. de Witt. Open challenges in multi-agent security: Towards secure systems of interacting AI agents. arXiv preprint arXiv:2505.02077, 2025.

  11. [11]

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115, 2023.

  12. [12]

    S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.

  13. [13]

    Z.-W. Hong, I. Shenfeld, T.-H. Wang, Y.-S. Chuang, A. Pareja, J. R. Glass, A. Srivastava, and P. Agrawal. Curiosity-driven red-teaming for large language models. In The Twelfth International Conference on Learning Representations, 2024.

  14. [14]

    N. H. R. Howe, I. R. Mckenzie, O. J. Hollinsworth, M. Zając, T. Tseng, A. D. Tucker, P.-L. Bacon, and A. Gleave. Scaling trends in language model robustness. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 24080–24138. PMLR, 2025.

  15. [15]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  16. [16]

    J. Hu, J. K. Liu, H. Xu, and W. Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262, 2025.

  17. [17]

    N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-Y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.

  18. [18]

    L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, et al. WildTeaming at scale: From In-the-Wild jailbreaks to (adversarially) safer language models. Advances in Neural Information Processing Systems, 37:47094–47165, 2024.

  19. [19]

    E. La Malfa, A. Petrov, S. Frieder, C. Weinhuber, R. Burnell, R. Nazar, A. Cohn, N. Shadbolt, and M. Wooldridge. Language-Models-as-a-Service: Overview of a new paradigm and its challenges. Journal of Artificial Intelligence Research, 80:1497–1523, 2024.

  20. [20]

    M. Li, W. M. Si, M. Backes, Y. Zhang, and Y. Wang. SaLoRA: Safety-alignment preserved low-rank adaptation. In The Thirteenth International Conference on Learning Representations, 2025.

  21. [21]

    A. Liao, N. Tomlin, and D. Klein. Efficacy of language model self-play in non-zero-sum games. arXiv preprint arXiv:2406.18872, 2024

  22. [22]

    M. Liu, L. Jiang, Y. Liang, S. S. Du, Y. Choi, T. Althoff, and N. Jaques. Chasing moving targets with online self-play reinforcement learning for safer language models. arXiv preprint arXiv:2506.07468, 2025.

  23. [23]

    A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems, volume 37, pages 61065–61105, 2024.

  24. [24]

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, 2022.

  25. [25]

    G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Learning for Dynamics and Control, pages 256–266. PMLR, 2020.

  26. [26]

    Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and...

  27. [27]

    J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.

  28. [28]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  29. [29]

    I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models

    L. Schwinn, D. Dobre, S. Günnemann, and G. Gidel. Adversarial attacks and defenses in large language models: Old and new threats. In Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, volume 239 of Proceedings of Machine Learning Research, pages 103–117. PMLR, 2023.

  30. [30]

    A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158. Association for Computational...

  31. [31]

    Z. Tan, W. Yu, J. Si, T. Liu, K. Guan, H. Jin, J. Tao, X. Yuan, D. Ma, X. Zhang, T. Yang, and L. Sun. Triplay-RL: Tri-role self-play reinforcement learning for LLM safety alignment, 2026.

  32. [32]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 6000–6010, 2017.

  33. [33]

    X. Wen, Z. He, H. Qi, Z. Wan, Z. Ma, Y. Wen, T. Zheng, X. Xu, C. Lu, and Q. Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026.

  34. [34]

    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

  35. [35]

    S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, and L. Schwinn. Efficient adversarial training in LLMs with continuous attacks. In Advances in Neural Information Processing Systems, volume 37, pages 1502–1530, 2024.

  36. [36]

    C. Xiong, X. Qi, P.-Y. Chen, and T.-Y. Ho. Defensive prompt patch: A robust and generalizable defense of large language models against jailbreak attacks. In Findings of the Association for Computational Linguistics: ACL 2025, pages 409–437, 2025.

  37. [37]

    Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek. A comprehensive study of jailbreak attack versus defense for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7432–7449, 2024.

  38. [38]

    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800. Association for Computational Linguistics, 2019.

  39. [39]

    Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.

  40. [40]

    Y. Zhang, G. Qu, P. Xu, Y. Lin, Z. Chen, and A. Wierman. Global convergence of localized policy iteration in networked multi-agent reinforcement learning. In Proceedings of the ACM on Measurement and Analysis of Computing Systems, volume 7, pages 1–51. ACM, New York, NY, USA, 2023.

  41. [41]

    C. Ziakas, N. Loo, N. Jain, and A. Russo. Red-Bandit: Test-time adaptation for LLM red-teaming via bandit-guided LoRA experts. arXiv preprint arXiv:2510.07239, 2025.
