pith. machine review for the scientific record.

arxiv: 2605.01899 · v1 · submitted 2026-05-03 · 💻 cs.AI

Recognition: unknown

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 17:22 UTC · model grok-4.3

classification 💻 cs.AI
keywords persona-invariant alignment · adversarial self-play · jailbreak defense · LLM safety alignment · KL-divergence constraint · structural separation · persona lineage evolution

The pith

Adversarial self-play decouples LLM safety decisions from persona roles to resist jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Persona-Invariant Alignment, an adversarial self-play method in which attackers evolve risky personas through lineage propagation while defenders train via consistency learning to ignore persona context when making safety choices. It targets the specific weakness where current aligned models follow unsafe instructions once they adopt a given role or character. The approach rests on enforcing a one-sided divergence constraint so that safety outputs remain consistent regardless of the persona prompt. A sympathetic reader would care because persona-based attacks are an emerging way to bypass safety without changing the underlying request, and a method that preserves capability while blocking them could support safer use of LLMs in open-ended interactions.

Core claim

Persona-Invariant Consistency Learning, grounded in the structural separation hypothesis, applies a unilateral KL-divergence constraint between safety-focused and persona-augmented outputs. This produces a structural decoupling that keeps safety behavior intact under persona-based jailbreak attacks. The broader framework pairs this defense with Persona Lineage Evolution on the attack side, where credit propagates across related personas to explore high-risk spaces efficiently. Experiments show the resulting models achieve lower attack success rates without measurable loss in general performance.

What carries the argument

Persona-Invariant Consistency Learning (PICL), which uses a unilateral KL-divergence constraint to enforce safety decisions that are independent of persona context.
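
The loss itself is not quoted on this page. As a minimal PyTorch sketch, assuming the one-sided constraint treats the persona-free distribution as a fixed target, a unilateral KL consistency term could look like this (function name and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def picl_consistency_loss(logits_plain: torch.Tensor,
                          logits_persona: torch.Tensor) -> torch.Tensor:
    """One-sided consistency term: pull the persona-conditioned output
    distribution toward the persona-free one, never the reverse.

    logits_plain:   model logits for the query with no persona prompt
    logits_persona: logits for the same query wrapped in a persona prompt
    Both shaped (batch, seq_len, vocab_size).
    """
    # Detaching the persona-free branch makes it a fixed target, which is
    # what makes the constraint unilateral rather than symmetric.
    target = F.softmax(logits_plain, dim=-1).detach()
    log_pred = F.log_softmax(logits_persona, dim=-1)
    # F.kl_div(input=log_probs, target=probs) computes KL(target || pred).
    return F.kl_div(log_pred, target, reduction="batchmean")
```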

If this is right

  • PLE efficiently maps high-risk persona spaces through lineage-based credit assignment (see the sketch after this list).
  • PICL lowers attack success rate on persona jailbreaks while leaving general task performance intact.
  • The resulting alignment remains effective across a range of persona variations generated by the co-evolving attacker.
  • Safety behavior stays consistent even when the model is explicitly instructed to adopt conflicting roles.
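
The PLE internals are not reproduced on this page, but the Figure 4 ablations (w/o Lineage, w/o UCB) indicate the attacker combines UCB-style selection with lineage credit. A minimal sketch of how that combination could work; the node structure, exploration constant, and per-generation discount are illustrative assumptions, not the paper's values:

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonaNode:
    prompt: str
    parent: Optional["PersonaNode"] = None
    visits: int = 0
    credit: float = 0.0  # accumulated attack-success signal

def ucb_score(node: PersonaNode, total_visits: int, c: float = 1.4) -> float:
    # Standard UCB1 (Auer et al., 2002): trade off a node's mean payoff
    # against how rarely it has been tried.
    if node.visits == 0:
        return float("inf")
    return node.credit / node.visits + c * math.sqrt(math.log(total_visits) / node.visits)

def propagate_credit(node: Optional[PersonaNode], asr: float, decay: float = 0.5) -> None:
    # Lineage-based credit: an offspring's attack success also rewards its
    # ancestors, biasing the next round of mutation toward fertile lineages.
    weight = 1.0
    while node is not None:
        node.visits += 1
        node.credit += weight * asr
        node = node.parent
        weight *= decay  # per-generation discount; the value here is assumed
```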

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same decoupling technique might be tested on other contextual signals such as tone, domain, or user history that currently modulate safety.
  • Models trained this way could support safer long-form role-play applications where persona consistency is required but safety must not be overridden.
  • If the separation holds, it could reduce the need for repeated safety fine-tuning each time new persona styles emerge.

Load-bearing premise

Safety decisions can be structurally decoupled from persona context through a unilateral KL-divergence constraint without reducing general capability or creating new attack surfaces.

What would settle it

A controlled test in which models trained with PICL are evaluated on a fresh set of persona-based jailbreak prompts never seen during training. The claim would be refuted if those models show attack success rates equal to or higher than standard safety-tuned baselines, or measurable drops on unrelated capability benchmarks.
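
The headline quantity in that test is straightforward to compute. A minimal sketch, where `model.generate` and `judge` (any harmfulness classifier, e.g. a WildGuard-style moderator) are placeholder interfaces rather than the paper's API:

```python
def attack_success_rate(model, judge, heldout_prompts) -> float:
    """ASR over persona jailbreak prompts never seen during training.

    judge(prompt, response) -> bool flags a harmful completion; both
    `model` and `judge` are hypothetical interfaces for illustration.
    """
    successes = sum(1 for p in heldout_prompts if judge(p, model.generate(p)))
    return successes / len(heldout_prompts)
```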

Figures

Figures reproduced from arXiv: 2605.01899 by Jiajia Li, Qiaosheng Zhang, Shuyue Hu, Xiaoyu Wen, Zhen Wang, Zhongtian Ma.

Figure 1: Illustration of Persona-based Jailbreak Attacks. Top: A comparison between a direct harmful query and a persona-based harmful query. Bottom: The ASR of our elite persona-based attacks across multiple mainstream LLMs, compared to baseline scenarios. Motivated by this observation, the core research question of our paper is: given a fixed instruction intent, should the model’s safety decisions depend on perso…
Figure 2: Persona-Invariant Alignment via Adversarial Self-Play. The attack module PLE (red block) evolves high-risk personas. The resulting jailbreak samples are then fed into the defense module PICL (blue block), which jointly optimizes DPO, SFT, and persona-invariant consistency to decouple safety decisions from persona contexts. Defense: Variational Upper Bound. Corresponding to the attack mechanism, the defense…
Figure 3: Evolutionary trajectory of PLE versus Persona-GA. We show average and maximum ASR, and average and minimum RtA over 40 generations. Blue and red lines denote PLE and Persona-GA, respectively. Circles and triangles indicate results on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. Search Efficiency. We quantify search efficiency by visualizing the evolutionary trajectories of average and maximum ASR, as we…
Figure 4: Ablation study of PLE. Evolutionary trajectories of average and maximum ASR, and average and minimum RtA over 40 generations. Orange, green, and blue lines denote PLE, w/o Lineage, and w/o UCB, respectively. Circles and triangles indicate results on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. To verify the contribution of key components in PLE, we conduct an ablation study by comparing the full framew…
Original abstract

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at https://github.com/JiajiaLi-1130/PIA.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Persona-Invariant Alignment (PIA), an adversarial self-play framework for LLM safety against persona-based jailbreaks. It features Persona Lineage Evolution (PLE) on the attack side to explore high-risk personas via lineage-based credit propagation and Persona-Invariant Consistency Learning (PICL) on the defense side. PICL is theoretically grounded in the structural separation hypothesis and applies a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, with experiments claimed to show reduced attack success rate (ASR) while preserving general capabilities. Code is released.

Significance. If the structural separation hypothesis can be formally supported and the empirical gains hold under rigorous controls, the work would contribute a co-evolutionary defense paradigm that targets a specific, under-addressed vulnerability in current alignment techniques. The open-sourced implementation aids reproducibility.

major comments (3)
  1. [Abstract / PICL description] Abstract and theoretical grounding of PICL: The structural separation hypothesis is invoked to justify the unilateral KL-divergence constraint, yet no derivation is supplied showing that this term forces safety-related logits (or the optimal policy) to become independent of persona embeddings—for instance, by driving the gradient of safety outputs w.r.t. persona features to zero or by guaranteeing zero mutual information. This is load-bearing because the entire defense claim rests on the constraint achieving persona-invariant safety.
  2. [Experimental results] Experimental evaluation: The abstract states that PICL 'significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability,' but the description provides no quantitative ASR values, no baseline comparisons, and no ablation isolating the unilateral KL term. Without these, it is impossible to assess whether the reported gains are attributable to the proposed constraint or to other factors.
  3. [PLE method] PLE and self-play dynamics: The claim that lineage-based credit propagation efficiently explores high-risk persona spaces is central to the attack side, but the manuscript does not demonstrate that the generated attacks are out-of-distribution relative to the safety training data or that they expose vulnerabilities not already covered by standard red-teaming.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one concrete ASR number and a brief statement of the strongest baseline.
  2. [Method] Notation for the unilateral KL term and the structural separation hypothesis should be introduced with an explicit equation early in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that strengthening the theoretical derivation, providing explicit quantitative results with ablations, and validating the novelty of the generated attacks will improve the manuscript. We outline our responses and planned revisions below.

Point-by-point responses
  1. Referee: [Abstract / PICL description] Abstract and theoretical grounding of PICL: The structural separation hypothesis is invoked to justify the unilateral KL-divergence constraint, yet no derivation is supplied showing that this term forces safety-related logits (or the optimal policy) to become independent of persona embeddings—for instance, by driving the gradient of safety outputs w.r.t. persona features to zero or by guaranteeing zero mutual information. This is load-bearing because the entire defense claim rests on the constraint achieving persona-invariant safety.

    Authors: We acknowledge that the manuscript currently invokes the structural separation hypothesis without supplying an explicit derivation of how the unilateral KL term enforces independence. In the revision we will add a formal proof sketch in Section 3.2 showing that the unilateral KL constraint drives the gradient of the safety logits with respect to persona embeddings toward zero (via the chain rule on the KL term) and reduces mutual information between safety decisions and persona context. This addition will directly support the load-bearing claim; a hedged sketch of the intended independence condition appears after this response list. revision: yes

  2. Referee: [Experimental results] Experimental evaluation: The abstract states that PICL 'significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability,' but the description provides no quantitative ASR values, no baseline comparisons, and no ablation isolating the unilateral KL term. Without these, it is impossible to assess whether the reported gains are attributable to the proposed constraint or to other factors.

    Authors: The full manuscript contains quantitative ASR results and baseline comparisons in Table 2 and Section 4.2. However, we agree that an explicit ablation isolating the unilateral KL term is missing. In the revision we will (i) insert the concrete ASR numbers and baseline comparisons into the abstract, (ii) add a dedicated ablation subsection (Table 3) that removes the KL term while keeping all other components fixed, and (iii) report the resulting ASR degradation to demonstrate the term's contribution. revision: yes

  3. Referee: [PLE method] PLE and self-play dynamics: The claim that lineage-based credit propagation efficiently explores high-risk persona spaces is central to the attack side, but the manuscript does not demonstrate that the generated attacks are out-of-distribution relative to the safety training data or that they expose vulnerabilities not already covered by standard red-teaming.

    Authors: We will add two new analyses in the revised Section 4.1: (1) quantitative OOD metrics (cosine distance in embedding space and perplexity under the safety fine-tuning distribution) showing that PLE-generated personas lie outside the training support, and (2) a comparison against standard red-teaming baselines (e.g., AdvBench, GCG) demonstrating that a non-trivial fraction of PLE attacks succeed on models that already resist those baselines. These additions will substantiate the claim that lineage-based exploration uncovers new vulnerabilities. revision: yes
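
As a concrete reading of the first analysis promised in response 3, a nearest-neighbor cosine-distance OOD score might look like the sketch below; the embedding source (e.g., an M3-style text encoder) and all names are assumptions, not the authors' code:

```python
import numpy as np

def nearest_neighbor_cosine_distance(persona_embs: np.ndarray,
                                     train_embs: np.ndarray) -> np.ndarray:
    """For each evolved persona embedding, the cosine distance to its
    nearest neighbor in the safety-training set; larger values indicate
    personas farther outside the training support."""
    p = persona_embs / np.linalg.norm(persona_embs, axis=1, keepdims=True)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = p @ t.T                 # pairwise cosine similarities
    return 1.0 - sims.max(axis=1)  # distance to the closest training persona
```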
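
For response 1, the promised derivation would need to establish something like the condition below. This is not the paper's proof, only the target property written out, assuming the unilateral KL runs from the persona-free policy to the persona-conditioned one:

```latex
% \pi_\theta(\cdot \mid x): the policy on query x with no persona prompt;
% \pi_\theta(\cdot \mid x, c): the same policy conditioned on persona c.
\[
  \mathcal{L}_{\mathrm{KL}}(\theta)
    = \mathbb{E}_{x,\,c}\,
      D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_\theta(\cdot \mid x, c) \right)
\]
\[
  \mathcal{L}_{\mathrm{KL}}(\theta) = 0
  \;\Longrightarrow\;
  \pi_\theta(\cdot \mid x, c) = \pi_\theta(\cdot \mid x) \;\;\text{for all } c
  \;\Longrightarrow\;
  I(Y; C \mid X) = 0
\]
% i.e., at the constraint's optimum the model's output carries no mutual
% information about the persona context, which is the independence the
% structural separation hypothesis asserts.
```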

Circularity Check

1 step flagged

The structural separation hypothesis assumes the persona-invariance that the unilateral KL term in PICL is claimed to derive

specific steps
  1. self-definitional [Abstract]
    "Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks."

    The structural separation hypothesis is defined as the possibility of decoupling safety decisions from persona context. The unilateral KL is then presented as the mechanism that 'enables' exactly this decoupling. The claimed theoretical result (maintaining safe behavior via decoupling) is therefore equivalent to the input hypothesis rather than derived from it; the method implements the assumption it invokes to justify itself.

full rationale

The paper's central theoretical claim rests on introducing the 'structural separation hypothesis' to justify the unilateral KL-divergence constraint in PICL, which is then said to 'enable the structural decoupling'. No independent derivation is provided showing that the KL term forces zero dependence (e.g., via gradients or mutual information) between safety logits and persona embeddings; instead the hypothesis directly encodes the desired invariance. This reduces the 'theoretical grounding' to a restatement of the target property, making the defense side of the self-play framework circular by construction. The attack side (PLE) is independent, but the load-bearing defense claim is not. This is a moderate circularity (score 6) rather than total (no self-citation chain or explicit fit-to-prediction reduction is quoted).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the structural separation hypothesis (introduced without prior citation) and the assumption that unilateral KL-divergence suffices to enforce persona invariance without side effects.

axioms (1)
  • structural separation hypothesis (ad hoc to this paper)
    Invoked to justify that safety decisions can be decoupled from persona context via the KL constraint.

pith-pipeline@v0.9.0 · 5531 in / 1311 out tokens · 22545 ms · 2026-05-09T17:22:08.022431+00:00

discussion (0)


Reference graph

Works this paper leans on

70 extracted references · 26 canonical work pages · 9 internal anchors

  1. [1] Emma Harvey, Allison Koenecke, and Rene F Kizilcec. "Don’t forget the teachers": Towards an educator-centered understanding of harms from large language models in education. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–19, 2025.
  2. [2] Wei Zhai, Nan Bai, Qing Zhao, Jianqiang Li, Fan Wang, Hongzhi Qi, Meng Jiang, Xiaoqin Wang, Bing Xiang Yang, and Guanghui Fu. MentalGLM series: Explainable large language models for mental health analysis on Chinese social media. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13599–13614, 2025.
  3. [3] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  4. [4] Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for LLM conversation safety: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747, 2024.
  5. [5] Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, and Stjepan Picek. A comprehensive study of jailbreak attack versus defense for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7432–7449, 2024.
  6. [6] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024.
  7. [7] Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety. Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026.
  8. [8] Mikayel Samvelyan, Sharath C Raparthy, Andrei Lupu, Eric Hambro, Aram H Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems, 37:69747–69786, 2024.
  9. [9] Zheng Zhang, Peilin Zhao, Deheng Ye, and Hao Wang. Enhancing jailbreak attacks on LLMs via persona prompts. arXiv preprint arXiv:2507.22171, 2025.
  10. [10] Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. A comprehensive survey in LLM(-agent) full stack safety: Data, training and deployment. arXiv preprint arXiv:2504.15585, 2025.
  11. [11] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, 2024.
  12. [12] Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All languages matter: On the multilingual safety of LLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5865–5877, 2024.
  13. [13] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
  14. [14] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  15. [15] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025.
  16. [16] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
  17. [17] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems, 37:61065–61105, 2024.
  18. [18] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in ChatGPT: Analyzing persona-assigned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1236–1270, 2023.
  19. [19] Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  20. [20] Wesley Hanwen Deng, Sunnie SY Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, and Leon A Gatys. PersonaTeaming: Exploring how introducing personas can improve automated AI red-teaming. arXiv preprint arXiv:2509.03728, 2025.
  21. [21] Shanghai AI Lab. SafeWork-R1: Coevolving safety and intelligence under the AI-45° Law. arXiv preprint arXiv:2507.18576, 2025.
  22. [22] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
  23. [23] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
  24. [24] Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, and Viswanathan Swaminathan. Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv preprint arXiv:2311.11509, 2023.
  25. [25] Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, and Shiyu Chang. Advancing the robustness of large language models through self-denoised smoothing. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume …), 2024.
  26. [26] Yihan Wang, Zhouxing Shi, Andrew Bai, and Cho-Jui Hsieh. Defending LLMs against jailbreaking attacks via backtranslation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16031–16046, 2024.
  27. [27] Xinhao Song, Sufeng Duan, and Gongshen Liu. ALIS: Aligned LLM instruction security strategy for unsafe input prompt. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9124–9146, 2025.
  28. [28] Deepak Kumar, Yousef Anees AbuHashem, and Zakir Durumeric. Watch your language: Investigating content moderation with large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 865–878, 2024.
  29. [29] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023.
  30. [30] Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, and Bo Li. RigorLLM: Resilient guardrails for large language models against undesired content. arXiv preprint arXiv:2403.13031, 2024.
  31. [31] Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, and Minlie Huang. Defending large language models against jailbreaking attacks through goal prioritization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8865–8887, 2024.
  32. [32] Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, and Weiran Xu. SEAS: Self-evolving adversarial safety optimization for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23778–23786, 2025.
  33. [33] Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, and Natasha Jaques. Chasing moving targets with online self-play reinforcement learning for safer language models (Self-RedTeam). arXiv preprint arXiv:2506.07468, 2025.
  34. [34] Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. STAIR: Improving safety alignment with introspective reasoning. arXiv preprint arXiv:2502.02384, 2025.
  35. [35] Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, and Qiaosheng Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026.
  36. [36] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180. PMLR, 2019.
  37. [37] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
  38. [38] Tom Minka et al. Divergence measures and message passing. 2005.
  39. [39] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
  40. [40] AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  41. [41] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.
  42. [42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  43. [43] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024.
  44. [44] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  45. [45] Mantas Mazeika, Dan Hendrycks, Huichen Li, Xiaojun Xu, Sidney Hough, Andy Zou, Arezoo Rajabi, Qi Yao, Zihao Wang, Jian Tian, et al. The trojan detection challenge. In NeurIPS 2022 Competition Track, pages 279–291. PMLR, 2023.
  46. [46] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. arXiv preprint arXiv:2406.15513, 2024.
  47. [47] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 2024.
  48. [48] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.
  49. [49] Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947, 2024.
  50. [50] Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.
  51. [51] Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. LLM self defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.
  52. [52] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks, 2024.
  53. [53] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
  54. [54] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987, 2023.
  55. [55] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. Advances in Neural Information Processing Systems, 37:47094–47165, 2024.
  56. [56] Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
  57. [57] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
  58. [58] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  59. [59] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
  60. [60] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  61. [61] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  62. [62] Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. CharacterEval: A Chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024.
  63. [63] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1840–1873, 2024.
  64. [64] 16Personalities. 16Personalities — free personality test, type descriptions, relationship and career advice, 2026. URL https://www.16personalities.com/. Accessed: 2026-01-26.
  65. [65] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl.
  66. [66] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024.
  67. [67] IFEval [57]: A dataset of approximately 500 verifiable instructions, designed to rigorously measure the instruction-following ability of fine-tuned models.
  68. [68] AI2-ARC [58]: A collection of 7,787 grade-school science questions (Challenge and Easy sets), constructed to evaluate advanced question-answering and reasoning capabilities.
  69. [69] GPQA-Diamond [59]: A widely adopted subset of the GPQA benchmark, consisting of 198 high-quality, expert-validated multiple-choice questions in biology, physics, and chemistry, serving as a challenging test of domain expertise and advanced reasoning abilities.
  70. [70] MMLU [60, 61]: A massive multitask benchmark covering 57 diverse subjects across STEM, the humanities, and social sciences, widely adopted as a standard proxy for broad world knowledge and problem-solving ability. From Appendix C (Experiment Details): the training pipeline is implemented based on TRL [65], a widely used library for post-training foundation models.