Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Amanda Minnich; Blake Bullwinkel; Eugenia Kim; Mark Russinovich

arxiv: 2606.09701 · v1 · pith:FXSPYSGKnew · submitted 2026-06-08 · 💻 cs.CL · cs.AI· cs.LG

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Blake Bullwinkel , Eugenia Kim , Amanda Minnich , Mark Russinovich This is my paper

Pith reviewed 2026-06-27 16:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords red teaminglanguage modelsGRPOco-trainingattacker-defender optimizationsafety benchmarksreinforcement learningadversarial attacks

0 comments

The pith

AdvGRPO stabilizes GRPO for joint attacker-defender optimization in language model red teaming

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AdvGRPO as a way to make GRPO stable enough for co-training an attacker model and a defender model together. It adds dense multi-channel rewards and decoupled advantage normalization, then runs training through a curriculum that begins with single-turn attacks, advances to closed-loop multi-turn attacks, and ends with alternating updates between the two models. This setup is shown to generate highly effective and transferable attacks while producing defenders that beat baselines on safety benchmarks. A reader would care because the approach offers an adaptive loop for continually testing and hardening language models against evolving threats rather than relying on fixed attack sets.

Core claim

AdvGRPO is a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. The method produces highly effective and transferable attacks and co-trained defenders that outperform baselines on safety benchmarks.

What carries the argument

AdvGRPO, the framework that adds dense multi-channel rewards and decoupled advantage normalization to stabilize GRPO during alternating attacker and defender updates under a curriculum schedule

If this is right

The co-training process yields highly effective and transferable attacks on language models.
Co-trained defender models outperform standard baselines when evaluated on safety benchmarks.
A curriculum progressing from single-turn to multi-turn closed-loop attacks enables progressive improvement in attack and defense capabilities.
Alternating updates allow the attacker and defender to adapt to each other's evolving strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward and normalization adjustments could potentially stabilize other reinforcement learning methods when applied to multi-agent language model training.
Joint attacker-defender learning may reduce the need for separate, hand-crafted attack generation pipelines in red teaming workflows.
Longer interaction horizons beyond the closed-loop multi-turn setting could surface additional attack patterns not captured in the current curriculum.

Load-bearing premise

The assumption that adding dense multi-channel rewards and decoupled advantage normalization is sufficient to stabilize GRPO for joint attacker-defender optimization under the described curriculum and alternation schedule.

What would settle it

Running the same curriculum and alternation schedule with standard GRPO (no dense multi-channel rewards or decoupled advantage normalization) and checking whether training becomes unstable or yields ineffective attacks, or verifying whether the resulting defenders fail to outperform baselines on safety benchmarks, would directly test the central claim.

Figures

Figures reproduced from arXiv: 2606.09701 by Amanda Minnich, Blake Bullwinkel, Eugenia Kim, Mark Russinovich.

**Figure 1.** Figure 1: AdvGRPO architecture. (a) Attack rollout generation. The attacker πA is supplied with a system prompt and an objective and exchanges messages with the defender πD for up to K turns. Each defender response rk is scored by the attack reward A, which measures the extent to which the response satisfies the attack objective. Episodes where Ak exceeds a threshold are pruned early. The prompt scorer P evaluates a… view at source ↗

**Figure 2.** Figure 2: Upper: Reward curves for Qwen3.5-9B attacker-only training with GPT-4.1 as the defender. In addition to A and P, this reasoning-capable attacker is trained to maximize a thinking-trace reward T . Lower: Reward curves for co-training with Qwen2.5-14B as the attacker and Qwen2.5-7B as the defender. Combined attacker rewards are computed via weighted sum of the independently normalized reward channels. uncens… view at source ↗

**Figure 3.** Figure 3: Policy entropy over training steps under three settings: no regularization (left), [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

read the original abstract

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdvGRPO adds dense multi-channel rewards and decoupled normalization to GRPO for attacker-defender co-training plus a curriculum, but the abstract shows no data confirming those changes fix the reported instability.

read the letter

The main takeaway is that this paper tries to make GRPO work for joint attacker-defender optimization by introducing dense multi-channel rewards and decoupled advantage normalization on top of a curriculum that ramps from single-turn to closed-loop multi-turn attacks with alternation.

What is new is the specific AdvGRPO framing that targets the instability noted in earlier PPO and DPO co-training papers. The curriculum and alternation schedule are laid out clearly as a way to bootstrap the process, which is a reasonable engineering step for this setting.

The soft spot is the complete absence of any numbers, curves, or ablations. The abstract states that the method yields highly effective transferable attacks and that co-trained defenders beat baselines on safety benchmarks, yet nothing is shown to link those outcomes to the two listed modifications rather than the curriculum or reward details. The stress-test concern holds: without evidence that the rewards and normalization actually bound advantages or prevented collapse, the central claim rests on an untested assumption.

This is aimed at researchers working on RL for red teaming and adversarial robustness in language models. A reader already experimenting with GRPO variants might pick up the framework for ideas, but the lack of results makes it hard to judge impact. It deserves peer review if the full paper contains proper experiments and comparisons, because the underlying problem is practical and the response is direct, even if the current version is too thin to evaluate.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AdvGRPO, a co-training framework extending Group Relative Policy Optimization (GRPO) for joint attacker-defender optimization in LLM red teaming. It employs dense multi-channel rewards and decoupled advantage normalization to stabilize training, uses a curriculum progressing from single-turn to closed-loop multi-turn attacks, and alternates updates between attacker and defender models. The central claims are that this produces highly effective and transferable attacks while yielding co-trained defenders that outperform baselines on safety benchmarks.

Significance. If the stability modifications and empirical outcomes hold, the work offers a concrete path to adaptive, closed-loop red teaming that could improve defender robustness beyond static benchmarks. The curriculum design and alternation schedule address a practically relevant multi-turn regime that prior PPO/DPO co-training approaches have not fully tackled.

major comments (2)

[§4 and §5] §4 (AdvGRPO description) and §5 (experiments): The claim that dense multi-channel rewards plus decoupled advantage normalization suffice to stabilize GRPO (contrasting with prior reports of instability) is load-bearing for attributing attack effectiveness and defender gains to AdvGRPO. No ablation isolating these two components (e.g., training curves with/without each, advantage variance statistics, or divergence rates under the alternation schedule) is reported, leaving open whether stability arises instead from the curriculum, reward shaping details, or alternation itself.
[§5.3] §5.3 (defender evaluation): The reported outperformance on safety benchmarks is presented without controls for the attacker's strength at each training stage or for the number of alternation rounds; this weakens the causal link between the co-training procedure and the defender gains.

minor comments (2)

[§4.1] Notation for the multi-channel reward components and the decoupled normalization formula should be introduced with explicit equations rather than prose descriptions only.
[Introduction] The abstract and introduction cite prior GRPO instability but do not provide a reference or brief summary of the exact failure mode observed in those works.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the recognition of the potential impact of AdvGRPO for adaptive red teaming. We address each major comment below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [§4 and §5] §4 (AdvGRPO description) and §5 (experiments): The claim that dense multi-channel rewards plus decoupled advantage normalization suffice to stabilize GRPO (contrasting with prior reports of instability) is load-bearing for attributing attack effectiveness and defender gains to AdvGRPO. No ablation isolating these two components (e.g., training curves with/without each, advantage variance statistics, or divergence rates under the alternation schedule) is reported, leaving open whether stability arises instead from the curriculum, reward shaping details, or alternation itself.

Authors: We agree that the manuscript would benefit from explicit ablations isolating the contributions of dense multi-channel rewards and decoupled advantage normalization. While these components were introduced specifically to address reported GRPO instability in co-training settings (as contrasted with prior PPO/DPO work), and the full framework enables stable curriculum progression and alternation, we did not report separate variants ablating each. In the revised manuscript we will add an ablation study with training curves, advantage variance statistics, and divergence rates comparing the full AdvGRPO against versions omitting each component individually. This will clarify their specific role relative to the curriculum and alternation schedule. revision: yes
Referee: [§5.3] §5.3 (defender evaluation): The reported outperformance on safety benchmarks is presented without controls for the attacker's strength at each training stage or for the number of alternation rounds; this weakens the causal link between the co-training procedure and the defender gains.

Authors: We acknowledge that the defender results are reported after the complete co-training procedure without intermediate controls on attacker strength or explicit variation in the number of alternation rounds. The gains are demonstrated relative to baseline defenders trained without the proposed attacker co-training. To strengthen the causal attribution, the revised manuscript will include additional analysis and experiments evaluating defender performance at intermediate training stages (with corresponding attacker strength) and under different alternation schedules. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical outcomes

full rationale

The paper introduces AdvGRPO via two modifications (dense multi-channel rewards, decoupled advantage normalization) and reports empirical results on attack effectiveness and defender performance after curriculum-based training. No equations, definitions, or self-citations reduce the central claims to inputs by construction. The derivation chain consists of standard RL training steps whose outputs are measured against external benchmarks rather than being tautological with the method's own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, new entities, or non-standard axioms are detailed beyond standard RL assumptions.

axioms (1)

domain assumption GRPO can be stabilized for co-training via dense multi-channel rewards and decoupled advantage normalization.
Central premise invoked to make GRPO viable in the attacker-defender setting.

pith-pipeline@v0.9.1-grok · 5666 in / 1058 out tokens · 18737 ms · 2026-06-27T16:26:37.590705+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 2 canonical work pages

[1]

2603.13026 , archivePrefix=

Chenlong Yin and Runpeng Geng and Yanting Wang and Jinyuan Jia , year=. 2603.13026 , archivePrefix=

arXiv
[2]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024
[3]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z
[4]

2026 , eprint=

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt , author=. 2026 , eprint=

2026
[5]

2025 , eprint=

Safety Alignment of LMs via Non-cooperative Games , author=. 2025 , eprint=

2025
[6]

2023 , eprint=

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations , author=. 2023 , eprint=

2023
[7]

2025 , eprint=

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models , author=. 2025 , eprint=

2025
[8]

Feng, Mingqian and Liu, Xiaodong and Yang, Weiwei and Song, Jialin and Zhu, Xuekai and Xu, Chenliang and Gao, Jianfeng , booktitle=
[9]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv
[10]

arXiv preprint arXiv:2305.18290 , year=

Direct preference optimization: Your language model is secretly a reward model , author=. arXiv preprint arXiv:2305.18290 , year=

Pith/arXiv arXiv
[11]

arXiv preprint arXiv:2209.07858 , year=

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned , author=. arXiv preprint arXiv:2209.07858 , year=

Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2307.15043 , year=

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv
[13]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , journal=
[14]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , pages=

R. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , pages=

2024
[15]

2024 , eprint=

PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems , author=. 2024 , eprint=

2024
[16]

Great, now write an article about that: The crescendo multi-turn

Russinovich, Mark and Salem, Ahmed and Eldan, Ronen , journal=. Great, now write an article about that: The crescendo multi-turn
[17]

arXiv preprint arXiv:2202.03286 , year=

Red teaming language models with language models , author=. arXiv preprint arXiv:2202.03286 , year=

Pith/arXiv arXiv
[18]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021
[19]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[20]

Advances in Neural Information Processing Systems , volume=

Generative adversarial nets , author=. Advances in Neural Information Processing Systems , volume=
[21]

2026 , eprint=

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization , author=. 2026 , eprint=

2026
[22]

and Singer, Yaron and Karbasi, Amin , journal=

Mehrotra, Anay and Zampetakis, Manolis and Kassianik, Paul and Nelson, Blaine and Anderson, Hyrum S. and Singer, Yaron and Karbasi, Amin , journal=. Tree of attacks: Jailbreaking black-box
[23]

arXiv preprint arXiv:2310.08419 , year=

Jailbreaking black box large language models in twenty queries , author=. arXiv preprint arXiv:2310.08419 , year=

Pith/arXiv arXiv
[24]

2024 , eprint=

Qwen2.5 Technical Report , author=. 2024 , eprint=

2024
[25]

2024 , eprint=

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author=. 2024 , eprint=

2024
[26]

2024 , eprint=

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs , author=. 2024 , eprint=

2024
[27]

Do Anything Now

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models , author=. 2023 , eprint=

2023
[28]

2024 , eprint=

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators , author=. 2024 , eprint=

2024
[29]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021
[30]

2022 , eprint=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2022 , eprint=

2022
[31]

2023 , eprint=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

2023
[32]

2024 , eprint=

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning , author=. 2024 , eprint=

2024
[33]

2025 , eprint=

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection , author=. 2025 , eprint=

2025
[34]

2026 , eprint=

Learning to Inject: Automated Prompt Injection via Reinforcement Learning , author=. 2026 , eprint=

2026
[35]

Refusal in language models is mediated by a single direction

Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel , booktitle =. Refusal in Language Models Is Mediated by a Single Direction , url =. doi:10.52202/079017-4322 , editor =

work page doi:10.52202/079017-4322
[36]

2025 , eprint=

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections , author=. 2025 , eprint=

2025
[37]

Break-Fix

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle , author=. 2024 , eprint=

2024
[38]

2024 , eprint=

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts , author=. 2024 , eprint=

2024
[39]

2024 , eprint=

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models , author=. 2024 , eprint=

2024
[40]

2025 , eprint=

Generalizing Verifiable Instruction Following , author=. 2025 , eprint=

2025
[41]

2018 , eprint=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. 2018 , eprint=

2018
[42]

2025 , eprint=

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning , author=. 2025 , eprint=

2025
[43]

2024 , eprint=

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester , author=. 2024 , eprint=

2024
[44]

2026 , eprint=

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards , author=. 2026 , eprint=

2026

[1] [1]

2603.13026 , archivePrefix=

Chenlong Yin and Runpeng Geng and Yanting Wang and Jinyuan Jia , year=. 2603.13026 , archivePrefix=

arXiv

[2] [2]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[3] [3]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z

[4] [4]

2026 , eprint=

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt , author=. 2026 , eprint=

2026

[5] [5]

2025 , eprint=

Safety Alignment of LMs via Non-cooperative Games , author=. 2025 , eprint=

2025

[6] [6]

2023 , eprint=

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations , author=. 2023 , eprint=

2023

[7] [7]

2025 , eprint=

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models , author=. 2025 , eprint=

2025

[8] [8]

Feng, Mingqian and Liu, Xiaodong and Yang, Weiwei and Song, Jialin and Zhu, Xuekai and Xu, Chenliang and Gao, Jianfeng , booktitle=

[9] [9]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv

[10] [10]

arXiv preprint arXiv:2305.18290 , year=

Direct preference optimization: Your language model is secretly a reward model , author=. arXiv preprint arXiv:2305.18290 , year=

Pith/arXiv arXiv

[11] [11]

arXiv preprint arXiv:2209.07858 , year=

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned , author=. arXiv preprint arXiv:2209.07858 , year=

Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2307.15043 , year=

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv

[13] [13]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , journal=

[14] [14]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , pages=

R. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , pages=

2024

[15] [15]

2024 , eprint=

PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems , author=. 2024 , eprint=

2024

[16] [16]

Great, now write an article about that: The crescendo multi-turn

Russinovich, Mark and Salem, Ahmed and Eldan, Ronen , journal=. Great, now write an article about that: The crescendo multi-turn

[17] [17]

arXiv preprint arXiv:2202.03286 , year=

Red teaming language models with language models , author=. arXiv preprint arXiv:2202.03286 , year=

Pith/arXiv arXiv

[18] [18]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021

[19] [19]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[20] [20]

Advances in Neural Information Processing Systems , volume=

Generative adversarial nets , author=. Advances in Neural Information Processing Systems , volume=

[21] [21]

2026 , eprint=

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization , author=. 2026 , eprint=

2026

[22] [22]

and Singer, Yaron and Karbasi, Amin , journal=

Mehrotra, Anay and Zampetakis, Manolis and Kassianik, Paul and Nelson, Blaine and Anderson, Hyrum S. and Singer, Yaron and Karbasi, Amin , journal=. Tree of attacks: Jailbreaking black-box

[23] [23]

arXiv preprint arXiv:2310.08419 , year=

Jailbreaking black box large language models in twenty queries , author=. arXiv preprint arXiv:2310.08419 , year=

Pith/arXiv arXiv

[24] [24]

2024 , eprint=

Qwen2.5 Technical Report , author=. 2024 , eprint=

2024

[25] [25]

2024 , eprint=

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author=. 2024 , eprint=

2024

[26] [26]

2024 , eprint=

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs , author=. 2024 , eprint=

2024

[27] [27]

Do Anything Now

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models , author=. 2023 , eprint=

2023

[28] [28]

2024 , eprint=

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators , author=. 2024 , eprint=

2024

[29] [29]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021

[30] [30]

2022 , eprint=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2022 , eprint=

2022

[31] [31]

2023 , eprint=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

2023

[32] [32]

2024 , eprint=

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning , author=. 2024 , eprint=

2024

[33] [33]

2025 , eprint=

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection , author=. 2025 , eprint=

2025

[34] [34]

2026 , eprint=

Learning to Inject: Automated Prompt Injection via Reinforcement Learning , author=. 2026 , eprint=

2026

[35] [35]

Refusal in language models is mediated by a single direction

Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel , booktitle =. Refusal in Language Models Is Mediated by a Single Direction , url =. doi:10.52202/079017-4322 , editor =

work page doi:10.52202/079017-4322

[36] [36]

2025 , eprint=

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections , author=. 2025 , eprint=

2025

[37] [37]

Break-Fix

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle , author=. 2024 , eprint=

2024

[38] [38]

2024 , eprint=

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts , author=. 2024 , eprint=

2024

[39] [39]

2024 , eprint=

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models , author=. 2024 , eprint=

2024

[40] [40]

2025 , eprint=

Generalizing Verifiable Instruction Following , author=. 2025 , eprint=

2025

[41] [41]

2018 , eprint=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. 2018 , eprint=

2018

[42] [42]

2025 , eprint=

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning , author=. 2025 , eprint=

2025

[43] [43]

2024 , eprint=

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester , author=. 2024 , eprint=

2024

[44] [44]

2026 , eprint=

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards , author=. 2026 , eprint=

2026