arxiv: 2606.03647 · v1 · pith:MWHJ4A6Mnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI · cs.LG

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

Pith reviewed 2026-06-28 09:19 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.LG
keywords adversarial attacks, LLM jailbreaking, black-box attacks, diffusion models, preference optimization, robustness evaluation, transferable attacks

The pith

Indirect Harm Optimization (IHO) uses iterative preference optimization to train a masked diffusion attacker that needs only black-box access, transfers across behaviors and models, and raises success rates against layered LLM defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create a reliable, standardized attack for evaluating LLM jailbreak robustness, similar to AutoAttack for image classifiers. Existing methods fail to be simultaneously black-box, efficient, adaptive to specific defenses, and transferable to new targets without retraining. IHO meets all these requirements by training a masked diffusion language model against a harmfulness judge, allowing it to function either as a per-behavior adaptive attack or as an amortized policy. If correct, this would make robustness claims for defended LLMs more comparable and harder to inflate through weak evaluation attacks.

Core claim

IHO is a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge that requires only black-box access to the target; the same trained model serves without modification as a strong adaptive attack on individual behaviors or as an efficient amortized policy that transfers to held-out behaviors and unseen target models, and it raises attack success rates considerably over prior methods even on layered defenses such as a Circuit Breaker model plus an auxiliary detector.

What carries the argument

Indirect Harm Optimization (IHO): a masked diffusion language model trained with iterative preference optimization against a harmfulness judge to produce transferable jailbreak prompts from black-box access only.
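
As a rough illustration of how such a loop could turn judge scores into a training signal, the Python sketch below estimates the expected harm of a candidate prompt by Monte Carlo sampling and then pairs high-scoring with low-scoring candidates. It assumes the expected harm is the mean judge score h(y) over responses sampled from the target, which is how the truncated definition in the Figure 2 caption appears to continue. The names `expected_harm` and `build_preference_pairs`, the best-versus-worst pairing rule, and the `margin` parameter are hypothetical; this summary does not say how IHO forms its preference pairs or which preference loss it uses. The toy target and judge are deterministic stand-ins, not real models.

```python
from typing import Callable, List, Tuple

Target = Callable[[str], str]   # black-box target: prompt -> sampled response
Judge = Callable[[str], float]  # harmfulness judge: response -> score in [0, 1]


def expected_harm(prompt: str, target: Target, judge: Judge, n_samples: int = 8) -> float:
    """Monte Carlo estimate of the mean judge score over sampled responses."""
    scores = [judge(target(prompt)) for _ in range(n_samples)]
    return sum(scores) / n_samples


def build_preference_pairs(
    candidates: List[str],
    target: Target,
    judge: Judge,
    n_samples: int = 8,
    margin: float = 0.1,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) prompt pairs for one preference-optimization cycle.

    Hypothetical pairing rule: rank candidates by estimated harm, match rank k
    from the top with rank k from the bottom, and keep pairs whose score gap is
    at least `margin`. Only black-box calls to `target` are made.
    """
    scores = {p: expected_harm(p, target, judge, n_samples) for p in candidates}
    ranked = sorted(candidates, key=lambda p: scores[p], reverse=True)
    pairs: List[Tuple[str, str]] = []
    top, bottom = 0, len(ranked) - 1
    while top < bottom:
        chosen, rejected = ranked[top], ranked[bottom]
        if scores[chosen] - scores[rejected] >= margin:
            pairs.append((chosen, rejected))
        top += 1
        bottom -= 1
    return pairs


# Toy stand-ins (assumptions): the "target" complies only with one prompt variant.
def toy_target(prompt: str) -> str:
    return "COMPLY" if "variant-a" in prompt else "REFUSE"


def toy_judge(response: str) -> float:
    return 1.0 if response == "COMPLY" else 0.0


pairs = build_preference_pairs(["variant-a", "variant-b"], toy_target, toy_judge)
```

In an iterative setup, pairs like these would feed a preference loss that updates the attacker, and the updated attacker would propose the next round of candidates.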

If this is right

  • The same IHO model works as an adaptive attack on specific behaviors or as an amortized policy transferring to new behaviors and models without fine-tuning.
  • IHO raises attack success rates over prior methods on layered defenses without any defense-specific changes.
  • IHO requires only black-box access, making it usable on closed models where gradient access is unavailable.
  • Results position IHO as a candidate for standardized jailbreak evaluation baselines that would improve reliability of defense comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defense papers may need to test against amortized attackers that generalize across targets rather than only per-defense tuned attacks.
  • The diffusion-based generation in IHO might expose vulnerabilities in current safety training that token-level or gradient-based attacks miss.
  • If IHO generalizes well, similar preference-optimization loops could be applied to create standardized attackers for other safety properties beyond jailbreaks.
  • Widespread adoption of IHO-style evaluation could shift focus from per-defense tuning to building defenses robust to black-box transferable policies.

Load-bearing premise

The harmfulness judge used during iterative preference optimization correctly labels outputs as harmful without systematic biases or errors that would misguide training or invalidate measured success rates.
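
One way to probe this premise is to compare the judge's binarized verdicts with human labels on a sample of responses. The sketch below computes Cohen's kappa in plain Python. The human labels, the 0.5 binarization threshold, and the function names are assumptions made for illustration; this summary does not report such a check.

```python
from typing import List, Sequence


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn judge scores in [0, 1] into 0/1 verdicts (threshold is an assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    rate_a = sum(a) / n  # fraction labeled harmful by rater a
    rate_b = sum(b) / n  # fraction labeled harmful by rater b
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both raters gave the same constant label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up example data, not results from the paper.
judge_scores = [0.9, 0.7, 0.2, 0.1, 0.8, 0.3]
human_labels = [1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(binarize(judge_scores), human_labels)
```

A kappa near zero on attacker-generated outputs would suggest that the judge's verdicts are not a trustworthy training or evaluation signal for those outputs.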

What would settle it

Replicating the experiments on the Circuit Breaker plus detector defense and finding that IHO attack success rates do not exceed those of the prior state-of-the-art methods would falsify the claim of consistent improvement without defense-specific adaptation.
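
Such a replication reduces to computing the attack success rate (ASR) of each method at a fixed judge threshold, as in the ASR(·, 0.5) and ASR(·, 0.8) figures. A minimal sketch follows, assuming a behavior counts as broken when the best judge score over its attempts reaches the threshold. That success rule, the function name `asr`, and all numbers are illustrative assumptions, not definitions or results from the paper.

```python
from typing import Dict, List


def asr(scores_per_behavior: Dict[str, List[float]], threshold: float) -> float:
    """Fraction of behaviors whose best judge score is at least `threshold`."""
    if not scores_per_behavior:
        raise ValueError("need at least one behavior")
    hits = sum(
        1
        for scores in scores_per_behavior.values()
        if scores and max(scores) >= threshold
    )
    return hits / len(scores_per_behavior)


# Made-up judge scores per behavior for two hypothetical attack methods.
method_a = {"b1": [0.2, 0.9], "b2": [0.4, 0.6], "b3": [0.1, 0.3], "b4": [0.85]}
method_b = {"b1": [0.1, 0.55], "b2": [0.2], "b3": [0.0], "b4": [0.3, 0.4]}

# The claim under test would fail if the new method does not beat the baseline.
a_beats_b = asr(method_a, 0.5) > asr(method_b, 0.5)
```

Reporting the same comparison at more than one threshold, and under a judge that was not used during training, would make the test harder to pass by accident.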

Figures

Figures reproduced from arXiv: 2606.03647 by David Lüdke, Jonas Dornbusch, Leo Schwinn, Stephan Günnemann, Vincent Limbach.

Figure 1: Conceptual comparison of automated jail
Figure 2: Illustration of the IHO framework. Let us now more formally introduce the precise methodology. Let $M$ denote a target language model that maps a prompt $\tilde{p}$ to a distribution over responses $P_M(y \mid \tilde{p})$. Given a judge $h : Y \to [0, 1]$ that assigns a scalar harmfulness score to a response, we define the expected harm of a prompt $\tilde{p}$ as $H(\tilde{p}) = \mathbb{E}_{y \sim P_M(\cdot \mid \tilde{p})}$ …
Figure 3: Cross-model transfer on held-out behaviors in the companion EVUS view, computed with
Figure 4: (a) Amortized FLOPs per training behavior for IHO on Qwen-2.5-7B, broken down into
Figure 5: (a) Distribution of judge scores assigned to attacker samples at the start of each cycle for
Figure 6: EVUS[128] under StrongREJECT across attack sizes and defender models
Figure 7: EVUS[128] per defender model under StrongREJECT across denoising step counts, for
Figure 8: EVUS[128] under StrongREJECT across defender models for varying sample sizes, with
Figure 9: Paired EVUS[128] scores for threshold transition
Figure 10: EVUS[128] under StrongREJECT as a function of quality threshold (left) and percent
Figure 11: Average token-id diversity measured by root TTR (left) and EVUS[128] (right) across a
Figure 12: Change in mean judge score according to StrongREJECT between epochs, averaged
Figure 13: EVUS[64] by model and hyperparameter variant for cycles 1 and 2. The
Figure 14: Prompt vs. output perplexity (log–log) at initialization, colored by judge score.
Figure 15: Cross-model transfer on train behaviors, reported as ASR(·, 0.5) under StrongREJECT. Axis labels: Qwen-2.5-32B, Qwen-2.5-7B, Qwen-2.5-7B+D, LLaMA-3-8B, CB, CB+D, LAT, CAT; cell values omitted.
Figure 16: Cross-model transfer on held-out behaviors, reported as ASR(·, 0.5) under StrongREJECT.
Figure 17: Cross-model transfer on train behaviors, reported as ASR(·, 0.8) under StrongREJECT. Axis labels: Qwen-2.5-32B, Qwen-2.5-7B, Qwen-2.5-7B+D, LLaMA-3-8B, CB, CB+D, LAT, CAT; cell values omitted.
Figure 18: Cross-model transfer on held-out behaviors, reported as ASR(·, 0.8) under StrongREJECT.
Figure 19: Cross-model transfer on train behaviors, reported in EVUS under StrongREJECT. EVUS uses attack-specific query budgets (N varies by attack).
Original abstract

Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge. It requires only black-box access to the target LLM and can be deployed either as an adaptive attack on individual behaviors or as an amortized policy that transfers to held-out behaviors and unseen models without fine-tuning. The central claim is that IHO considerably improves attack success rates over state-of-the-art methods even against layered defenses such as a Circuit Breaker-trained model plus an auxiliary detector, without any defense-specific adaptation, and positions IHO as a step toward standardized, reliable jailbreak evaluation akin to AutoAttack for image classifiers.

Significance. If the empirical claims hold under rigorous verification, IHO would address a genuine gap in LLM robustness evaluation by providing a black-box, efficient, and transferable attack that does not require defense-specific tuning. The public release of code and models on GitHub and Hugging Face strengthens reproducibility and enables independent verification, which is a positive contribution to the field.

major comments (2)
  1. [Abstract] The reported gains in attack success against layered defenses (Circuit Breaker + detector) are measured using the same class of harmfulness judge employed during iterative preference optimization. This setup risks circularity: the attacker may be optimized to produce outputs that the judge misclassifies as harmful rather than genuinely harmful content, which would inflate both the optimization objective and the final success metrics without independent validation of judge accuracy.
  2. [Abstract] No experimental details, metrics, baselines, controls, or ablation results are described to support the claims of considerable improvement, transferability to held-out behaviors, and applicability without defense-specific adaptation. Without these, it is impossible to assess whether the central empirical result is load-bearing or an artifact of the evaluation protocol.
minor comments (2)
  1. The title is unusually long and contains redundant adjectives; consider shortening for clarity while retaining the core contribution.
  2. [Abstract] The abstract references 'state-of-the-art approaches' without naming them; explicit citation of the compared methods would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [Abstract] The reported gains in attack success against layered defenses (Circuit Breaker + detector) are measured using the same class of harmfulness judge employed during iterative preference optimization. This setup risks circularity: the attacker may be optimized to produce outputs that the judge misclassifies as harmful rather than genuinely harmful content, which would inflate both the optimization objective and the final success metrics without independent validation of judge accuracy.

    Authors: We acknowledge the validity of this concern regarding potential circularity. The use of the same judge class for both training and evaluation is a common practice in automated LLM safety evaluations but does carry the risk noted. In the revised version, we will add results using an independent judge model not involved in optimization, along with a human evaluation on a random subset of successful attacks, to provide external validation of the reported gains against layered defenses. revision: yes

  2. Referee: [Abstract] No experimental details, metrics, baselines, controls, or ablation results are described to support the claims of considerable improvement, transferability to held-out behaviors, and applicability without defense-specific adaptation. Without these, it is impossible to assess whether the central empirical result is load-bearing or an artifact of the evaluation protocol.

    Authors: Abstracts are by design concise overviews and do not contain full experimental protocols. The manuscript provides these details in Section 4 (Experiments), including attack success rate metrics, comparisons to SOTA baselines such as GCG and PAIR, controls for transferability across held-out behaviors and models, and ablations on the iterative preference optimization procedure. To address the comment, we will revise the abstract to include one or two key quantitative results (e.g., relative ASR improvements) while respecting length limits, and ensure the experimental section explicitly cross-references the abstract claims. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes IHO, a masked diffusion attacker trained via iterative preference optimization against an external harmfulness judge, and reports empirical improvements in attack success (measured by the same judge) over SOTA on black-box, adaptive, and transferable settings including layered defenses. This does not match any enumerated circularity pattern: there are no self-definitional equations, no fitted parameters renamed as independent predictions, no load-bearing self-citations, and no imported uniqueness theorems or ansatzes. The judge is treated as an independent evaluation component rather than a quantity defined by the method itself, and the central claims rest on empirical comparisons rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no details on free parameters, axioms, or invented entities are provided. Full text required for ledger construction.
