pith. machine review for the scientific record.

arxiv: 2604.12817 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.CR · stat.ML

Recognition: unknown

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · stat.ML
keywords continuous adversarial training · jailbreak robustness · in-context learning · linear transformers · generalization bounds · embedding perturbations · LLM defense

The pith

Adversarial training in the embedding space produces a provable defense for LLMs against jailbreak prompts in token space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses in-context learning (ICL) theory to analyze continuous adversarial training (CAT) for LLMs. For linear transformers trained on in-context linear regression, it proves a robust generalization bound that decreases as the embedding perturbation radius grows, explaining CAT's defense against token-space jailbreaks. The bound also depends on the singular values of the embedding matrix, which motivates a proposed regularization term in the CAT objective to improve the robustness-utility tradeoff, validated on real LLMs.
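Rendered schematically (Pith's sketch, not the paper's exact theorem), the claimed shape of the bound is below: the robust risk of the converged model is controlled by a function of the embedding matrix's singular values times a factor that shrinks as the perturbation radius grows.

```latex
% Schematic only: the paper's theorem supplies the precise constants and exponents.
% \rho            : embedding-space perturbation radius used during CAT
% \sigma_i(W_E)   : singular values of the LLM's embedding matrix W_E
\mathcal{R}^{\mathrm{adv}}_{\rho}(\theta^{\ast})
  \;\le\;
  g\big(\sigma_1(W_E), \dots, \sigma_d(W_E)\big)\cdot h(\rho),
  \qquad h \ \text{monotonically decreasing in } \rho .
```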

Core claim

For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, the robust generalization bound has a negative correlation with the perturbation radius in the embedding space, explaining CAT's defense against jailbreak prompts from the token space. The bound's relation to singular values of the embedding matrix motivates adding a regularization term to improve performance on real LLMs.

What carries the argument

The robust generalization bound for adversarially trained linear transformers on in-context linear regression, which is negatively correlated with embedding perturbation radius and depends on embedding matrix singular values.

If this is right

  • Larger embedding perturbation radii yield tighter robust generalization bounds, i.e., provably lower robust risk.
  • Robustness is controlled by the singular values of the embedding matrix.
  • A singular-value-dependent regularization term improves the robustness-utility tradeoff in CAT (see the sketch after this list).
  • Empirical results on LLMs confirm the enhanced defense without major utility loss.
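The visible text fixes only that the regularizer depends on the singular values of the LLM's embedding matrix, not its functional form. A minimal PyTorch sketch, assuming one plausible instantiation that penalizes the spread of those singular values; `cat_loss`, `lam`, and the penalty itself are illustrative choices, not the paper's:

```python
import torch

def singular_value_penalty(embedding_weight: torch.Tensor) -> torch.Tensor:
    """Assumed form: penalize the spread (std) of the embedding matrix's
    singular values. The paper states only that its regularizer depends on
    these singular values; this particular penalty is illustrative."""
    sigmas = torch.linalg.svdvals(embedding_weight)  # differentiable in PyTorch
    return sigmas.std()

def regularized_cat_loss(cat_loss: torch.Tensor,
                         embedding_weight: torch.Tensor,
                         lam: float = 0.1) -> torch.Tensor:
    """CAT objective plus the singular-value term; `lam` is an assumed weight."""
    return cat_loss + lam * singular_value_penalty(embedding_weight)
```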

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bounds might apply to other ICL tasks or model architectures beyond linear regression.
  • The approach suggests embedding matrix design as a path to inherent robustness.
  • Testing the regularization on different attack types could extend its utility.
  • It raises the question of optimal perturbation radii for practical CAT.

Load-bearing premise

Results proven for linear transformers on in-context linear regression tasks extend to real LLMs, with embedding perturbations corresponding to robustness against discrete token jailbreaks.

What would settle it

An experiment showing no improvement in jailbreak defense when the embedding perturbation radius is increased during training, or when the singular-value regularization is added, would contradict the central claims.

Figures

Figures reproduced from arXiv: 2604.12817 by Di Wang, Shaopeng Fu.

Figure 1: Evolution of singular values of the embedding matrix of LLMs along AT. view at source ↗
Figure 2: Utility (measured by LC-WinRate) and jailbreak robustness (measured by ASR) on … view at source ↗
Figure 3: Utility (measured by LC-WinRate) and jailbreak robustness (measured by ASR) on … view at source ↗
read the original abstract

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
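For orientation, the move that defines CAT is that the inner adversarial search runs over continuous embeddings rather than discrete tokens. A minimal sketch of a generic PGD-style inner loop follows; the paper's variant may differ in norm, projection, and loss, and `model` is assumed to accept `inputs_embeds`, as Hugging Face causal LMs do:

```python
import torch

def embedding_space_attack(model, inputs_embeds, labels,
                           radius=0.05, steps=10, step_size=0.01):
    """Generic continuous-AT inner loop: projected gradient ascent on embeddings.

    A sketch only, not the paper's exact algorithm; hyperparameter defaults
    are arbitrary illustrative values.
    """
    delta = torch.zeros_like(inputs_embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=inputs_embeds + delta, labels=labels).loss
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # ascend the next-token loss
            delta.clamp_(-radius, radius)      # project back into the L-inf ball
    return (inputs_embeds + delta).detach()    # adversarial embeddings for the AT step
```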

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims to deliver the first theoretical analysis of continuous adversarial training (CAT) for LLMs by applying in-context learning (ICL) theory. For linear transformers trained on in-context linear regression tasks using adversarial examples in the embedding space, it proves a robust generalization bound that exhibits a negative correlation with the embedding perturbation radius. This bound is asserted to explain CAT's ability to defend against jailbreak prompts generated in the discrete token space of LLMs. The analysis further links robustness to the singular values of the embedding matrix, motivating a new singular-value-dependent regularizer added to the CAT objective. Experiments on real-world LLMs are reported to show an improved robustness-utility tradeoff, with code released.

Significance. If the connection from the linear ICL model to practical LLMs is substantiated, the work would supply a mechanistic account for why embedding-space perturbations during training confer robustness to token-space attacks, plus a concrete, bound-motivated improvement to CAT via singular-value regularization. The public code release is a clear strength supporting reproducibility. The significance is currently limited by the absence of a formal reduction or approximation result bridging the simplified setting to full LLMs, so the explanatory claim for real models remains an extrapolation whose load-bearing steps are not yet visible.

major comments (3)
  1. [Abstract and §1] The assertion that the proven bound 'clearly explains why CAT can defend against jailbreak prompts from the LLM's token space' is not accompanied by any reduction, approximation argument, or formal correspondence showing how continuous embedding-space perturbations in linear regression transfer to discrete token sequences in nonlinear transformers; this link is load-bearing for the central explanatory claim.
  2. [§3, ICL theory and robust bound] The robust generalization bound is derived exclusively for linear transformers on in-context linear regression tasks; the manuscript provides no analysis of how well the ICL assumptions (linearity, regression task) approximate the attention-based, next-token prediction behavior of real LLMs, leaving the applicability to the target setting unestablished.
  3. [§4, singular-value regularizer; §5, experiments] The regularizer is motivated by the bound's dependence on singular values, yet the experiments report only aggregate robustness-utility curves without an ablation or measurement confirming that the predicted negative correlation between perturbation radius and robustness holds (or is improved) under the regularizer in the LLM setting.
minor comments (3)
  1. [§3, notation] The embedding matrix whose singular values appear in the bound and regularizer should be explicitly related to the LLM's token embedding layer to avoid ambiguity when readers attempt to implement the regularizer (see the sketch after this list).
  2. [§5, figure clarity] The tradeoff plots would benefit from error bars across multiple random seeds or statistical significance markers to allow readers to assess whether the reported improvements are reliable.
  3. [§2, related work] A brief comparison to prior applications of ICL theory to robustness or adversarial training would help situate the contribution and clarify what is novel versus what follows from existing linear-transformer analyses.
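On minor comment 1, the ambiguity has a direct operational reading: in Hugging Face transformers the matrix in question would be the token embedding layer's weight, whose singular values are one call away. A sketch under that assumption (any causal LM works; Llama-2-7B is an illustrative choice, not a claim about the paper's exact checkpoints):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

W_E = model.get_input_embeddings().weight            # shape: (vocab_size, hidden_dim)
sigmas = torch.linalg.svdvals(W_E.detach().float())  # singular values, descending

print(f"max={sigmas.max().item():.2f}  min={sigmas.min().item():.2f}  "
      f"std={sigmas.std().item():.2f}")
```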

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We agree that the connection between the linear ICL model and practical LLMs requires more careful framing and additional validation. We will revise the manuscript to tone down explanatory claims, add discussion of modeling assumptions, and include targeted ablations in the experiments. These changes will be incorporated in the next version.

read point-by-point responses
  1. Referee: [Abstract and §1] The assertion that the proven bound 'clearly explains why CAT can defend against jailbreak prompts from the LLM's token space' is not accompanied by any reduction, approximation argument, or formal correspondence showing how continuous embedding-space perturbations in linear regression transfer to discrete token sequences in nonlinear transformers; this link is load-bearing for the central explanatory claim.

    Authors: We acknowledge that no formal reduction is provided. The linear ICL setting serves as a tractable proxy that isolates the effect of embedding-space perturbations on generalization, consistent with prior ICL theory. We will revise the abstract and §1 to replace 'clearly explains' with 'provides a theoretical basis for understanding' and add a limitations paragraph explicitly noting the extrapolation to discrete token-space attacks in nonlinear models. revision: yes

  2. Referee: [§3, ICL theory and robust bound] The robust generalization bound is derived exclusively for linear transformers on in-context linear regression tasks; the manuscript provides no analysis of how well the ICL assumptions (linearity, regression task) approximate the attention-based, next-token prediction behavior of real LLMs, leaving the applicability to the target setting unestablished.

    Authors: The linear transformer and regression task are chosen to permit closed-form analysis of the robust bound, following the standard approach in ICL theory papers. We will expand §3 with a new subsection that (i) justifies the linearity assumption via known approximations of softmax attention by linear attention in certain regimes and (ii) discusses how in-context regression captures key aspects of next-token prediction under embedding perturbations. Relevant citations to approximation results will be added. revision: yes

  3. Referee: [§4, singular-value regularizer; §5, experiments] The regularizer is motivated by the bound's dependence on singular values, yet the experiments report only aggregate robustness-utility curves without an ablation or measurement confirming that the predicted negative correlation between perturbation radius and robustness holds (or is improved) under the regularizer in the LLM setting.

    Authors: We will add an ablation in §5 that varies the embedding perturbation radius during CAT (with and without the singular-value regularizer) and plots the resulting jailbreak robustness metrics. We will also report the correlation between the singular values of the embedding matrix and observed robustness to directly test the bound's prediction in the LLM experiments. revision: yes
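The committed ablation is straightforward to express as a harness. In this sketch, `train_cat` and `measure_asr` are hypothetical placeholders for the paper's training and jailbreak-evaluation routines, and the radius values are arbitrary illustrative defaults:

```python
# Hypothetical ablation harness: vary the embedding perturbation radius with and
# without the singular-value regularizer, recording attack-success rate (ASR).

def train_cat(base_model: str, radius: float, sv_regularizer: bool):
    """Placeholder for CAT training at a given embedding radius (the paper's §5 setup)."""
    raise NotImplementedError

def measure_asr(model) -> float:
    """Placeholder for a fixed jailbreak suite; lower ASR means more robust."""
    raise NotImplementedError

def radius_ablation(base_model: str, radii=(0.01, 0.025, 0.05, 0.1)):
    # The bound predicts ASR falls as the radius grows; the regularizer should
    # shift the whole curve down at comparable utility (LC-WinRate).
    return {(reg, rho): measure_asr(train_cat(base_model, rho, reg))
            for reg in (False, True) for rho in radii}
```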

standing simulated objections not resolved
  • A complete formal reduction or approximation theorem that rigorously bridges the linear ICL model to nonlinear transformers and discrete token-space jailbreaks, which lies outside the scope of the current work.

Circularity Check

0 steps flagged

No circularity: the bound is derived independently from standard ICL theory and is used only to motivate the regularizer.

full rationale

The paper first proves a robust generalization bound for linear transformers on in-context linear regression tasks, showing negative correlation with embedding perturbation radius, using standard ICL analysis that does not depend on real-LLM behavior or the proposed regularizer. This bound is then invoked to motivate adding a singular-value regularization term to the CAT objective. No step reduces the central claim to a fitted parameter, self-citation, or definition of the target LLM result; the linear-model derivation is self-contained and the extension to discrete-token jailbreaks in LLMs is presented as explanatory motivation rather than a formal equivalence. Experiments on real LLMs are separate empirical validation and do not enter the proof.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two modeling assumptions: that linear transformers on in-context linear regression capture the relevant behavior of LLMs under CAT, and that embedding-space perturbations during training translate into robustness against discrete token-space jailbreaks. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Linear transformers trained on in-context linear regression tasks serve as a valid proxy for analyzing CAT in real LLMs.
    The proof is performed exclusively on this simplified model; the abstract presents it as explanatory for LLMs.
  • domain assumption: Adversarial perturbations in the continuous embedding space during training produce robustness against jailbreak prompts in the discrete token space.
    This correspondence is asserted as the key explanatory link but is not derived from first principles in the visible text.

pith-pipeline@v0.9.0 · 5568 in / 1557 out tokens · 38940 ms · 2026-05-10T15:51:46.535761+00:00 · methodology

discussion (0)

