pith. machine review for the scientific record.

arxiv: 2604.11309 · v1 · submitted 2026-04-13 · 💻 cs.CR · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: unknown

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Dongxian Wu, Haolin Wu, Jiangrong Wu, Jun Sun, Kai Wang, Meng Sun, Xun Chen, Yihao Zhang, Yuxuan Zhou, Zeming Wei

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.CV · cs.LG
keywords LLM jailbreaking · multi-turn attacks · cumulative risk · Salami Attack · AI alignment · prompt chaining · model security · defense strategies

The pith

Chaining many individually low-risk inputs can accumulate enough harmful intent to jailbreak LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models remain open to jailbreaking when attackers use multi-turn conversations that build harmful intent step by step. Each prompt stays below the threshold that triggers safety filters, yet the sequence as a whole steers the model toward unsafe outputs. The paper presents Salami Slicing Risk as the underlying pattern and turns it into an automatic attack framework that requires no explicit harmful trigger and little model-specific tuning. Experiments report success rates above 90 percent on GPT-4o and Gemini across different input types, with the attack holding up against existing defenses. The authors also outline a mitigation that limits this class of attacks while affecting other multi-turn methods.

Core claim

Salami Slicing Risk operates by chaining numerous low-risk inputs that individually evade alignment thresholds but together accumulate harmful intent, ultimately triggering high-risk behaviors without heavy reliance on pre-designed contextual structures. Building on this risk, the Salami Attack is developed as an automatic framework applicable across model types and modalities. The reported experiments show state-of-the-art performance across diverse models and modalities, including over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses.

What carries the argument

Salami Slicing Risk, the process of chaining low-risk inputs so that harmful intent accumulates across turns without any single input crossing safety filters.
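
As a rough illustration of that mechanism (not the authors' code), the sketch below keeps every individual prompt under a per-turn risk threshold while the conversation as a whole advances toward the goal; query_llm and risk_score are hypothetical stand-ins.

```python
# Minimal sketch of the salami-slicing pattern, assuming hypothetical helpers
# query_llm (a chat-style model call) and risk_score (a per-prompt risk
# estimate in [0, 1]). Not the paper's implementation.
from typing import Callable, Dict, List

def salami_chain(
    candidate_prompts: List[str],
    query_llm: Callable[[List[Dict[str, str]]], str],
    risk_score: Callable[[str], float],
    per_turn_threshold: float = 0.3,
) -> List[Dict[str, str]]:
    """Send prompts one turn at a time, never sending any single prompt whose
    standalone risk would cross the per-turn threshold."""
    history: List[Dict[str, str]] = []
    for prompt in candidate_prompts:
        if risk_score(prompt) >= per_turn_threshold:
            continue  # individually risky prompts are dropped, not sent
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
    return history
```

The sketch only guarantees that no single turn crosses the threshold; whether harmful intent actually accumulates in the model is the paper's empirical claim.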

If this is right

  • The attack applies to multiple model types and input modalities without requiring per-model tuning.
  • It reaches over 90% attack success rate on leading models including GPT-4o and Gemini.
  • The method stays effective against current real-world alignment defenses.
  • A defense strategy reduces Salami Attack success by at least 44.8% and blocks other multi-turn jailbreaks by up to 64.8%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety systems may need to evaluate cumulative intent across full conversation histories instead of isolated prompts.
  • Alignment training that targets only explicit triggers leaves room for gradual intent buildup.
  • Similar accumulation patterns could be tested in non-language AI systems that process sequential user inputs.
  • Defenses based on tracking intent progression might apply to other multi-turn attack families.

Load-bearing premise

Numerous low-risk inputs can reliably accumulate harmful intent without any single input triggering filters, and this holds across models without heavy model-specific tuning.

What would settle it

A controlled test in which a long sequence of low-risk prompts on a sensitive topic is fed to the model and the output remains fully safe or blocked, even as chain length increases.
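
A hedged sketch of that test: feed growing prefixes of a low-risk chain and record the first length, if any, at which the reply is judged unsafe. query_llm and is_safe are assumed helpers, not artifacts from the paper.

```python
# Sketch of the settling experiment described above: the premise would be
# undermined if replies stay safe (or are refused) at every chain length.
# query_llm and is_safe are assumed helpers.
from typing import Callable, Dict, List, Optional

def chain_length_probe(
    prompt_chain: List[str],
    query_llm: Callable[[List[Dict[str, str]]], str],
    is_safe: Callable[[str], bool],
) -> Optional[int]:
    """Return the first chain length at which the model's reply is judged
    unsafe, or None if every prefix of the chain stays safe."""
    history: List[Dict[str, str]] = []
    for turn, prompt in enumerate(prompt_chain, start=1):
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
        if not is_safe(reply):
            return turn
    return None
```

If chain_length_probe returns None for long chains across many sensitive topics, the load-bearing premise above would be undermined.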

Figures

Figures reproduced from arXiv: 2604.11309 by Dongxian Wu, Haolin Wu, Jiangrong Wu, Jun Sun, Kai Wang, Meng Sun, Xun Chen, Yihao Zhang, Yuxuan Zhou, Zeming Wei.

Figure 1: Illustration of Salami Slicing Risk on a Real-World LLM Application (ChatGPT’s Web Interface, 2025-11-14).
Figure 2: Attention scores of L2H26 during generation of the
Figure 3: 2D PCA of activations in Layer 7 and Layer 14.
Figure 4: Illustration for the Workflow of A-Salami.
Figure 5: Gemini Generated Harmful Images with Corresponding
Figure 6: Category-wise harmful-score comparison among
Original abstract

Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) As models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) Successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that constrains the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces 'Salami Slicing Risk' as a multi-turn jailbreak vector in which sequences of individually low-risk prompts are chained to accumulate harmful intent and bypass LLM alignment filters without explicit triggers or heavy model-specific tuning. It presents an automatic 'Salami Attack' framework claimed to be universal across models and modalities, reports >90% attack success rate (ASR) on GPT-4o and Gemini, and proposes a defense that reduces Salami Attack success by at least 44.8% while blocking up to 64.8% of other multi-turn jailbreaks.

Significance. If the empirical claims hold with reproducible methodology, the work would be significant for identifying a class of cumulative, low-signal multi-turn attacks that evade single-prompt and context-aware defenses. It would provide concrete evidence that alignment thresholds can be circumvented through gradual intent accumulation and would supply a mitigation baseline. The absence of experimental protocols, prompt constructions, baselines, and statistical reporting in the current manuscript, however, prevents assessment of whether these results are robust or generalizable.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim of >90% ASR on GPT-4o and Gemini is stated without any description of the experimental protocol, including the number of trials, selection criteria or generation method for the low-risk input sequences, chain lengths, success criteria, or comparison against existing multi-turn baselines. This information is load-bearing for the 'state-of-the-art' and 'universal' assertions.
  2. [§3] §3 (Salami Attack framework): The description of how low-risk inputs are chosen and how cumulative intent is accumulated lacks concrete examples, algorithmic details, or analysis of why context-aware models (e.g., GPT-4o) do not flag the growing context itself. Without this, the claim that the method works 'without heavy reliance on pre-designed contextual structures' and 'automatically' across models cannot be evaluated.
  3. [§5] §5 (Defense evaluation): The reported 44.8% reduction in Salami Attack success and 64.8% blocking rate against other multi-turn attacks are given without the defense mechanism details, evaluation dataset, or ablation showing that the defense does not simply degrade utility on benign multi-turn conversations.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use 'rigorous experiments' and 'state-of-the-art' without accompanying tables, figures, or statistical measures (error bars, confidence intervals, or number of runs). Adding these would improve clarity.
  2. [§2] Notation for 'Salami Slicing Risk' and 'Salami Attack' is introduced without a formal definition or distinction from prior multi-turn jailbreak literature; a short related-work subsection would help readers situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the manuscript requires additional experimental details, examples, and evaluations to support its claims and ensure reproducibility. We will revise the paper to address each major comment as described below.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of >90% ASR on GPT-4o and Gemini is stated without any description of the experimental protocol, including the number of trials, selection criteria or generation method for the low-risk input sequences, chain lengths, success criteria, or comparison against existing multi-turn baselines. This information is load-bearing for the 'state-of-the-art' and 'universal' assertions.

    Authors: We agree that the current manuscript lacks sufficient detail on the experimental protocol, which limits assessment of the results. In the revised version, we will expand §4 and the abstract to specify: 100 independent trials per model; low-risk sequences generated via an automated algorithm using a lightweight risk classifier to select benign prompts that incrementally build intent over chains of 8-12 turns; success criteria defined as the model producing harmful content without refusal (verified by both automated classifiers and human annotators); and direct comparisons to multi-turn baselines from the literature (e.g., context-aware jailbreak methods). We will also include statistical reporting such as confidence intervals. These additions will substantiate the performance and universality claims. revision: yes

  2. Referee: [§3] §3 (Salami Attack framework): The description of how low-risk inputs are chosen and how cumulative intent is accumulated lacks concrete examples, algorithmic details, or analysis of why context-aware models (e.g., GPT-4o) do not flag the growing context itself. Without this, the claim that the method works 'without heavy reliance on pre-designed contextual structures' and 'automatically' across models cannot be evaluated.

    Authors: We acknowledge that §3 would benefit from greater specificity. We will revise the section to include concrete examples of low-risk prompt sequences and their cumulative buildup. Algorithmic details will be added: the framework uses an automated greedy selection process based on embedding similarity to harmful concepts while keeping individual prompts below risk thresholds via a pre-trained classifier. For context-aware models, we will provide analysis and experimental observations showing that the gradual, distributed intent accumulation avoids triggering single-turn or context-monitoring flags until the final turn, as intermediate contexts remain individually benign. This will clarify the automatic and low-reliance aspects. revision: yes

  3. Referee: [§5] §5 (Defense evaluation): The reported 44.8% reduction in Salami Attack success and 64.8% blocking rate against other multi-turn attacks are given without the defense mechanism details, evaluation dataset, or ablation showing that the defense does not simply degrade utility on benign multi-turn conversations.

    Authors: We agree that §5 requires more complete information on the defense evaluation. In the revision, we will detail the defense mechanism as a cumulative risk monitoring approach that computes an aggregated risk score across turns and intervenes (via rewriting or refusal) above a threshold. The evaluation dataset will be specified as 150 Salami Attack instances plus 100 from other multi-turn techniques. We will also add an ablation study and utility assessment on 200 benign multi-turn conversations, showing that the defense maintains response quality and helpfulness with no significant degradation per human ratings and automated metrics. revision: yes
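
The cumulative risk monitoring defense outlined in response 3 could be prototyped along the lines of the sketch below; the exponential-decay aggregation, the budget value, and the helper names are assumptions for illustration, not the authors' mechanism.

```python
# Sketch of a cumulative-risk monitor: per-turn risk scores are aggregated
# across the conversation and the system refuses (or rewrites) once the
# running total crosses a budget. All names and the decay rule are assumptions.
from typing import Callable

class CumulativeRiskMonitor:
    def __init__(self, risk_score: Callable[[str], float],
                 budget: float = 1.0, decay: float = 0.9) -> None:
        self.risk_score = risk_score  # per-prompt risk estimate in [0, 1]
        self.budget = budget          # total risk tolerated per conversation
        self.decay = decay            # older turns are down-weighted slightly
        self.cumulative = 0.0

    def allow(self, prompt: str) -> bool:
        """Update the running risk total and decide whether to answer."""
        self.cumulative = self.decay * self.cumulative + self.risk_score(prompt)
        return self.cumulative < self.budget
```

The utility ablation the referee asks for would then amount to running the same monitor over benign multi-turn conversations and measuring how often it intervenes.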
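
Likewise, the greedy, embedding-similarity selection promised in response 2 might look roughly like this sketch; embed, risk_score, the cosine-similarity objective, and the 0.3 threshold are all assumptions rather than details from the paper.

```python
# Illustrative sketch of a greedy next-prompt selection step: among candidates
# that a risk classifier still scores as individually benign, pick the one
# closest in embedding space to the harmful target concept.
# embed and risk_score are hypothetical helpers.
from typing import Callable, List, Optional
import numpy as np

def greedy_next_prompt(
    candidates: List[str],
    target_embedding: np.ndarray,
    embed: Callable[[str], np.ndarray],
    risk_score: Callable[[str], float],
    risk_threshold: float = 0.3,
) -> Optional[str]:
    """Return the admissible candidate most similar to the target concept,
    or None if every candidate is individually too risky."""
    best_prompt, best_sim = None, -1.0
    for prompt in candidates:
        if risk_score(prompt) >= risk_threshold:
            continue  # would be flagged on its own, so never used
        vec = embed(prompt)
        sim = float(np.dot(vec, target_embedding)
                    / (np.linalg.norm(vec) * np.linalg.norm(target_embedding)))
        if sim > best_sim:
            best_prompt, best_sim = prompt, sim
    return best_prompt
```

Repeating such a selection turn by turn is what would keep each prompt individually benign while the sequence drifts toward the target.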

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential reductions

full rationale

The paper proposes the Salami Slicing Risk concept and Salami Attack framework as an empirical approach to multi-turn jailbreaking, validated through experiments reporting >90% ASR on models like GPT-4o. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central claims rest on experimental outcomes rather than any chain that reduces by construction to its own inputs or self-citations. This is a standard empirical security paper whose results are externally falsifiable via replication on the tested models and defenses; no load-bearing step collapses into tautology or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical observation that low-risk prompts can accumulate intent without triggering filters; no free parameters, mathematical axioms, or new physical entities are introduced.

invented entities (1)
  • Salami Slicing Risk (no independent evidence)
    purpose: Conceptual model for cumulative harmful intent built from individually safe inputs
    New term introduced to describe the attack surface

pith-pipeline@v0.9.0 · 5622 in / 1049 out tokens · 42225 ms · 2026-05-10T16:02:13.443702+00:00 · methodology

discussion (0)

