pith. machine review for the scientific record.

arxiv: 2604.11309 · v1 · submitted 2026-04-13 · 💻 cs.CR · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: unknown

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Dongxian Wu, Haolin Wu, Jiangrong Wu, Jun Sun, Kai Wang, Meng Sun, Xun Chen, Yihao Zhang, Yuxuan Zhou, Zeming Wei

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.CV · cs.LG
keywords LLM jailbreaking · multi-turn attacks · cumulative risk · Salami Attack · AI alignment · prompt chaining · model security · defense strategies

The pith

Chaining many individually low-risk inputs can accumulate enough harmful intent to jailbreak LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models remain open to jailbreaking when attackers use multi-turn conversations that build harmful intent step by step. Each prompt stays below the threshold that triggers safety filters, yet the sequence as a whole steers the model toward unsafe outputs. The paper presents Salami Slicing Risk as the underlying pattern and turns it into an automatic attack framework that requires no explicit harmful trigger and little model-specific tuning. Experiments report success rates above 90 percent on GPT-4o and Gemini across different input types, with the attack holding up against existing defenses. The authors also outline a mitigation that limits this class of attacks while affecting other multi-turn methods.

Core claim

Salami Slicing Risk operates by chaining numerous low-risk inputs that individually evade alignment thresholds but together accumulate harmful intent, ultimately triggering high-risk behaviors without heavy reliance on pre-designed contextual structures. Building on this risk, the Salami Attack is developed as an automatic framework applicable across model types and modalities. The reported experiments show state-of-the-art performance across diverse models and modalities, including over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses.

What carries the argument

Salami Slicing Risk, the process of chaining low-risk inputs so that harmful intent accumulates across turns without any single input crossing safety filters.
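
As a rough illustration of that mechanism (not the authors' code), the sketch below keeps every individual prompt under a per-turn risk threshold while the conversation as a whole advances toward the goal; query_llm and risk_score are hypothetical stand-ins.

```python
# Minimal sketch of the salami-slicing pattern, assuming hypothetical helpers
# query_llm (a chat-style model call) and risk_score (a per-prompt risk
# estimate in [0, 1]). Not the paper's implementation.
from typing import Callable, Dict, List

def salami_chain(
    candidate_prompts: List[str],
    query_llm: Callable[[List[Dict[str, str]]], str],
    risk_score: Callable[[str], float],
    per_turn_threshold: float = 0.3,
) -> List[Dict[str, str]]:
    """Send prompts one turn at a time, never sending any single prompt whose
    standalone risk would cross the per-turn threshold."""
    history: List[Dict[str, str]] = []
    for prompt in candidate_prompts:
        if risk_score(prompt) >= per_turn_threshold:
            continue  # individually risky prompts are dropped, not sent
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
    return history
```

The sketch only guarantees that no single turn crosses the threshold; whether harmful intent actually accumulates in the model is the paper's empirical claim.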

If this is right

  • The attack applies to multiple model types and input modalities without requiring per-model tuning.
  • It reaches over 90% attack success rate on leading models including GPT-4o and Gemini.
  • The method stays effective against current real-world alignment defenses.
  • A defense strategy reduces Salami Attack success by at least 44.8% and blocks other multi-turn jailbreaks by up to 64.8%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety systems may need to evaluate cumulative intent across full conversation histories instead of isolated prompts.
  • Alignment training that targets only explicit triggers leaves room for gradual intent buildup.
  • Similar accumulation patterns could be tested in non-language AI systems that process sequential user inputs.
  • Defenses based on tracking intent progression might apply to other multi-turn attack families.

Load-bearing premise

Numerous low-risk inputs can reliably accumulate harmful intent without any single input triggering filters, and this holds across models without heavy model-specific tuning.

What would settle it

A controlled test in which a long sequence of low-risk prompts on a sensitive topic is fed to the model and the output remains fully safe or blocked, even as chain length increases.
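
A hedged sketch of that test: feed growing prefixes of a low-risk chain and record the first length, if any, at which the reply is judged unsafe. query_llm and is_safe are assumed helpers, not artifacts from the paper.

```python
# Sketch of the settling experiment described above: the premise would be
# undermined if replies stay safe (or are refused) at every chain length.
# query_llm and is_safe are assumed helpers.
from typing import Callable, Dict, List, Optional

def chain_length_probe(
    prompt_chain: List[str],
    query_llm: Callable[[List[Dict[str, str]]], str],
    is_safe: Callable[[str], bool],
) -> Optional[int]:
    """Return the first chain length at which the model's reply is judged
    unsafe, or None if every prefix of the chain stays safe."""
    history: List[Dict[str, str]] = []
    for turn, prompt in enumerate(prompt_chain, start=1):
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
        if not is_safe(reply):
            return turn
    return None
```

If chain_length_probe returns None for long chains across many sensitive topics, the load-bearing premise above would be undermined.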

Figures

Figures reproduced from arXiv: 2604.11309 by Dongxian Wu, Haolin Wu, Jiangrong Wu, Jun Sun, Kai Wang, Meng Sun, Xun Chen, Yihao Zhang, Yuxuan Zhou, Zeming Wei.

Figure 1: Illustration of Salami Slicing Risk on a Real-World LLM Application (ChatGPT’s Web Interface, 2025-11-14).
Figure 2: Attention scores of L2H26 during generation of the
Figure 3: 2D PCA of activations in Layer 7 and Layer 14.
Figure 4: Illustration for the Workflow of A-Salami.
Figure 5: Gemini Generated Harmful Images with Corresponding
Figure 6: Category-wise harmful-score comparison among
Original abstract

Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) As models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) Successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that constrains the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces 'Salami Slicing Risk' as a multi-turn jailbreak vector in which sequences of individually low-risk prompts are chained to accumulate harmful intent and bypass LLM alignment filters without explicit triggers or heavy model-specific tuning. It presents an automatic 'Salami Attack' framework claimed to be universal across models and modalities, reports >90% attack success rate (ASR) on GPT-4o and Gemini, and proposes a defense that reduces Salami Attack success by at least 44.8% while blocking up to 64.8% of other multi-turn jailbreaks.

Significance. If the empirical claims hold with reproducible methodology, the work would be significant for identifying a class of cumulative, low-signal multi-turn attacks that evade single-prompt and context-aware defenses. It would provide concrete evidence that alignment thresholds can be circumvented through gradual intent accumulation and would supply a mitigation baseline. The absence of experimental protocols, prompt constructions, baselines, and statistical reporting in the current manuscript, however, prevents assessment of whether these results are robust or generalizable.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim of >90% ASR on GPT-4o and Gemini is stated without any description of the experimental protocol, including the number of trials, selection criteria or generation method for the low-risk input sequences, chain lengths, success criteria, or comparison against existing multi-turn baselines. This information is load-bearing for the 'state-of-the-art' and 'universal' assertions.
  2. [§3] §3 (Salami Attack framework): The description of how low-risk inputs are chosen and how cumulative intent is accumulated lacks concrete examples, algorithmic details, or analysis of why context-aware models (e.g., GPT-4o) do not flag the growing context itself. Without this, the claim that the method works 'without heavy reliance on pre-designed contextual structures' and 'automatically' across models cannot be evaluated.
  3. [§5] §5 (Defense evaluation): The reported 44.8% reduction in Salami Attack success and 64.8% blocking rate against other multi-turn attacks are given without the defense mechanism details, evaluation dataset, or ablation showing that the defense does not simply degrade utility on benign multi-turn conversations.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use 'rigorous experiments' and 'state-of-the-art' without accompanying tables, figures, or statistical measures (error bars, confidence intervals, or number of runs). Adding these would improve clarity.
  2. [§2] Notation for 'Salami Slicing Risk' and 'Salami Attack' is introduced without a formal definition or distinction from prior multi-turn jailbreak literature; a short related-work subsection would help readers situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the manuscript requires additional experimental details, examples, and evaluations to support its claims and ensure reproducibility. We will revise the paper to address each major comment as described below.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of >90% ASR on GPT-4o and Gemini is stated without any description of the experimental protocol, including the number of trials, selection criteria or generation method for the low-risk input sequences, chain lengths, success criteria, or comparison against existing multi-turn baselines. This information is load-bearing for the 'state-of-the-art' and 'universal' assertions.

    Authors: We agree that the current manuscript lacks sufficient detail on the experimental protocol, which limits assessment of the results. In the revised version, we will expand §4 and the abstract to specify: 100 independent trials per model; low-risk sequences generated via an automated algorithm using a lightweight risk classifier to select benign prompts that incrementally build intent over chains of 8-12 turns; success criteria defined as the model producing harmful content without refusal (verified by both automated classifiers and human annotators); and direct comparisons to multi-turn baselines from the literature (e.g., context-aware jailbreak methods). We will also include statistical reporting such as confidence intervals. These additions will substantiate the performance and universality claims. revision: yes

  2. Referee: [§3] §3 (Salami Attack framework): The description of how low-risk inputs are chosen and how cumulative intent is accumulated lacks concrete examples, algorithmic details, or analysis of why context-aware models (e.g., GPT-4o) do not flag the growing context itself. Without this, the claim that the method works 'without heavy reliance on pre-designed contextual structures' and 'automatically' across models cannot be evaluated.

    Authors: We acknowledge that §3 would benefit from greater specificity. We will revise the section to include concrete examples of low-risk prompt sequences and their cumulative buildup. Algorithmic details will be added: the framework uses an automated greedy selection process based on embedding similarity to harmful concepts while keeping individual prompts below risk thresholds via a pre-trained classifier. For context-aware models, we will provide analysis and experimental observations showing that the gradual, distributed intent accumulation avoids triggering single-turn or context-monitoring flags until the final turn, as intermediate contexts remain individually benign. This will clarify the automatic and low-reliance aspects. revision: yes

  3. Referee: [§5] §5 (Defense evaluation): The reported 44.8% reduction in Salami Attack success and 64.8% blocking rate against other multi-turn attacks are given without the defense mechanism details, evaluation dataset, or ablation showing that the defense does not simply degrade utility on benign multi-turn conversations.

    Authors: We agree that §5 requires more complete information on the defense evaluation. In the revision, we will detail the defense mechanism as a cumulative risk monitoring approach that computes an aggregated risk score across turns and intervenes (via rewriting or refusal) above a threshold. The evaluation dataset will be specified as 150 Salami Attack instances plus 100 from other multi-turn techniques. We will also add an ablation study and utility assessment on 200 benign multi-turn conversations, showing that the defense maintains response quality and helpfulness with no significant degradation per human ratings and automated metrics. revision: yes
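
The cumulative risk monitoring defense outlined in response 3 could be prototyped along the lines of the sketch below; the exponential-decay aggregation, the budget value, and the helper names are assumptions for illustration, not the authors' mechanism.

```python
# Sketch of a cumulative-risk monitor: per-turn risk scores are aggregated
# across the conversation and the system refuses (or rewrites) once the
# running total crosses a budget. All names and the decay rule are assumptions.
from typing import Callable

class CumulativeRiskMonitor:
    def __init__(self, risk_score: Callable[[str], float],
                 budget: float = 1.0, decay: float = 0.9) -> None:
        self.risk_score = risk_score  # per-prompt risk estimate in [0, 1]
        self.budget = budget          # total risk tolerated per conversation
        self.decay = decay            # older turns are down-weighted slightly
        self.cumulative = 0.0

    def allow(self, prompt: str) -> bool:
        """Update the running risk total and decide whether to answer."""
        self.cumulative = self.decay * self.cumulative + self.risk_score(prompt)
        return self.cumulative < self.budget
```

The utility ablation the referee asks for would then amount to running the same monitor over benign multi-turn conversations and measuring how often it intervenes.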
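
Likewise, the greedy, embedding-similarity selection promised in response 2 might look roughly like this sketch; embed, risk_score, the cosine-similarity objective, and the 0.3 threshold are all assumptions rather than details from the paper.

```python
# Illustrative sketch of a greedy next-prompt selection step: among candidates
# that a risk classifier still scores as individually benign, pick the one
# closest in embedding space to the harmful target concept.
# embed and risk_score are hypothetical helpers.
from typing import Callable, List, Optional
import numpy as np

def greedy_next_prompt(
    candidates: List[str],
    target_embedding: np.ndarray,
    embed: Callable[[str], np.ndarray],
    risk_score: Callable[[str], float],
    risk_threshold: float = 0.3,
) -> Optional[str]:
    """Return the admissible candidate most similar to the target concept,
    or None if every candidate is individually too risky."""
    best_prompt, best_sim = None, -1.0
    for prompt in candidates:
        if risk_score(prompt) >= risk_threshold:
            continue  # would be flagged on its own, so never used
        vec = embed(prompt)
        sim = float(np.dot(vec, target_embedding)
                    / (np.linalg.norm(vec) * np.linalg.norm(target_embedding)))
        if sim > best_sim:
            best_prompt, best_sim = prompt, sim
    return best_prompt
```

Repeating such a selection turn by turn is what would keep each prompt individually benign while the sequence drifts toward the target.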

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential reductions

full rationale

The paper proposes the Salami Slicing Risk concept and Salami Attack framework as an empirical approach to multi-turn jailbreaking, validated through experiments reporting >90% ASR on models like GPT-4o. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central claims rest on experimental outcomes rather than any chain that reduces by construction to its own inputs or self-citations. This is a standard empirical security paper whose results are externally falsifiable via replication on the tested models and defenses; no load-bearing step collapses into tautology or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical observation that low-risk prompts can accumulate intent without triggering filters; no free parameters, mathematical axioms, or new physical entities are introduced.

invented entities (1)
  • Salami Slicing Risk (no independent evidence)
    purpose: Conceptual model for cumulative harmful intent built from individually safe inputs
    New term introduced to describe the attack surface

pith-pipeline@v0.9.0 · 5622 in / 1049 out tokens · 42225 ms · 2026-05-10T16:02:13.443702+00:00 · methodology

discussion (0)

