Pith · machine review for the scientific record

arxiv: 2605.05058 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.AI


SoK: Robustness in Large Language Models against Jailbreak Attacks

Bin Benjamin Zhu, Chaoxiang He, Dawu Gu, Feiyue Xu, Hanqing Hu, Hongsheng Hu, Sheng Hang, Shi-Feng Sun, Shuo Wang, Xiuming Liu, Yubo Zhao, Zhengyan Zhou


Pith reviewed 2026-05-08 16:38 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak attacks · LLM robustness · Security Cube · adversarial prompts · systematization of knowledge · AI safety evaluation

The pith

Security Cube supplies a multi-dimensional framework that benchmarks 13 jailbreak attacks and 5 defenses to map LLM vulnerabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes existing work on jailbreak attacks that force large language models to produce disallowed outputs. It introduces Security Cube as a unified evaluation structure with multiple dimensions to replace narrow measures such as attack success rate. Benchmarking 13 representative attacks and 5 defenses produces a landscape view that includes automated judges and model weaknesses. Readers interested in AI safety care because clearer evaluation methods can guide the creation of more reliable safeguards.
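The move from a scalar attack-success rate to a multi-dimensional profile can be pictured concretely. The dimension names below are illustrative stand-ins, not the paper's actual Security Cube axes:

```python
from dataclasses import dataclass, field

@dataclass
class AttackEvaluation:
    """One attack scored along several security dimensions (names illustrative)."""
    name: str
    scores: dict = field(default_factory=dict)  # dimension -> score in [0, 1]

    def scalar_asr(self) -> float:
        # The narrow view: just the attack-success-rate number.
        return self.scores.get("attack_success_rate", 0.0)

    def profile(self) -> dict:
        # The multi-dimensional view: every dimension, side by side.
        return dict(self.scores)

ev = AttackEvaluation("template_rewriting", {
    "attack_success_rate": 0.42,
    "stealthiness": 0.80,       # e.g. low-perplexity prompts
    "query_efficiency": 0.55,   # fewer queries -> higher score
    "transferability": 0.30,    # success across model families
})
print(ev.scalar_asr())   # 0.42 -- all that single-metric evaluation reports
print(ev.profile())      # the fuller picture a multi-dimensional framework keeps
```

The point of the sketch is that two attacks with identical attack-success rates can have very different profiles, which is exactly the information a single metric discards.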

Core claim

The central claim is that a systematic taxonomy combined with the Security Cube framework enables comprehensive benchmarking of jailbreak techniques and defenses, exposing current limitations and open problems in making large language models robust.

What carries the argument

Security Cube, a unified multi-dimensional framework that evaluates jailbreak attacks and defenses across several security aspects simultaneously.

Load-bearing premise

The chosen 13 attacks, 5 defenses, and the dimensions defined in Security Cube are representative enough to capture the full range of LLM security issues.

What would settle it

A previously unknown jailbreak attack that evades all five defenses yet registers as low-risk on every Security Cube dimension (showing the dimensions miss real vulnerabilities), or a defense that blocks attacks across all dimensions with no measurable drop in model utility.

Figures

Figures reproduced from arXiv: 2605.05058 by Bin Benjamin Zhu, Chaoxiang He, Dawu Gu, Feiyue Xu, Hanqing Hu, Hongsheng Hu, Sheng Hang, Shi-Feng Sun, Shuo Wang, Xiuming Liu, Yubo Zhao, Zhengyan Zhou.

Figure 1. Overview of the Security Cube pipeline. Given a jailbreak goal, the attacker generates an initial adversarial prompt using a specific attack method (e.g., shuffling, LLM-based generation, or template rewriting). The target model, protected by a defense mechanism such as system prompts, pre-/post-guardrails, or other safety layers, produces a response. The attacker iteratively refines the prompt based on de…
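The pipeline in the Figure 1 caption (attack method generates a prompt, defended target responds, attacker refines on feedback) reduces to a generic loop. Every callable here is a hypothetical stand-in for whichever concrete attack, defense, and judge are plugged in:

```python
def jailbreak_loop(goal, mutate, target, judge, max_iters=10):
    """Generic attack / defended-target / judge loop (a sketch, not the paper's code).

    mutate(goal, prompt, response) -> refined adversarial prompt (the attack method)
    target(prompt) -> response     (the model behind its defense layer)
    judge(goal, response) -> bool  (did the response fulfil the jailbreak goal?)
    """
    prompt, response = goal, None
    for i in range(max_iters):
        prompt = mutate(goal, prompt, response)   # refine based on last feedback
        response = target(prompt)                 # defended model answers
        if judge(goal, response):                 # automated judge decides success
            return {"success": True, "iterations": i + 1, "prompt": prompt}
    return {"success": False, "iterations": max_iters, "prompt": prompt}

# Toy instantiation: the "attack" succeeds once the prompt is obfuscated enough.
result = jailbreak_loop(
    goal="G",
    mutate=lambda g, p, r: p + "*",  # trivially "refine" the prompt each round
    target=lambda p: "ok" if p.count("*") >= 3 else "I can't help with that",
    judge=lambda g, r: r == "ok",
)
print(result["success"], result["iterations"])  # True 3
```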
Figure 2. Comparison of judge methods: rule-based judge (Rule), binary prompt-based judge (Safe), prompt-based …
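A rule-based judge of the kind Figure 2 compares is typically a refusal-string matcher. A minimal sketch with an illustrative marker list (not the paper's actual rules):

```python
# Illustrative refusal markers; real rule-based judges use curated lists.
REFUSAL_MARKERS = [
    "i'm sorry", "i cannot", "i can't", "as an ai",
    "i am unable", "it is not appropriate",
]

def rule_based_judge(response: str) -> bool:
    """Return True if the attack is judged successful (no refusal marker found).

    Cheap and fast, but brittle: a response can avoid every marker while still
    being harmless, or refuse in unlisted words -- one reason to compare
    rule-based judges against prompt-based (LLM) judges.
    """
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

print(rule_based_judge("I'm sorry, I can't help with that."))  # False
print(rule_based_judge("Sure, here is the information..."))    # True
```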
Figure 3. Utility change of each defense, measured as …
Figure 4. Cosine similarity between successful and failed …
Figure 5. t-SNE visualization of successful attack prompt …
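The embedding analyses behind Figures 4 and 5 rest on cosine similarity between prompt representations. A self-contained sketch, with toy vectors standing in for real prompt embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy embeddings: if successful prompts cluster (as a t-SNE plot would show),
# within-group similarity should exceed cross-group similarity.
successful = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
failed     = [[0.1, 0.9, 0.2]]

within = cosine_similarity(successful[0], successful[1])
across = cosine_similarity(successful[0], failed[0])
print(within > across)  # True for this toy data
```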
Original abstract

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a systematization of knowledge (SoK) on jailbreak attacks against LLMs, introduces a taxonomy of attacks and defenses, proposes the Security Cube as a multi-dimensional evaluation framework, provides comparison tables of existing methods, and reports benchmark results on 13 representative attacks and 5 defenses to map the current landscape, distill findings on vulnerabilities and automated judges, and suggest research directions for improved robustness.

Significance. If the benchmarks hold, this work offers a valuable unified framework and landscape overview that could standardize evaluations beyond single-metric attack success rates, helping the field identify gaps in LLM safety; the open code and explicit taxonomy are strengths for reproducibility and future work.

major comments (1)
  1. [Benchmark studies] Benchmark studies section (as described in abstract): the central claim that the selected 13 attacks and 5 defenses, evaluated under Security Cube, establish a 'clear view of the current landscape' is load-bearing but rests on the unelaborated assertion that they are 'representative'; without explicit selection criteria or coverage analysis (e.g., across gradient-based vs. prompt-engineering attacks, alignment vs. detection defenses, or model families), the distilled findings and research directions risk incompleteness or selection bias, directly undermining the comprehensiveness asserted in the abstract.
minor comments (2)
  1. [Abstract] Abstract: the code availability statement ('Our code is available at Code.') is incomplete; provide the actual repository URL.
  2. Throughout: ensure all comparison tables explicitly reference the Security Cube dimensions used for each entry to improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of our benchmark studies.

Point-by-point responses
  1. Referee: [Benchmark studies] Benchmark studies section (as described in abstract): the central claim that the selected 13 attacks and 5 defenses, evaluated under Security Cube, establish a 'clear view of the current landscape' is load-bearing but rests on the unelaborated assertion that they are 'representative'; without explicit selection criteria or coverage analysis (e.g., across gradient-based vs. prompt-engineering attacks, alignment vs. detection defenses, or model families), the distilled findings and research directions risk incompleteness or selection bias, directly undermining the comprehensiveness asserted in the abstract.

    Authors: We agree that the selection of the 13 attacks and 5 defenses requires explicit justification to support the claim of a representative landscape view. While our taxonomy (Section 3) and comparison tables (Sections 4 and 5) already organize methods by core dimensions, the benchmark section does not sufficiently articulate the selection process. In the revised manuscript we will add a dedicated paragraph (and supporting table) in the Benchmark Studies section that states the criteria: (1) coverage of all major taxonomy branches (prompt-engineering vs. optimization/gradient-based attacks; alignment-based vs. detection-based defenses); (2) inclusion of both closed-source (GPT family) and open-source (Llama, Vicuna) model families; (3) prominence and recency in the literature. We will also include a coverage matrix showing how the chosen methods map onto the taxonomy axes, thereby demonstrating breadth and mitigating concerns of selection bias. These additions will be placed before the results tables so that the distilled findings rest on a transparent foundation. revision: yes
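The coverage matrix the authors promise can be sketched as a simple branch-to-methods mapping; the method names and taxonomy branches below are illustrative, not the revised manuscript's actual table:

```python
# Hypothetical coverage matrix: which taxonomy branches the benchmarked
# methods touch. Method-to-branch assignments here are illustrative only.
taxonomy_branches = ["prompt_engineering", "optimization_based",
                     "alignment_defense", "detection_defense"]

methods = {
    "template_rewriting": {"prompt_engineering"},
    "gradient_suffix":    {"optimization_based"},
    "safety_finetune":    {"alignment_defense"},
    "perplexity_filter":  {"detection_defense"},
}

def coverage(methods, branches):
    """Map each taxonomy branch to the methods covering it; empty lists expose gaps."""
    return {b: sorted(m for m, tags in methods.items() if b in tags)
            for b in branches}

matrix = coverage(methods, taxonomy_branches)
uncovered = [b for b, ms in matrix.items() if not ms]
print(uncovered)  # [] -- every branch covered in this toy selection
```

An empty `uncovered` list is the transparency the referee asks for: it shows at a glance that the chosen methods span every taxonomy axis, and any non-empty entry would flag a selection gap.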

Circularity Check

0 steps flagged

No significant circularity in the SoK taxonomy, Security Cube framework, or benchmarks

full rationale

This is a systematization-of-knowledge paper that surveys existing jailbreak literature, proposes an independent multi-dimensional evaluation framework (Security Cube), and runs benchmarks on 13 attacks plus 5 defenses drawn from prior work. No mathematical derivations, parameter-fitting steps, or predictions appear in the provided text. The central claims rest on explicit selection of representative methods and construction of comparison tables rather than any self-referential definition, fitted-input-as-prediction, or load-bearing self-citation chain. The framework and taxonomy are presented as new organizing structures built on external literature, not as reductions of the paper's own inputs. Selection of the 13+5 methods is a representativeness judgment, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new evaluation framework but contains no mathematical free parameters, axioms, or invented physical entities; the framework itself is the primary addition.

invented entities (1)
  • Security Cube (no independent evidence)
    purpose: Unified multi-dimensional framework for comprehensive evaluation of jailbreak attacks and defenses
    Presented as a new construct in the paper to address limitations of single-metric evaluations.

pith-pipeline@v0.9.0 · 5558 in / 1191 out tokens · 26036 ms · 2026-05-08T16:38:39.510275+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

112 extracted references · 40 canonical work pages · 9 internal anchors
