Pith · machine review for the scientific record

arxiv: 2605.05058 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.AI


SoK: Robustness in Large Language Models against Jailbreak Attacks

Bin Benjamin Zhu, Chaoxiang He, Dawu Gu, Feiyue Xu, Hanqing Hu, Hongsheng Hu, Sheng Hang, Shi-Feng Sun, Shuo Wang, Xiuming Liu, Yubo Zhao, Zhengyan Zhou


Pith reviewed 2026-05-08 16:38 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak attacks · LLM robustness · Security Cube · adversarial prompts · systematization of knowledge · AI safety evaluation

The pith

Security Cube supplies a multi-dimensional framework that benchmarks 13 jailbreak attacks and 5 defenses to map LLM vulnerabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes existing work on jailbreak attacks that force large language models to produce disallowed outputs. It introduces Security Cube as a unified evaluation structure with multiple dimensions to replace narrow measures such as attack success rate. Benchmarking 13 representative attacks and 5 defenses produces a landscape view that includes automated judges and model weaknesses. Readers interested in AI safety care because clearer evaluation methods can guide the creation of more reliable safeguards.
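The move from a scalar attack-success rate to a multi-dimensional profile can be pictured concretely. The dimension names below are illustrative stand-ins, not the paper's actual Security Cube axes:

```python
from dataclasses import dataclass, field

@dataclass
class AttackEvaluation:
    """One attack scored along several security dimensions (names illustrative)."""
    name: str
    scores: dict = field(default_factory=dict)  # dimension -> score in [0, 1]

    def scalar_asr(self) -> float:
        # The narrow view: just the attack-success-rate number.
        return self.scores.get("attack_success_rate", 0.0)

    def profile(self) -> dict:
        # The multi-dimensional view: every dimension, side by side.
        return dict(self.scores)

ev = AttackEvaluation("template_rewriting", {
    "attack_success_rate": 0.42,
    "stealthiness": 0.80,       # e.g. low-perplexity prompts
    "query_efficiency": 0.55,   # fewer queries -> higher score
    "transferability": 0.30,    # success across model families
})
print(ev.scalar_asr())   # 0.42 -- all that single-metric evaluation reports
print(ev.profile())      # the fuller picture a multi-dimensional framework keeps
```

The point of the sketch is that two attacks with identical attack-success rates can have very different profiles, which is exactly the information a single metric discards.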

Core claim

The central claim is that a systematic taxonomy combined with the Security Cube framework enables comprehensive benchmarking of jailbreak techniques and defenses, exposing current limitations and open problems in making large language models robust.

What carries the argument

Security Cube, a unified multi-dimensional framework that evaluates jailbreak attacks and defenses across several security aspects simultaneously.

Load-bearing premise

The chosen 13 attacks, 5 defenses, and the dimensions defined in Security Cube are representative enough to capture the full range of LLM security issues.

What would settle it

A previously unknown jailbreak attack that evades all five defenses yet registers as low-risk on every Security Cube dimension (showing the dimensions miss real vulnerabilities), or a defense that blocks attacks across all dimensions with no measurable drop in model utility.

Figures

Figures reproduced from arXiv: 2605.05058 by Bin Benjamin Zhu, Chaoxiang He, Dawu Gu, Feiyue Xu, Hanqing Hu, Hongsheng Hu, Sheng Hang, Shi-Feng Sun, Shuo Wang, Xiuming Liu, Yubo Zhao, Zhengyan Zhou.

Figure 1. Overview of the Security Cube pipeline. Given a jailbreak goal, the attacker generates an initial adversarial prompt using a specific attack method (e.g., shuffling, LLM-based generation, or template rewriting). The target model, protected by a defense mechanism such as system prompts, pre-/post-guardrails, or other safety layers, produces a response. The attacker iteratively refines the prompt based on de…
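The pipeline in the Figure 1 caption (attack method generates a prompt, defended target responds, attacker refines on feedback) reduces to a generic loop. Every callable here is a hypothetical stand-in for whichever concrete attack, defense, and judge are plugged in:

```python
def jailbreak_loop(goal, mutate, target, judge, max_iters=10):
    """Generic attack / defended-target / judge loop (a sketch, not the paper's code).

    mutate(goal, prompt, response) -> refined adversarial prompt (the attack method)
    target(prompt) -> response     (the model behind its defense layer)
    judge(goal, response) -> bool  (did the response fulfil the jailbreak goal?)
    """
    prompt, response = goal, None
    for i in range(max_iters):
        prompt = mutate(goal, prompt, response)   # refine based on last feedback
        response = target(prompt)                 # defended model answers
        if judge(goal, response):                 # automated judge decides success
            return {"success": True, "iterations": i + 1, "prompt": prompt}
    return {"success": False, "iterations": max_iters, "prompt": prompt}

# Toy instantiation: the "attack" succeeds once the prompt is obfuscated enough.
result = jailbreak_loop(
    goal="G",
    mutate=lambda g, p, r: p + "*",  # trivially "refine" the prompt each round
    target=lambda p: "ok" if p.count("*") >= 3 else "I can't help with that",
    judge=lambda g, r: r == "ok",
)
print(result["success"], result["iterations"])  # True 3
```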
Figure 2. Comparison of judge methods: rule-based judge (Rule), binary prompt-based judge (Safe), prompt-based …
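A rule-based judge of the kind Figure 2 compares is typically a refusal-string matcher. A minimal sketch with an illustrative marker list (not the paper's actual rules):

```python
# Illustrative refusal markers; real rule-based judges use curated lists.
REFUSAL_MARKERS = [
    "i'm sorry", "i cannot", "i can't", "as an ai",
    "i am unable", "it is not appropriate",
]

def rule_based_judge(response: str) -> bool:
    """Return True if the attack is judged successful (no refusal marker found).

    Cheap and fast, but brittle: a response can avoid every marker while still
    being harmless, or refuse in unlisted words -- one reason to compare
    rule-based judges against prompt-based (LLM) judges.
    """
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

print(rule_based_judge("I'm sorry, I can't help with that."))  # False
print(rule_based_judge("Sure, here is the information..."))    # True
```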
Figure 3. Utility change of each defense, measured as …
Figure 4. Cosine similarity between successful and failed …
Figure 5. t-SNE visualization of successful attack prompt …
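The embedding analyses behind Figures 4 and 5 rest on cosine similarity between prompt representations. A self-contained sketch, with toy vectors standing in for real prompt embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy embeddings: if successful prompts cluster (as a t-SNE plot would show),
# within-group similarity should exceed cross-group similarity.
successful = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
failed     = [[0.1, 0.9, 0.2]]

within = cosine_similarity(successful[0], successful[1])
across = cosine_similarity(successful[0], failed[0])
print(within > across)  # True for this toy data
```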
Original abstract

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a systematization of knowledge (SoK) on jailbreak attacks against LLMs, introduces a taxonomy of attacks and defenses, proposes the Security Cube as a multi-dimensional evaluation framework, provides comparison tables of existing methods, and reports benchmark results on 13 representative attacks and 5 defenses to map the current landscape, distill findings on vulnerabilities and automated judges, and suggest research directions for improved robustness.

Significance. If the benchmarks hold, this work offers a valuable unified framework and landscape overview that could standardize evaluations beyond single-metric attack success rates, helping the field identify gaps in LLM safety; the open code and explicit taxonomy are strengths for reproducibility and future work.

major comments (1)
  1. [Benchmark studies] Benchmark studies section (as described in abstract): the central claim that the selected 13 attacks and 5 defenses, evaluated under Security Cube, establish a 'clear view of the current landscape' is load-bearing but rests on the unelaborated assertion that they are 'representative'; without explicit selection criteria or coverage analysis (e.g., across gradient-based vs. prompt-engineering attacks, alignment vs. detection defenses, or model families), the distilled findings and research directions risk incompleteness or selection bias, directly undermining the comprehensiveness asserted in the abstract.
minor comments (2)
  1. [Abstract] Abstract: the code availability statement ('Our code is available at Code.') is incomplete; provide the actual repository URL.
  2. Throughout: ensure all comparison tables explicitly reference the Security Cube dimensions used for each entry to improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of our benchmark studies.

Point-by-point responses
  1. Referee: [Benchmark studies] Benchmark studies section (as described in abstract): the central claim that the selected 13 attacks and 5 defenses, evaluated under Security Cube, establish a 'clear view of the current landscape' is load-bearing but rests on the unelaborated assertion that they are 'representative'; without explicit selection criteria or coverage analysis (e.g., across gradient-based vs. prompt-engineering attacks, alignment vs. detection defenses, or model families), the distilled findings and research directions risk incompleteness or selection bias, directly undermining the comprehensiveness asserted in the abstract.

    Authors: We agree that the selection of the 13 attacks and 5 defenses requires explicit justification to support the claim of a representative landscape view. While our taxonomy (Section 3) and comparison tables (Sections 4 and 5) already organize methods by core dimensions, the benchmark section does not sufficiently articulate the selection process. In the revised manuscript we will add a dedicated paragraph (and supporting table) in the Benchmark Studies section that states the criteria: (1) coverage of all major taxonomy branches (prompt-engineering vs. optimization/gradient-based attacks; alignment-based vs. detection-based defenses); (2) inclusion of both closed-source (GPT family) and open-source (Llama, Vicuna) model families; (3) prominence and recency in the literature. We will also include a coverage matrix showing how the chosen methods map onto the taxonomy axes, thereby demonstrating breadth and mitigating concerns of selection bias. These additions will be placed before the results tables so that the distilled findings rest on a transparent foundation. revision: yes
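The coverage matrix the authors promise can be sketched as a simple branch-to-methods mapping; the method names and taxonomy branches below are illustrative, not the revised manuscript's actual table:

```python
# Hypothetical coverage matrix: which taxonomy branches the benchmarked
# methods touch. Method-to-branch assignments here are illustrative only.
taxonomy_branches = ["prompt_engineering", "optimization_based",
                     "alignment_defense", "detection_defense"]

methods = {
    "template_rewriting": {"prompt_engineering"},
    "gradient_suffix":    {"optimization_based"},
    "safety_finetune":    {"alignment_defense"},
    "perplexity_filter":  {"detection_defense"},
}

def coverage(methods, branches):
    """Map each taxonomy branch to the methods covering it; empty lists expose gaps."""
    return {b: sorted(m for m, tags in methods.items() if b in tags)
            for b in branches}

matrix = coverage(methods, taxonomy_branches)
uncovered = [b for b, ms in matrix.items() if not ms]
print(uncovered)  # [] -- every branch covered in this toy selection
```

An empty `uncovered` list is the transparency the referee asks for: it shows at a glance that the chosen methods span every taxonomy axis, and any non-empty entry would flag a selection gap.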

Circularity Check

0 steps flagged

No significant circularity in the SoK taxonomy, Security Cube framework, or benchmarks

full rationale

This is a systematization-of-knowledge paper that surveys existing jailbreak literature, proposes an independent multi-dimensional evaluation framework (Security Cube), and runs benchmarks on 13 attacks plus 5 defenses drawn from prior work. No mathematical derivations, parameter-fitting steps, or predictions appear in the provided text. The central claims rest on explicit selection of representative methods and construction of comparison tables rather than any self-referential definition, fitted-input-as-prediction, or load-bearing self-citation chain. The framework and taxonomy are presented as new organizing structures built on external literature, not as reductions of the paper's own inputs. Selection of the 13+5 methods is a representativeness judgment, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new evaluation framework but contains no mathematical free parameters, axioms, or invented physical entities; the framework itself is the primary addition.

invented entities (1)
  • Security Cube (no independent evidence)
    purpose: Unified multi-dimensional framework for comprehensive evaluation of jailbreak attacks and defenses
    Presented as a new construct in the paper to address limitations of single-metric evaluations.

pith-pipeline@v0.9.0 · 5558 in / 1191 out tokens · 26036 ms · 2026-05-08T16:38:39.510275+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

112 extracted references · 40 canonical work pages · 9 internal anchors
