Pith · machine review for the scientific record

arxiv: 2404.01318 · v5 · submitted 2024-03-28 · 💻 cs.CR · cs.LG

Recognition: 2 Lean theorem links

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:04 UTC · model grok-4.3

classification: 💻 cs.CR · cs.LG
keywords: jailbreaking · large language models · adversarial prompts · benchmark · AI safety · robustness evaluation · reproducibility

The pith

JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JailbreakBench to fix inconsistent metrics, hidden prompts, and non-reproducible results that currently block progress on measuring jailbreak attacks against large language models. It packages an evolving set of state-of-the-art adversarial prompts; a dataset of 100 behaviors, both original and drawn from prior work, aligned with OpenAI's usage policies; and a uniform threat model, chat templates, and scoring rules hosted in open code. A public leaderboard then records attack success rates and defense performance on multiple LLMs so that new methods can be added and ranked over time. A sympathetic reader would care because shared, verifiable numbers replace the current practice of each paper reporting incomparable success rates on private prompts. The authors position the release as a net community benefit after weighing the ethical risks of distributing the artifacts.
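To make the moving parts concrete, here is a minimal Python sketch of how the described components could fit together: a record from the 100-behavior dataset, a stored jailbreak artifact, and a replay of that artifact scored by a shared judge. The class names, fields, and the query_target stub are illustrative assumptions, not the actual jailbreakbench package API (the real interface lives in the linked GitHub repository).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Behavior:
    """One entry of the 100-behavior dataset (fields are hypothetical)."""
    identifier: str   # e.g. "phishing_email"
    goal: str         # natural-language description of the harmful request
    source: str       # "original" or the prior work it was drawn from

@dataclass
class JailbreakArtifact:
    """A stored adversarial prompt produced by one attack for one behavior."""
    behavior_id: str
    attack_method: str   # e.g. "PAIR" or "GCG"
    prompt: str          # the adversarial prompt itself

def query_target(prompt: str) -> str:
    """Placeholder for a call to the target LLM under the fixed chat template."""
    raise NotImplementedError("wire this to a real model endpoint")

def replay_artifact(artifact: JailbreakArtifact,
                    behaviors: List[Behavior],
                    judge: Callable[[str, str], bool]) -> bool:
    """Replay one stored artifact and score the response with the shared judge."""
    behavior = next(b for b in behaviors if b.identifier == artifact.behavior_id)
    response = query_target(artifact.prompt)
    return judge(behavior.goal, response)  # True means the jailbreak was judged successful
```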

Core claim

JailbreakBench is an open benchmark that combines an evolving repository of jailbreak artifacts, a dataset of 100 behaviors, a standardized evaluation framework that specifies threat model, system prompts, and scoring functions, and a public leaderboard that tracks attack and defense performance across LLMs.

What carries the argument

The JailbreakBench evaluation framework, which fixes the threat model, chat templates, and scoring functions so that success rates become directly comparable across papers and models.
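A minimal sketch of what "fixed" buys in practice, assuming a binary judge: the configuration below is pinned once for every submission, and the attack success rate is computed identically for every method. The field names and values are placeholders, not the benchmark's actual settings.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class EvaluationConfig:
    """Settings pinned once and shared by every submission (all values here are placeholders)."""
    system_prompt: str    # fixed system prompt handed to the target model
    chat_template: str    # fixed formatting of user/assistant turns
    max_new_tokens: int   # fixed generation budget
    judge_name: str       # which judge/classifier scores responses

CONFIG = EvaluationConfig(
    system_prompt="You are a helpful assistant.",  # placeholder, not the benchmark's prompt
    chat_template="target-model-default",          # placeholder
    max_new_tokens=150,                            # placeholder
    judge_name="llm-judge",                        # placeholder
)

def attack_success_rate(results: Dict[str, bool]) -> float:
    """Fraction of behaviors judged jailbroken under the shared config."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Because the config and scoring function are frozen, two papers that report
# attack_success_rate over the same 100 behaviors are directly comparable.
example = {"behavior_01": True, "behavior_02": False, "behavior_03": True}
print(f"ASR = {attack_success_rate(example):.2f}")  # prints ASR = 0.67
```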

If this is right

  • Attack papers can report success rates and query costs that other researchers can replicate exactly.
  • The leaderboard will show which attacks remain effective as new LLMs and defenses are released.
  • Defenses can be tested against the same evolving prompt repository instead of author-chosen subsets.
  • New behaviors can be added to the dataset while keeping earlier results comparable through the fixed framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark may encourage defense papers to report robustness numbers on the same public artifacts rather than private test sets.
  • Over time the leaderboard could reveal whether certain defense techniques generalize across the full range of behaviors or only on narrow subsets.
  • Researchers working on multimodal or agent-based attacks could extend the same structure by adding new behavior categories.
  • The open artifact repository creates a natural place to archive prompts that stop working, documenting the moving target of LLM safety.

Load-bearing premise

The selected 100 behaviors together with the chosen scoring functions capture the main real-world jailbreaking risks without systematic bias toward or against particular attack styles.

What would settle it

A newly published attack that reaches high success rates on current production LLMs yet shows low scores when run through the benchmark's fixed 100-behavior set and scoring rules.

Original abstract

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces JailbreakBench, an open-sourced benchmark for evaluating jailbreak attacks on LLMs. It comprises (1) an evolving repository of adversarial prompts (jailbreak artifacts), (2) a dataset of 100 behaviors (original and sourced from Zou et al. 2023 and Mazeika et al. 2023/2024) aligned with OpenAI usage policies, (3) a standardized evaluation framework with explicit threat model, system prompts, chat templates, and scoring functions at https://github.com/JailbreakBench/jailbreakbench, and (4) a public leaderboard at https://jailbreakbench.github.io/ tracking attack and defense performance across LLMs. The work targets the lack of standardization, incomparable metrics, and non-reproducibility in prior jailbreaking evaluations.

Significance. If the released artifacts match the described components, the benchmark supplies a reproducible, community-maintainable standard that directly enables fair cross-paper comparisons and reduces reliance on proprietary or withheld prompts. The explicit public links, alignment with prior datasets, and ethical considerations section are concrete strengths that support ongoing use and extension by the field.

minor comments (3)
  1. [Abstract] The motivation paragraph on non-reproducibility would be strengthened by citing two or three concrete prior works that withhold prompts or rely on closed APIs.
  2. [Dataset description] Provide a short table or appendix listing the 100 behaviors by category (e.g., fraud, violence, privacy) and indicating which are original versus sourced, so readers can assess coverage balance.
  3. [Evaluation framework] Clarify the exact versioning policy for the evolving jailbreak-artifact repository and how new artifacts will be added without breaking the fixed 100-behavior test set.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, which correctly identifies the core components of JailbreakBench and its goals of improving standardization and reproducibility in jailbreak evaluations. We are pleased that the report raises no major objections and will address the three minor comments in revision.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces JailbreakBench as a new open benchmark with explicitly defined components: an evolving repository of adversarial prompts, a dataset of 100 behaviors (partly sourced from prior work but not used to derive the benchmark's claims), a standardized evaluation framework with threat model, system prompts, chat templates, and scoring functions, and a public leaderboard. There are no equations, predictions, fitted parameters, or first-principles derivations that could reduce to their own inputs by construction. Citations to Zou et al. and Mazeika et al. supply source behaviors only and do not carry the reproducibility or standardization claims, which rest on the released GitHub repository and explicit documentation. The central contribution is self-contained and externally verifiable via the open artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on standard domain assumptions about LLM usage and safety policies rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: the chosen threat model, system prompts, and scoring functions accurately reflect practical jailbreaking scenarios and success criteria.
    Invoked when defining the standardized evaluation framework.

pith-pipeline@v0.9.0 · 5609 in / 1239 out tokens · 40511 ms · 2026-05-15T06:04:40.711130+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges... we introduce JailbreakBench, an open-sourced benchmark with... an evolving repository of state-of-the-art adversarial prompts... a jailbreaking dataset comprising 100 behaviors... a standardized evaluation framework... and a leaderboard

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  2. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  3. The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

    cs.CR 2026-05 unverdicted novelty 7.0

    A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...

  4. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  5. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  6. The Great Pretender: A Stochasticity Problem in LLM Jailbreak

    cs.CR 2026-05 conditional novelty 6.0

    ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.

  7. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

  8. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  9. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  10. Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

    cs.AI 2026-04 unverdicted novelty 6.0

    Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

  11. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  12. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  13. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  14. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  15. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  16. Cross-Lingual Jailbreak Detection via Semantic Codebooks

    cs.CL 2026-04 unverdicted novelty 5.0

    Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.

  17. Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

    cs.CR 2026-04 unverdicted novelty 5.0

    Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...

  18. Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

  19. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

  20. Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

    cs.CL 2026-05 unverdicted novelty 2.0

    DExperts blocks explicit toxicity at 100% but drops to 98.5% on implicit hate speech while increasing generation latency by roughly 10x.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    Are you still on track!? catching llm task drift with activations

    Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Are you still on track!? catching llm task drift with activations. arXiv preprint arXiv:2406.00799, 2024

  2. [2]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [3]

    Croissant: A metadata format for ml-ready datasets

    Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, and Carole-Jean Wu. Croissant...

  4. [4]

    Jailbreak chat

    Alex Albert. Jailbreak chat. https://www.jailbreakchat.com, 2023. Accessed: 2024-02-20

  5. [5]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023

  6. [6]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024

  7. [7]

    Refusal in llms is mediated by a single direction

    Andy Arditi, Oscar Balcells, Aaquib Syed, Wes Gurnee, and Neel Nanda. Refusal in llms is mediated by a single direction. Alignment Forum, 2024. URL https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

  8. [8]

    Are aligned neural networks adversarially aligned?

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024

  9. [9]

    Non-determinism in gpt-4 is caused by sparse moe, 2023

    Sherman Chann. Non-determinism in gpt-4 is caused by sparse moe, 2023. URL https://152334h.github.io/blog/non-determinism-in-gpt-4/

  10. [10]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023

  11. [11]

    Robustbench: a standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. NeurIPS Datasets and Benchmarks Track, 2021

  12. [12]

    Multilingual jailbreak challenges in large language models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023

  13. [13]

    Attacking large language models with projected gradient descent

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024

  14. [14]

    Gemini v1.5 report

    Gemini Team. Gemini v1.5 report. Technical report, Google, 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

  15. [15]

    Query-based adversarial prompt generation

    Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-based adversarial prompt generation. arXiv preprint arXiv:2402.12329, 2024

  16. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021

  17. [17]

    Catastrophic jailbreak of open-source llms via exploiting generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987, 2023

  18. [18]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

  19. [19]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

  20. [20]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

  21. [21]

    Defending large language models against jailbreak attacks via semantic smoothing

    Jiabao Ji, Bairu Hou, Alexander Robey, George J Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending large language models against jailbreak attacks via semantic smoothing. arXiv preprint arXiv:2402.16192, 2024

  22. [22]

    Mixtral of experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  23. [23]

    Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models

    Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, and Haohan Wang. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299, 2024a

  24. [24]

    Jailbreaking large language models against moderation guardrails via cipher characters

    Haibo Jin, Andy Zhou, Joe D. Menke, and Haohan Wang. Jailbreaking large language models against moderation guardrails via cipher characters. arXiv preprint arXiv:2405.20413, 2024b

  25. [25]

    Certifying llm safety against adversarial prompting

    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023

  26. [26]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023

  27. [27]

    No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks

    Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, and Wenjie Li. No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks. arXiv preprint arXiv:2405.16229, 2024

  28. [28]

    Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment

    Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, and Kailong Wang. Lockpicking llms: A logit-based jailbreak using token-level manipulation. arXiv preprint arXiv:2405.13068, 2024

  29. [29]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023

  30. [30]

    Meta llama guard 2

    Llama Team. Meta llama guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024

  31. [31]

    A safe harbor for ai evaluation and red teaming

    Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, and Peter Henderson. A safe harb...

  32. [32]

    Tdc 2023 (llm edition): The trojan detection challenge

    Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, Adam Khoja, Fengqing Jiang, Aidan O'Gara, Ellie Sakhaee, Zhen Xiang, Arezoo Rajabi, Dan Hendrycks, Radha Poovendran, Bo Li, and David Forsyth. Tdc 2023 (llm edition): The trojan detection challenge. In NeurIPS Competition Track, 2023

  33. [33]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In ICML, 2024

  34. [34]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023

  35. [35]

    Jailbreaking chatgpt on release day

    Zvi Mowshowitz. Jailbreaking chatgpt on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day, 2022. Accessed: 2024-02-25

  36. [36]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  37. [37]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  38. [38]

    Navigating the safety landscape: Measuring risks in finetuning large language models

    Sheng-Hsuan Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau. Navigating the safety landscape: Measuring risks in finetuning large language models. arXiv preprint arXiv:2405.17374, 2024

  39. [39]

    Data cards: Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22. ACM, 2022. doi:10.1145/3531146.3533231

  40. [40]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  41. [41]

    Find the trojan: Universal backdoor detection in aligned llms

    Javier Rando, Stephen Casper, and Florian Tramer. Find the trojan: Universal backdoor detection in aligned llms. In SatML Challenge, 2024. URL https://github.com/ethz-spylab/rlhf_trojan_competition

  42. [42]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023

  43. [43]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  44. [44]

    Scalable and transferable black-box jailbreaks for language models via persona modulation

    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023

  45. [45]

    "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023

  46. [46]

    Pal: Proxy-guided black-box attack on large language models

    Chawin Sitawarin, Norman Mu, David Wagner, and Alexandre Araujo. Pal: Proxy-guided black-box attack on large language models. arXiv preprint arXiv:2402.09674, 2024

  47. [47]

    A strongreject for empty jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

  48. [48]

    rspeer/wordfreq: v3.0, September 2022

    Robyn Speer. rspeer/wordfreq: v3.0, September 2022. URL https://doi.org/10.5281/zenodo.7199437

  49. [49]

    Trustllm: Trustworthiness in large language models

    Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

  50. [50]

    All in how you ask for it: Simple black-box method for jailbreak attacks

    Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks. arXiv preprint arXiv:2401.09798, 2024

  51. [51]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  52. [52]

    On adaptive attacks to adversarial example defenses

    Florian Tramèr, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In NeurIPS, 2020

  53. [53]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  54. [54]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  55. [55]

    Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks

    Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks. arXiv preprint arXiv:2405.20099, 2024

  56. [56]

    Low-resource languages jailbreak gpt-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023

  57. [57]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023

  58. [58]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024

  59. [59]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  60. [60]

    Improved few-shot jailbreaking can circumvent aligned language models and their defenses

    Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288, 2024

  61. [61]

    Easyjailbreak: A unified framework for jailbreaking large language models, 2024

    Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, and Xuanjing Huang. Easyjailbreak: A unified framework for jailbreaking large language models, 2024

  62. [62]

    Promptbench: A unified library for evaluation of large language models

    Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. Promptbench: A unified library for evaluation of large language models. arXiv preprint arXiv:2312.07910, 2023

  63. [63]

    Randomness in neural network training: Characterizing the impact of tooling

    Donglin Zhuang, Xingyao Zhang, Shuaiwen Song, and Sara Hooker. Randomness in neural network training: Characterizing the impact of tooling. Proceedings of Machine Learning and Systems, 4:316--336, 2022

  64. [64]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023