Pith · machine review for the scientific record

arxiv: 2404.01318 · v5 · submitted 2024-03-28 · 💻 cs.CR · cs.LG

Recognition: 2 Lean theorem links

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:04 UTC · model grok-4.3

classification: 💻 cs.CR · cs.LG
keywords: jailbreaking · large language models · adversarial prompts · benchmark · AI safety · robustness evaluation · reproducibility

The pith

JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JailbreakBench to fix inconsistent metrics, hidden prompts, and non-reproducible results that currently block progress on measuring jailbreak attacks against large language models. It packages an evolving set of state-of-the-art adversarial prompts; a dataset of 100 behaviors, both original and drawn from prior work, aligned with OpenAI's usage policies; and a uniform threat model, chat templates, and scoring rules hosted in open code. A public leaderboard then records attack success rates and defense performance on multiple LLMs so that new methods can be added and ranked over time. A sympathetic reader would care because shared, verifiable numbers replace the current practice of each paper reporting incomparable success rates on private prompts. The authors position the release as a net community benefit after weighing the ethical risks of distributing the artifacts.
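To make the moving parts concrete, here is a minimal Python sketch of how the described components could fit together: a record from the 100-behavior dataset, a stored jailbreak artifact, and a replay of that artifact scored by a shared judge. The class names, fields, and the query_target stub are illustrative assumptions, not the actual jailbreakbench package API (the real interface lives in the linked GitHub repository).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Behavior:
    """One entry of the 100-behavior dataset (fields are hypothetical)."""
    identifier: str   # e.g. "phishing_email"
    goal: str         # natural-language description of the harmful request
    source: str       # "original" or the prior work it was drawn from

@dataclass
class JailbreakArtifact:
    """A stored adversarial prompt produced by one attack for one behavior."""
    behavior_id: str
    attack_method: str   # e.g. "PAIR" or "GCG"
    prompt: str          # the adversarial prompt itself

def query_target(prompt: str) -> str:
    """Placeholder for a call to the target LLM under the fixed chat template."""
    raise NotImplementedError("wire this to a real model endpoint")

def replay_artifact(artifact: JailbreakArtifact,
                    behaviors: List[Behavior],
                    judge: Callable[[str, str], bool]) -> bool:
    """Replay one stored artifact and score the response with the shared judge."""
    behavior = next(b for b in behaviors if b.identifier == artifact.behavior_id)
    response = query_target(artifact.prompt)
    return judge(behavior.goal, response)  # True means the jailbreak was judged successful
```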

Core claim

JailbreakBench is an open benchmark that combines an evolving repository of jailbreak artifacts, a dataset of 100 behaviors, a standardized evaluation framework that specifies threat model, system prompts, and scoring functions, and a public leaderboard that tracks attack and defense performance across LLMs.

What carries the argument

The JailbreakBench evaluation framework, which fixes the threat model, chat templates, and scoring functions so that success rates become directly comparable across papers and models.
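A minimal sketch of what "fixed" buys in practice, assuming a binary judge: the configuration below is pinned once for every submission, and the attack success rate is computed identically for every method. The field names and values are placeholders, not the benchmark's actual settings.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class EvaluationConfig:
    """Settings pinned once and shared by every submission (all values here are placeholders)."""
    system_prompt: str    # fixed system prompt handed to the target model
    chat_template: str    # fixed formatting of user/assistant turns
    max_new_tokens: int   # fixed generation budget
    judge_name: str       # which judge/classifier scores responses

CONFIG = EvaluationConfig(
    system_prompt="You are a helpful assistant.",  # placeholder, not the benchmark's prompt
    chat_template="target-model-default",          # placeholder
    max_new_tokens=150,                            # placeholder
    judge_name="llm-judge",                        # placeholder
)

def attack_success_rate(results: Dict[str, bool]) -> float:
    """Fraction of behaviors judged jailbroken under the shared config."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Because the config and scoring function are frozen, two papers that report
# attack_success_rate over the same 100 behaviors are directly comparable.
example = {"behavior_01": True, "behavior_02": False, "behavior_03": True}
print(f"ASR = {attack_success_rate(example):.2f}")  # prints ASR = 0.67
```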

If this is right

  • Attack papers can report success rates and query costs that other researchers can replicate exactly.
  • The leaderboard will show which attacks remain effective as new LLMs and defenses are released.
  • Defenses can be tested against the same evolving prompt repository instead of author-chosen subsets.
  • New behaviors can be added to the dataset while keeping earlier results comparable through the fixed framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark may encourage defense papers to report robustness numbers on the same public artifacts rather than private test sets.
  • Over time the leaderboard could reveal whether certain defense techniques generalize across the full range of behaviors or only on narrow subsets.
  • Researchers working on multimodal or agent-based attacks could extend the same structure by adding new behavior categories.
  • The open artifact repository creates a natural place to archive prompts that stop working, documenting the moving target of LLM safety.

Load-bearing premise

The selected 100 behaviors together with the chosen scoring functions capture the main real-world jailbreaking risks without systematic bias toward or against particular attack styles.

What would settle it

A newly published attack that reaches high success rates on current production LLMs yet shows low scores when run through the benchmark's fixed 100-behavior set and scoring rules.

Original abstract

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces JailbreakBench, an open-sourced benchmark for evaluating jailbreak attacks on LLMs. It comprises (1) an evolving repository of adversarial prompts (jailbreak artifacts), (2) a dataset of 100 behaviors (original and sourced from Zou et al. 2023 and Mazeika et al. 2023/2024) aligned with OpenAI usage policies, (3) a standardized evaluation framework with explicit threat model, system prompts, chat templates, and scoring functions at https://github.com/JailbreakBench/jailbreakbench, and (4) a public leaderboard at https://jailbreakbench.github.io/ tracking attack and defense performance across LLMs. The work targets the lack of standardization, incomparable metrics, and non-reproducibility in prior jailbreaking evaluations.

Significance. If the released artifacts match the described components, the benchmark supplies a reproducible, community-maintainable standard that directly enables fair cross-paper comparisons and reduces reliance on proprietary or withheld prompts. The explicit public links, alignment with prior datasets, and ethical considerations section are concrete strengths that support ongoing use and extension by the field.

minor comments (3)
  1. [Abstract] The motivation paragraph on non-reproducibility would be strengthened by citing two or three concrete prior works that withhold prompts or rely on closed APIs.
  2. [Dataset description] Provide a short table or appendix listing the 100 behaviors by category (e.g., fraud, violence, privacy) and indicating which are original versus sourced, so readers can assess coverage balance.
  3. [Evaluation framework] Clarify the exact versioning policy for the evolving jailbreak-artifact repository and how new artifacts will be added without breaking the fixed 100-behavior test set.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, which correctly identifies the core components of JailbreakBench and its goals of improving standardization and reproducibility in jailbreak evaluations. We are pleased that the report raises no major objections and will address the three minor comments in revision.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces JailbreakBench as a new open benchmark with explicitly defined components: an evolving repository of adversarial prompts, a dataset of 100 behaviors (partly sourced from prior work but not used to derive the benchmark's claims), a standardized evaluation framework with threat model, system prompts, chat templates, and scoring functions, and a public leaderboard. There are no equations, predictions, fitted parameters, or first-principles derivations that could reduce to their own inputs by construction. Citations to Zou et al. and Mazeika et al. supply source behaviors only and do not carry the reproducibility or standardization claims, which rest on the released GitHub repository and explicit documentation. The central contribution is self-contained and externally verifiable via the open artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on standard domain assumptions about LLM usage and safety policies rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: the chosen threat model, system prompts, and scoring functions accurately reflect practical jailbreaking scenarios and success criteria.
    Invoked when defining the standardized evaluation framework.

pith-pipeline@v0.9.0 · 5609 in / 1239 out tokens · 40511 ms · 2026-05-15T06:04:40.711130+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges... we introduce JailbreakBench, an open-sourced benchmark with... an evolving repository of state-of-the-art adversarial prompts... a jailbreaking dataset comprising 100 behaviors... a standardized evaluation framework... and a leaderboard

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  2. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  3. The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

    cs.CR 2026-05 unverdicted novelty 7.0

    A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...

  4. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  5. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  6. The Great Pretender: A Stochasticity Problem in LLM Jailbreak

    cs.CR 2026-05 conditional novelty 6.0

    ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.

  7. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

  8. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  9. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  10. Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

    cs.AI 2026-04 unverdicted novelty 6.0

    Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

  11. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  12. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  13. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  14. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  15. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  16. Cross-Lingual Jailbreak Detection via Semantic Codebooks

    cs.CL 2026-04 unverdicted novelty 5.0

    Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.

  17. Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

    cs.CR 2026-04 unverdicted novelty 5.0

    Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...

  18. Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

  19. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

  20. Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

    cs.CL 2026-05 unverdicted novelty 2.0

    DExperts blocks explicit toxicity at 100% but drops to 98.5% on implicit hate speech while increasing generation latency by roughly 10x.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    Are you still on track!? catching llm task drift with activations

    Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Are you still on track!? catching llm task drift with activations. arXiv preprint arXiv:2406.00799, 2024

  2. [2]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [3]

    Croissant: A metadata format for ml-ready datasets

    Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, and Carole-Jean Wu. Croissant...

  4. [4]

    Jailbreak chat

    Alex Albert. Jailbreak chat. https://www.jailbreakchat.com, 2023. Accessed: 2024-02-20

  5. [5]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023

  6. [6]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024

  7. [7]

    Refusal in llms is mediated by a single direction

    Andy Arditi, Oscar Balcells, Aaquib Syed, Wes Gurnee, and Neel Nanda. Refusal in llms is mediated by a single direction. Alignment Forum, 2024. URL https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

  8. [8]

    Are aligned neural networks adversarially aligned?

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024

  9. [9]

    Non-determinism in gpt-4 is caused by sparse moe, 2023

    Sherman Chann. Non-determinism in gpt-4 is caused by sparse moe, 2023. URL https://152334h.github.io/blog/non-determinism-in-gpt-4/

  10. [10]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023

  11. [11]

    Robustbench: a standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. NeurIPS Datasets and Benchmarks Track, 2021

  12. [12]

    Multilingual jailbreak challenges in large language models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023

  13. [13]

    Attacking large language models with projected gradient descent

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024

  14. [14]

    Gemini v1.5 report

    Gemini Team. Gemini v1.5 report. Technical report, Google, 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

  15. [15]

    Query-based adversarial prompt generation

    Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-based adversarial prompt generation. arXiv preprint arXiv:2402.12329, 2024

  16. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021

  17. [17]

    Catastrophic jailbreak of open-source llms via exploiting generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987, 2023

  18. [18]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

  19. [19]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

  20. [20]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

  21. [21]

    Defending large language models against jailbreak attacks via semantic smoothing

    Jiabao Ji, Bairu Hou, Alexander Robey, George J Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending large language models against jailbreak attacks via semantic smoothing. arXiv preprint arXiv:2402.16192, 2024

  22. [22]

    Mixtral of experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  23. [23]

    Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models

    Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, and Haohan Wang. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299, 2024a

  24. [24]

    Jailbreaking large language models against moderation guardrails via cipher characters

    Haibo Jin, Andy Zhou, Joe D. Menke, and Haohan Wang. Jailbreaking large language models against moderation guardrails via cipher characters. arXiv preprint arXiv:2405.20413, 2024b

  25. [25]

    Certifying llm safety against adversarial prompting

    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023

  26. [26]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023

  27. [27]

    No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks

    Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, and Wenjie Li. No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks. arXiv preprint arXiv:2405.16229, 2024

  28. [28]

    Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment

    Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, and Kailong Wang. Lockpicking llms: A logit-based jailbreak using token-level manipulation. arXiv preprint arXiv:2405.13068, 2024

  29. [29]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023

  30. [30]

    Meta llama guard 2

    Llama Team. Meta llama guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024

  31. [31]

    A safe harbor for ai evaluation and red teaming

    Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, and Peter Henderson. A safe harb...

  32. [32]

    Tdc 2023 (llm edition): The trojan detection challenge

    Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, Adam Khoja, Fengqing Jiang, Aidan O'Gara, Ellie Sakhaee, Zhen Xiang, Arezoo Rajabi, Dan Hendrycks, Radha Poovendran, Bo Li, and David Forsyth. Tdc 2023 (llm edition): The trojan detection challenge. In NeurIPS Competition Track, 2023

  33. [33]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In ICML, 2024

  34. [34]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023

  35. [35]

    Jailbreaking chatgpt on release day

    Zvi Mowshowitz. Jailbreaking chatgpt on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day, 2022. Accessed: 2024-02-25

  36. [36]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  37. [37]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  38. [38]

    Navigating the safety landscape: Measuring risks in finetuning large language models

    Sheng-Hsuan Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau. Navigating the safety landscape: Measuring risks in finetuning large language models. arXiv preprint arXiv:2405.17374, 2024

  39. [39]

    Data cards: Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22. ACM, 2022. doi:10.1145/3531146.3533231

  40. [40]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  41. [41]

    Find the trojan: Universal backdoor detection in aligned llms

    Javier Rando, Stephen Casper, and Florian Tramer. Find the trojan: Universal backdoor detection in aligned llms. In SatML Challenge, 2024. URL https://github.com/ethz-spylab/rlhf_trojan_competition

  42. [42]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023

  43. [43]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  44. [44]

    Scalable and transferable black-box jailbreaks for language models via persona modulation

    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023

  45. [45]

    "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023

  46. [46]

    Pal: Proxy-guided black-box attack on large language models

    Chawin Sitawarin, Norman Mu, David Wagner, and Alexandre Araujo. Pal: Proxy-guided black-box attack on large language models. arXiv preprint arXiv:2402.09674, 2024

  47. [47]

    A strongreject for empty jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

  48. [48]

    rspeer/wordfreq: v3.0, September 2022

    Robyn Speer. rspeer/wordfreq: v3.0, September 2022. URL https://doi.org/10.5281/zenodo.7199437

  49. [49]

    Trustllm: Trustworthiness in large language models

    Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

  50. [50]

    All in how you ask for it: Simple black-box method for jailbreak attacks

    Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks. arXiv preprint arXiv:2401.09798, 2024

  51. [51]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  52. [52]

    On adaptive attacks to adversarial example defenses

    Florian Tramèr, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In NeurIPS, 2020

  53. [53]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  54. [54]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  55. [55]

    Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks

    Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks. arXiv preprint arXiv:2405.20099, 2024

  56. [56]

    Low-resource languages jailbreak gpt-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023

  57. [57]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023

  58. [58]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024

  59. [59]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  60. [60]

    Improved few-shot jailbreaking can circumvent aligned language models and their defenses

    Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288, 2024

  61. [61]

    Easyjailbreak: A unified framework for jailbreaking large language models, 2024

    Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, and Xuanjing Huang. Easyjailbreak: A unified framework for jailbreaking large language models, 2024

  62. [62]

    Promptbench: A unified library for evaluation of large language models

    Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. Promptbench: A unified library for evaluation of large language models. arXiv preprint arXiv:2312.07910, 2023

  63. [63]

    Randomness in neural network training: Characterizing the impact of tooling

    Donglin Zhuang, Xingyao Zhang, Shuaiwen Song, and Sara Hooker. Randomness in neural network training: Characterizing the impact of tooling. Proceedings of Machine Learning and Systems, 4:316--336, 2022

  64. [64]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023