Recognition: 2 theorem links
· Lean TheoremJailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Pith reviewed 2026-05-15 06:04 UTC · model grok-4.3
The pith
JailbreakBench supplies an open repository of adversarial prompts, a 100-behavior dataset, a fixed evaluation framework, and a public leaderboard to make jailbreak comparisons reproducible across models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JailbreakBench is an open benchmark that combines an evolving repository of jailbreak artifacts, a dataset of 100 behaviors, a standardized evaluation framework that specifies threat model, system prompts, and scoring functions, and a public leaderboard that tracks attack and defense performance across LLMs.
What carries the argument
The JailbreakBench evaluation framework, which fixes the threat model, chat templates, and scoring functions so that success rates become directly comparable across papers and models.
If this is right
- Attack papers can report success rates and query costs that other researchers can replicate exactly.
- The leaderboard will show which attacks remain effective as new LLMs and defenses are released.
- Defenses can be tested against the same evolving prompt repository instead of author-chosen subsets.
- New behaviors can be added to the dataset while keeping earlier results comparable through the fixed framework.
Where Pith is reading between the lines
- The benchmark may encourage defense papers to report robustness numbers on the same public artifacts rather than private test sets.
- Over time the leaderboard could reveal whether certain defense techniques generalize across the full range of behaviors or only on narrow subsets.
- Researchers working on multimodal or agent-based attacks could extend the same structure by adding new behavior categories.
- The open artifact repository creates a natural place to archive prompts that stop working, documenting the moving target of LLM safety.
Load-bearing premise
The selected 100 behaviors together with the chosen scoring functions capture the main real-world jailbreaking risks without systematic bias toward or against particular attack styles.
What would settle it
A newly published attack that reaches high success rates on current production LLMs yet shows low scores when run through the benchmark's fixed 100-behavior set and scoring rules.
read the original abstract
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JailbreakBench, an open-sourced benchmark for evaluating jailbreak attacks on LLMs. It comprises (1) an evolving repository of adversarial prompts (jailbreak artifacts), (2) a dataset of 100 behaviors (original and sourced from Zou et al. 2023 and Mazeika et al. 2023/2024) aligned with OpenAI usage policies, (3) a standardized evaluation framework with explicit threat model, system prompts, chat templates, and scoring functions at https://github.com/JailbreakBench/jailbreakbench, and (4) a public leaderboard at https://jailbreakbench.github.io/ tracking attack and defense performance across LLMs. The work targets the lack of standardization, incomparable metrics, and non-reproducibility in prior jailbreaking evaluations.
Significance. If the released artifacts match the described components, the benchmark supplies a reproducible, community-maintainable standard that directly enables fair cross-paper comparisons and reduces reliance on proprietary or withheld prompts. The explicit public links, alignment with prior datasets, and ethical considerations section are concrete strengths that support ongoing use and extension by the field.
minor comments (3)
- [Abstract] Abstract: the motivation paragraph on non-reproducibility would be strengthened by citing two or three concrete prior works that withhold prompts or rely on closed APIs.
- [Dataset description] Dataset section: provide a short table or appendix listing the 100 behaviors by category (e.g., fraud, violence, privacy) and indicating which are original versus sourced, to let readers assess coverage balance.
- [Evaluation framework] Evaluation framework: clarify the exact versioning policy for the evolving jailbreak-artifact repository and how new artifacts will be added without breaking the fixed 100-behavior test set.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of our work, which correctly identifies the core components of JailbreakBench and its goals of improving standardization and reproducibility in jailbreak evaluations. We are pleased that the referee recommends acceptance.
Circularity Check
No significant circularity identified
full rationale
The paper introduces JailbreakBench as a new open benchmark with explicitly defined components: an evolving repository of adversarial prompts, a dataset of 100 behaviors (partly sourced from prior work but not used to derive the benchmark's claims), a standardized evaluation framework with threat model/system prompts/chat templates/scoring functions, and a public leaderboard. No equations, predictions, fitted parameters, or first-principles derivations exist that could reduce to inputs by construction. Citations to Zou et al. and Mazeika et al. supply source behaviors only and do not load-bear the reproducibility or standardization claims, which rest on the released GitHub repository and explicit documentation. The central contribution is self-contained and externally verifiable via the open artifacts.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The chosen threat model, system prompts, and scoring functions accurately reflect practical jailbreaking scenarios and success criteria.
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges... we introduce JailbreakBench, an open-sourced benchmark with... an evolving repository of state-of-the-art adversarial prompts... a jailbreaking dataset comprising 100 behaviors... a standardized evaluation framework... and a leaderboard
-
Foundation.DimensionForcingdimension_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
-
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
-
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Cross-Lingual Jailbreak Detection via Semantic Codebooks
Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.
-
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...
-
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
-
Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
DExperts blocks explicit toxicity at 100% but drops to 98.5% on implicit hate speech while increasing generation latency by roughly 10x.
Reference graph
Works this paper leans on
-
[1]
Are you still on track!? catching llm task drift with activations
Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Are you still on track!? catching llm task drift with activations. arXiv preprint arXiv:2406.00799, 2024
-
[2]
AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
work page 2024
-
[3]
Croissant: A metadata format for ml-ready datasets
Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, and Carole-Jean Wu. Croissant...
-
[4]
Alex Albert. Jailbreak chat. https://www.jailbreakchat.com, 2023. Accessed: 2024-02-20
work page 2023
-
[5]
Detecting language model attacks with perplexity
Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023
-
[6]
Jailbreaking leading safety-aligned llms with simple adaptive attacks
Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024
-
[7]
Refusal in llms is mediated by a single direction
Andy Arditi, Oscar Balcells, Aaquib Syed, Wes Gurnee, and Neel Nanda. Refusal in llms is mediated by a single direction. Alignment Forum, 2024. URL https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
work page 2024
-
[8]
Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[9]
Non-determinism in gpt-4 is caused by sparse moe, 2023
Sherman Chann. Non-determinism in gpt-4 is caused by sparse moe, 2023. URL https://152334h.github.io/blog/non-determinism-in-gpt-4/
work page 2023
-
[10]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Robustbench: a standardized adversarial robustness benchmark
Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. NeurIPS Datasets and Benchmarks Track, 2021
work page 2021
-
[12]
Multilingual jailbreak challenges in large language models
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023
-
[13]
a ger, MHI Abdalla, Johannes Gasteiger, and Stephan G \
Simon Geisler, Tom Wollschl \"a ger, MHI Abdalla, Johannes Gasteiger, and Stephan G \"u nnemann. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024
-
[14]
Gemini Team . Gemini v1.5 report. Technical report, Google, 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
work page 2024
-
[15]
Query-based adversarial prompt generation
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tram \`e r, and Milad Nasr. Query-based adversarial prompt generation. arXiv preprint arXiv:2402.12329, 2024
-
[16]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021
work page 2021
-
[17]
Catastrophic jailbreak of open-source llms via exploiting generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987, 2023
-
[18]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Defending large language models against jailbreak attacks via semantic smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending large language models against jailbreak attacks via semantic smoothing. arXiv preprint arXiv:2402.16192, 2024
-
[22]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, and Haohan Wang. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299, 2024 a
-
[24]
Haibo Jin, Andy Zhou, Joe D. Menke, and Haohan Wang. Jailbreaking large language models against moderation guardrails via cipher characters. arXiv preprint arXiv:2405.20413, 2024 b
-
[25]
Certifying llm safety against adversarial prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023
-
[26]
Open sesame! universal black box jailbreaking of large language models
Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023
-
[27]
No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks
Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, and Wenjie Li. No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks. arXiv preprint arXiv:2405.16229, 2024
-
[28]
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, and Kailong Wang. Lockpicking llms: A logit-based jailbreak using token-level manipulation. arXiv preprint arXiv:2405.13068, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Llama Team . Meta llama guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024
work page 2024
-
[31]
A safe harbor for ai evaluation and red teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, and Peter Henderson. A safe harb...
-
[32]
Tdc 2023 (llm edition): The trojan detection challenge
Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, Adam Khoja, Fengqing Jiang, Aidan O'Gara, Ellie Sakhaee, Zhen Xiang, Arezoo Rajabi, Dan Hendrycks, Radha Poovendran, Bo Li, and David Forsyth. Tdc 2023 (llm edition): The trojan detection challenge. In NeurIPS Competition Track, 2023
work page 2023
-
[33]
Harmbench: A standardized evaluation framework for automated red teaming and robust refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In ICML, 2024
work page 2024
-
[34]
Tree of attacks: Jailbreaking black-box llms automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023
-
[35]
Jailbreaking chatgpt on release day
Zvi Mowshowitz. Jailbreaking chatgpt on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day, 2022. Accessed: 2024-02-25
work page 2022
- [36]
-
[37]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022
work page 2022
-
[38]
Navigating the safety landscape: Measuring risks in finetuning large language models
Sheng-Hsuan Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau. Navigating the safety landscape: Measuring risks in finetuning large language models. arXiv preprint arXiv:2405.17374, 2024
-
[39]
Data cards: Purposeful and transparent dataset documentation for responsible AI
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI . In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22. ACM, 2022. doi:10.1145/3531146.3533231
-
[40]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[41]
Find the trojan: Universal backdoor detection in aligned llms
Javier Rando, Stephen Casper, and Florian Tramer. Find the trojan: Universal backdoor detection in aligned llms. In SatML Challenge, 2024. URL https://github.com/ethz-spylab/rlhf_trojan_competition
work page 2024
-
[42]
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 1 0 (10), 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul R \"o ttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023
work page internal anchor Pith review arXiv 2023
-
[44]
Scalable and transferable black-box jailbreaks for language models via persona modulation
Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023
-
[45]
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023
-
[46]
Pal: Proxy-guided black-box attack on large language models
Chawin Sitawarin, Norman Mu, David Wagner, and Alexandre Araujo. Pal: Proxy-guided black-box attack on large language models. arXiv preprint arXiv:2402.09674, 2024
-
[47]
A strongreject for empty jailbreaks
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024
-
[48]
rspeer/wordfreq: v3.0, September 2022
Robyn Speer. rspeer/wordfreq: v3.0, September 2022. URL https://doi.org/10.5281/zenodo.7199437
-
[49]
Trustllm: Trustworthiness in large language models
Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024
-
[50]
All in how you ask for it: Simple black-box method for jailbreak attacks
Kazuhiro Takemoto. All in how you ask for it: Simple black-box method for jailbreak attacks. arXiv preprint arXiv:2401.09798, 2024
-
[51]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
On adaptive attacks to adversarial example defenses
Florian Tram \`e r, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In NeurIPS, 2020
work page 2020
-
[53]
Decodingtrust: A comprehensive assessment of trustworthiness in gpt models
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023
work page 2023
-
[54]
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks. arXiv preprint arXiv:2405.20099, 2024
-
[56]
Low-resource languages jailbreak gpt-4
Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023
-
[57]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023
work page internal anchor Pith review arXiv 2023
-
[58]
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024
-
[59]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Improved few-shot jailbreaking can circumvent aligned language models and their defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288, 2024
-
[61]
Easyjailbreak: A unified framework for jailbreaking large language models, 2024
Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, and Xuanjing Huang. Easyjailbreak: A unified framework for jailbreaking large language models, 2024
work page 2024
-
[62]
Promptbench: A unified library for evaluation of large language models
Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. Promptbench: A unified library for evaluation of large language models. arXiv preprint arXiv:2312.07910, 2023
-
[63]
Randomness in neural network training: Characterizing the impact of tooling
Donglin Zhuang, Xingyao Zhang, Shuaiwen Song, and Sara Hooker. Randomness in neural network training: Characterizing the impact of tooling. Proceedings of Machine Learning and Systems, 4: 0 316--336, 2022
work page 2022
-
[64]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.