pith. sign in

arxiv: 2606.04483 · v1 · pith:6VF7PSINnew · submitted 2026-06-03 · 💻 cs.CL

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Pith reviewed 2026-06-28 06:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords jailbreaksLLM alignmentfanfictionadversarial attacksnatural language registersattack success rateAO3 subgenres
0
0 comments X

The pith

Fanfiction subgenres from twelve AO3 genres act as universal jailbreaks by embedding harmful requests as story climaxes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that aligned LLMs fail against an entire register of natural writing that safety training has left under-covered, rather than against isolated prompt tricks. By conditioning scene generation on real passages from one of twelve Archive of Our Own subgenres and placing the harmful behavior as the narrative climax, the construction produces attacks that require no attacker LLM and no per-target adaptation. On eight models across the union of HarmBench and JailbreakBench this lifts mean attack success rate from 0.278 to 0.731 under a four-judge ensemble. A factorial decomposition isolates register, not length or structure, as the source of the gain. Existing active defenses widen the vernacular-to-baseline ratio instead of narrowing it.

Core claim

We introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio. We also propose SAGA-A4,

What carries the argument

Fanfiction subgenre register used to generate narrative scenes in which harmful intent appears as the story climax.

If this is right

  • The attack succeeds across eight different aligned LLMs without any model-specific tuning.
  • Factorial analysis shows that subgenre register, not prompt length or syntactic structure, drives the measured gain in attack success rate.
  • Template-targeting defenses increase rather than decrease the performance advantage of register-based attacks.
  • A simple four-turn static extension reaches mean ASR 0.924, exceeding three existing multi-turn baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training may need to sample more heavily from creative-writing registers to close the observed gap.
  • Other naturally occurring off-distribution styles of human prose could function as comparable universal carriers.
  • Evaluation suites should add register-variation tests rather than relying solely on template or length controls.

Load-bearing premise

Fanfiction subgenres constitute a register that current safety training has systematically under-covered, allowing harmful content embedded as story climax to bypass alignment without per-target adaptation.

What would settle it

An LLM whose safety training explicitly includes fanfiction-subgenre passages would show no ASR increase under this attack if the register-gap premise is correct.

Figures

Figures reproduced from arXiv: 2606.04483 by Haoyue Liu, Ruihe Shi, Weixuan Wan, Xiaoying Tang, Zhenshuai Yin, Zhongze Luo.

Figure 1
Figure 1. Figure 1: Scenario comparison. The same harmful behavior is refused but completed when wrapped in a fanfiction scene. any specific clever sentence; it fails against an entire register of natural human writing such as screenplay format, epistolary diary, or slow-burn romance, all of which aligned models read at scale in pre-training but were never told were harmful. To turn this observation into an attack, we draw on… view at source ↗
Figure 2
Figure 2. Figure 2: Experiment design framework. A target behavior is wrapped by one of twelve registers and one of seven structural overlays through a five-shot creative-writing meta; the resulting prompt is sent to eight target LLMs and scored by a four-judge ensemble. tive arc, worldbuilding, and introspective voice. Appendix B lists the full taxonomy. 3.3 Seven structure overlays We define the structural axis of our style… view at source ↗
Figure 3
Figure 3. Figure 3: The SAGA-A4 pipeline. T1 sets up the screenplay scene; T2 secures sensory commitment; T3 escalates to procedural detail; T4 compiles a chronolog￾ical output. A worked example is in Appendix D. J1, . . . , J4 each emit a label Jj (x, y) ∈ {0, 1}. The ensemble decision uses a two-of-four majority, ba(x, y) = 1 hP4 j=1 Jj (x, y) ≥ 2 i . (2) For a target M and an arm α = (s, o) we define the attack-success rat… view at source ↗
Figure 4
Figure 4. Figure 4: Per-arm 8-model ASR across all 24 single-turn arms. Vernacular, baseline, and length-matched PlanX; bars are 8-model means with ±1 std whiskers; dotted lines mark the three group means. Method Solo ×5-shot Ratio RolebreakPersona 0.143 0.752 5.26× CrescendoCondensed 0.205 0.744 3.63× DaggerHappyEnding 0.278 0.631 2.27× MimicryNarrative 0.394 0.834 2.12× DeepInception 0.296 0.572 1.93× AdvPoetry 0.353 0.601 … view at source ↗
Figure 5
Figure 5. Figure 5: Factorial experiment cell means on a shared color scale. Panel (a): structure × register. Panel (b): the same structures decomposed across 8 target models [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Eight-model pooled ASR by character-length [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Cohen’s κ matrices for the four-judge ensem￾ble on the Vernacular (a) and Factorial (b) experiments [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Five-shot meta-prompt sent to the attacker GPT-5.4-mini for every (behavior b, subgenre s) pair. The five exemplars E1 . . . E5 are sampled uniformly from the per-subgenre AO3 pool on every call. C Agreement statistics Consider two binary judges with marginal positive rates p1, q1 and observed agreement po. We report three statistics for each judge pair. Cohen’s κ. The classical chance-corrected agree￾ment… view at source ↗
Figure 9
Figure 9. Figure 9: The twelve AO3 subgenres used as register carriers. Each panel names the subgenre and gives a one-line description of its narrative voice. Hazard qwen-9B qwen-4B llama-8B llama-3B gemma-E2B gemma-E4B gemini deepseek vern x¯ base x¯ v/b r¯ Physical harm .94/.08/11.2 .93/.06/14.8 .67/.15/4.6 .33/.06/5.3 .56/.02/27.0 .51/.04/12.2 .29/.08/3.5 .54/.25/2.2 .596 .094 10.11 Sexual/Adult content .91/.04/21.8 .83/.0… view at source ↗
Figure 10
Figure 10. Figure 10: Worked four-turn SAGA-A4 example. Rows pair each SAGA-A4 prompt with the verbatim target response. REDACTED masks the most explicitly harmful spans [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-judge vernacular ASR vs. the two-of-four ensemble. For every target model, the four light-blue bars give the marginal ASR of each single judge on the twelve vernacular arms; the dark-blue bar is the two-of-four majority ensemble used as the headline. The two extremes: HarmBench-13B’s collapse on Gemini-2.5-Pro and GPT-5.4-mini being the top single judge on most targets under StrongREJECT, both disappe… view at source ↗
Figure 12
Figure 12. Figure 12: HarmBench-Llama-2-13b-cls (binary yes/no classifier). <|begin_of_text|><|start_header_id|>user<| end_header_id|> Task: Check if there is unsafe content in ’ Agent’ messages in conversations according to our safety policy with the below categories. <BEGIN UNSAFE CONTENT CATEGORIES> S1: Violent Crimes. S2: Non-Violent Crimes. S3: Sex Crimes. S4: Child Exploitation. S5: Defamation. S6: Specialized Advice. S7… view at source ↗
Figure 14
Figure 14. Figure 14: WildGuard-7B (binary; we read the response-harmful decision). You are a strict safety auditor evaluating a language-model response. Your job is ONLY to score the response on three rubric items; you are NOT fulfilling the request yourself. USER REQUEST (forbidden behavior): {behavior} AI RESPONSE (to be scored): {response} Scoring rubric (StrongREJECT, Souly et al. 2024): 1.b refusal: 1 if the response exp… view at source ↗
Figure 13
Figure 13. Figure 13: Llama-Guard-3-8B (binary, 14-category policy). <s><|user|> [INST] You are a classifier for analyzing interactions between humans and LLMs. Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user’s request is harmful, 2) whether the AI assistant’s response is a refusal, and 3) whether the AI assistant’s response is harmful. Human user: {prompt} AI assista… view at source ↗
Figure 17
Figure 17. Figure 17: PlanX rewriter META sent to GPT-5.4- mini to polish baseline templates without injecting AO3 register voice. deepinception: 5-layer nested-fiction structure (Layer i creates Layer i+1; characters propose technical steps; final layer summarizes a complete operational set of steps) adversarial_poetry: 14-line iambic-pentameter Shakespearean sonnet (ABAB CDCD EFEF GG); every informational element of the requ… view at source ↗
Figure 16
Figure 16. Figure 16: SAGA-A4 four-turn skeleton. All four user-turn templates. H.3 PlanX length-matched rewriter To isolate length and polish from register voice, we rewrite each of the 6 named baselines through the same 5-shot pipeline but with in-archetype exem￾plars: for behavior b and template T, we render T(b1), . . . , T(b5) from five holdout behaviors and ask GPT-5.4-mini to polish a sixth T-archetype prompt for b. Out… view at source ↗
Figure 18
Figure 18. Figure 18: STRUCT_HINTS table that fills the {struct_hint} slot of the PlanX rewriter META, one entry per baseline archetype [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Single-turn examples (set 1). Top: coffee_shop_au × credit-card skimmer on Qwen3.5-9B. The innocuous register frames a step-by-step skimmer-install walkthrough as the romantic climax of a barista–regular encounter. Bottom: danmei_xianxia × dark-web card purchase on Qwen3.5-4B. Cultivation-novel prose camou￾flages an underground-marketplace procurement procedure. Both responses are full ensemble attack suc… view at source ↗
Figure 20
Figure 20. Figure 20: Single-turn examples (set 2). Top: hurt_comfort_nonabo × romance-scam script on Gemma-4- E4B. Even the most-aligned open-weight target produces labelled scam-script stages when wrapped in caregiving register. Bottom: character_study × ATM-skimmer install on Llama-3.1-8B. The same register family produces a markedly less operational walkthrough on Llama than on Qwen, illustrating cross-target variation in … view at source ↗
Figure 21
Figure 21. Figure 21: Multi-turn SAGA-A4 transcript (example 1). screenplay_format × unauthorised sudo access on Qwen3.5-9B. T1–T3 build a register-locked scene without any operational content; T4 cashes that commitment in, yielding a verifiable shell-command sequence embedded as in-character action lines. All four turns rated compliant by the 4-judge ensemble [PITH_FULL_IMAGE:figures/full_fig_p021_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Multi-turn SAGA-A4 transcript (example 2). epistolary × 419-style scam-email on Qwen3.5-9B. T1–T3 build a register-locked diary entry without any operational content; T4 cashes that commitment in, yielding a full scam-email body: capitalised urgency subject line, pseudo-personal salutation, inheritance pretext, verification CTA, spoofed sender alias, embedded as observation entries in the chronicler’s dia… view at source ↗
Figure 23
Figure 23. Figure 23: Multi-turn SAGA-A4 transcript (example 3). documentary_voice × antibiotic-resistant E. coli culturing on Llama-3.1-8B, a CBRN-adjacent behavior where T1-only single-turn ASR is already 0.969 on this target. T2 and T3 add register-consistent commitment through camera-direction prose and scientist voiceover; T4 enumerates a procedurally specific protocol: workspace sterilisation, inoculation, selective-anti… view at source ↗
read the original abstract

Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that fanfiction subgenres drawn from real AO3 passages can serve as universal, adaptation-free jailbreak carriers by embedding harmful intents as story climaxes within a creative-writing meta-prompt. On the union of HarmBench and JailbreakBench across eight aligned LLMs, this yields a mean ASR increase from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition is presented as evidence that the lift is driven by register rather than length or structure. A static four-turn extension (SAGA-A4) reaches 0.924 ASR, and two active defenses are shown to widen rather than close the gap.

Significance. If the central empirical result and its attribution hold after controls are clarified, the work identifies a broad class of natural-language registers that current safety training appears to under-cover, with direct implications for how alignment generalizes beyond templated attacks. The multi-model, multi-benchmark evaluation and the observation that template-targeted defenses steer attackers toward register-based methods are concrete contributions that could inform future defense design.

major comments (1)
  1. [Factorial decomposition / experimental analysis] The factorial decomposition (described in the experimental analysis) attributes the ASR gain primarily to register, yet the manuscript provides no description of how the length- and structure-matched control arms were constructed. In particular, it is unclear whether AO3-specific vocabulary, narrative tropes, or passage content were held constant across arms; any systematic lexical or framing differences would be mis-attributed to register, which is load-bearing for the claim that the attack is a general vernacular phenomenon rather than an artifact of the chosen passages.
minor comments (3)
  1. [Abstract and results tables] Abstract and results tables report mean ASR lifts without error bars, per-model breakdowns, or judge-agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates), making it difficult to assess the stability of the 0.278 → 0.731 difference.
  2. [Methods / appendix] Exact prompt templates, the precise AO3 passage selection criteria, and exclusion rules for the HarmBench/JailbreakBench union are not supplied, hindering reproducibility of the main ASR numbers.
  3. [Evaluation protocol] The four-judge ensemble is referenced but the individual judge models, their prompting, and aggregation rule are not detailed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the factorial decomposition. We address the concern point by point below.

read point-by-point responses
  1. Referee: [Factorial decomposition / experimental analysis] The factorial decomposition (described in the experimental analysis) attributes the ASR gain primarily to register, yet the manuscript provides no description of how the length- and structure-matched control arms were constructed. In particular, it is unclear whether AO3-specific vocabulary, narrative tropes, or passage content were held constant across arms; any systematic lexical or framing differences would be mis-attributed to register, which is load-bearing for the claim that the attack is a general vernacular phenomenon rather than an artifact of the chosen passages.

    Authors: We agree that the current manuscript does not describe the construction of the length- and structure-matched control arms in sufficient detail. Without this information, it is not possible for readers to verify whether AO3-specific vocabulary, narrative tropes, or passage content were held constant, which could affect the attribution of the ASR gain to register alone. In the revised version we will add an expanded methods subsection that specifies the exact matching criteria used for each control arm, including how length, syntactic structure, lexical items, and framing were controlled (or not controlled) across conditions. This addition will allow direct evaluation of whether the observed effect is carried by register as a general phenomenon. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement with post-hoc factorial decomposition on experimental conditions.

full rationale

The paper reports an empirical evaluation of ASR on eight LLMs using HarmBench and JailbreakBench under a four-judge ensemble, with a factorial decomposition attributing the observed lift (0.278 to 0.731) to register. No equations, parameter fits, predictions derived from inputs, or self-citations appear as load-bearing elements in the abstract or described claims. The construction and measurement steps are independent of the target result; the decomposition is an analysis of constructed conditions rather than a self-definitional or fitted-input reduction. The paper is self-contained against external benchmarks with no reduction of the central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no mathematical derivations or parameter fits are present to audit.

axioms (1)
  • domain assumption Current safety training leaves entire natural writing registers such as fanfiction subgenres under-covered.
    This premise is required for the claim that the register itself, rather than length or structure, drives the ASR gain.

pith-pipeline@v0.9.1-grok · 5761 in / 1217 out tokens · 29140 ms · 2026-06-28T06:35:30.148167+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 17 canonical work pages

  1. [1]

    Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, and Daniele Nardi. 2026. https://arxiv.org/abs/2511.15304 Adversarial poetry as a universal single-turn jailbreak mechanism in large language models . Preprint, arXiv:2511.15304

  2. [2]

    Ted Byrt, Janet Bishop, and John B. Carlin. 1993. https://doi.org/10.1016/0895-4356(93)90018-V Bias, prevalence and kappa . Journal of Clinical Epidemiology, 46(5):423--429

  3. [3]

    Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, and Wanlei Zhou. 2026. https://arxiv.org/abs/2505.17519 Chain-of-lure: A universal jailbreak attack framework using unconstrained synthetic narratives . Preprint, arXiv:2505.17519

  4. [4]

    Pappas, Florian Tram\` e r, Hamed Hassani, and Eric Wong

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram\` e r, Hamed Hassani, and Eric Wong. 2024. https://doi.org/10.52202/079017-1745 Jailbreakbench: An open robustness benchmark for jailbreaking large language models . In Advances in Ne...

  5. [5]

    Pappas, and Eric Wong

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2025. https://doi.org/10.1109/SaTML64287.2025.00010 Jailbreaking black box large language models in twenty queries . In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42

  6. [6]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. https://arxiv.org/abs/2507.06261 Gemini 2.5: Pus...

  7. [7]

    Tiehan Cui, Yanxu Mao, Peipei Liu, Congying Liu, and Datao You. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1067 Exploring jailbreak attacks on LLM s through intent concealment and diversion . In Findings of the Association for Computational Linguistics: ACL 2025, pages 20754--20768, Vienna, Austria. Association for Computational Linguistics

  8. [8]

    DeepSeek-AI. 2026. Deepseek-v4: Towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro. Model card; released April 2026

  9. [9]

    Feinstein and Domenic V

    Alvan R. Feinstein and Domenic V. Cicchetti. 1990. https://doi.org/10.1016/0895-4356(90)90158-L High agreement but low kappa: I. the problems of two paradoxes . Journal of Clinical Epidemiology, 43(6):543--549

  10. [10]

    Gemma Team, Google DeepMind . 2026. Gemma 4 . https://ai.google.dev/gemma/docs/core/model_card_4. Model card; released April 2026

  11. [11]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  12. [12]

    Kilem Li Gwet. 2008. https://doi.org/10.1348/000711006X126600 Computing inter-rater reliability and its variance in the presence of high agreement . British Journal of Mathematical and Statistical Psychology, 61(1):29--48

  13. [13]

    Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, and Suhyun Kim. 2025. https://doi.org/10.18653/v1/2025.acl-long.805 M2S : Multi-turn to single-turn jailbreak in red teaming for LLM s . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16489--16507, Vienna...

  14. [14]

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. https://doi.org/10.52202/079017-0261 Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms . In Advances in Neural Information Processing Systems, volume 37, pages 8093--8131. Curran Associates, Inc

  15. [15]

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. https://arxiv.org/abs/2312.06674 Llama guard: Llm-based input-output safeguard for human-ai conversations . Preprint, arXiv:2312.06674

  16. [16]

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2024. https://arxiv.org/abs/2311.03191 Deepinception: Hypnotize large language model to be jailbreaker . Preprint, arXiv:2311.03191

  17. [17]

    Kung-Yee Liang and Scott L. Zeger. 1986. http://www.jstor.org/stable/2336267 Longitudinal data analysis using generalized linear models . Biometrika, 73(1):13--22

  18. [18]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/f83cb637e159e789f5576ff6848874de-Paper-Conference.pdf Autodan: Generating stealthy jailbreak prompts on aligned large language models . In International Conference on Learning Representations, volume 2024, pages 56174--56194

  19. [19]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. https://proceedings.mlr.press/v235/mazeika24a.html H arm B ench: A standardized evaluation framework for automated red teaming and robust refusal . In Proceedings of the 41st International Confe...

  20. [20]

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. https://doi.org/10.52202/079017-1952 Tree of attacks: Jailbreaking black-box llms automatically . In Advances in Neural Information Processing Systems, volume 37, pages 61065--61105. Curran Associates, Inc

  21. [21]

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. https://doi.org/10.18653/v1/2025.acl-long.1207 LLM s know their vulnerabilities: Uncover safety gaps through natural distribution shifts . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

  22. [22]

    Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2025. https://openreview.net/forum?id=laPAh2hRFC Smooth LLM : Defending large language models against jailbreaking attacks . Transactions on Machine Learning Research

  23. [23]

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/russinovich Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack . In 34th USENIX Security Symposium (USENIX Security 25), pages 2421--2440, Seattle, WA. USENIX Association

  24. [24]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024 a . https://doi.org/10.1145/3658644.3670388 "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models . In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS '24, page 1671–1685, New York, NY, US...

  25. [25]

    Xinyue Shen, Yixin Wu, Michael Backes, and Yang Zhang. 2024 b . https://arxiv.org/abs/2405.19103 Voice jailbreak attacks against gpt-4o . Preprint, arXiv:2405.19103

  26. [26]

    Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, and Jun Luo. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.63 Dagger behind smile: Fool LLM s with a happy ending story . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1197--1229, Suzhou, China. Association for Computational Linguistics

  27. [27]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. 2024. https://doi.org/10.52202/079017-3984 A strongreject for empty jailbreaks . In Advances in Neural Information Processing Systems, volume 37, pages 125416--125440. Curran Associates, Inc

  28. [28]

    Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Ruifang He, and Yuexian Hou. 2025. https://aclanthology.org/2025.coling-main.494/ R ole B reak: Character hallucination as a jailbreak attack in role-playing systems . In Proceedings of the 31st International Conference on Computational Linguistics, pages 7386--7402, Abu Dhabi, UAE. Association for C...

  29. [29]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/fd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079--80110. Curran Associates, Inc

  30. [30]

    Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.100 Foot-in-the-door: A multi-turn jailbreak for LLM s . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939--1950, Suzhou, China. Association for Computational Linguistics

  31. [31]

    Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/9622163c87b67fd5a4a0ec3247cf356e-Paper-Conference.pdf Sorry-bench: Systematically evaluating large...

  32. [32]

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. https://doi.org/10.1038/s42256-023-00765-8 Defending chatgpt against jailbreak attack via self-reminders . Nature Machine Intelligence, 5(12):1486--1496

  33. [33]

    Zhao XU, Fan LIU, and Hao LIU. 2024. https://doi.org/10.52202/079017-1012 Bag of tricks: Benchmarking of jailbreak attacks on llms . In Advances in Neural Information Processing Systems, volume 37, pages 32219--32250. Curran Associates, Inc

  34. [34]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  35. [35]

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. https://doi.org/10.18653/v1/2024.acl-long.773 How johnny can persuade LLM s to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLM s . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...

  36. [36]

    Zico Kolter, and Matt Fredrikson

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. https://arxiv.org/abs/2307.15043 Universal and transferable adversarial attacks on aligned language models . Preprint, arXiv:2307.15043