Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Pith reviewed 2026-06-28 06:35 UTC · model grok-4.3
The pith
Fanfiction subgenres from twelve AO3 genres act as universal jailbreaks by embedding harmful requests as story climaxes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio. We also propose SAGA-A4,
What carries the argument
Fanfiction subgenre register used to generate narrative scenes in which harmful intent appears as the story climax.
If this is right
- The attack succeeds across eight different aligned LLMs without any model-specific tuning.
- Factorial analysis shows that subgenre register, not prompt length or syntactic structure, drives the measured gain in attack success rate.
- Template-targeting defenses increase rather than decrease the performance advantage of register-based attacks.
- A simple four-turn static extension reaches mean ASR 0.924, exceeding three existing multi-turn baselines.
Where Pith is reading between the lines
- Safety training may need to sample more heavily from creative-writing registers to close the observed gap.
- Other naturally occurring off-distribution styles of human prose could function as comparable universal carriers.
- Evaluation suites should add register-variation tests rather than relying solely on template or length controls.
Load-bearing premise
Fanfiction subgenres constitute a register that current safety training has systematically under-covered, allowing harmful content embedded as story climax to bypass alignment without per-target adaptation.
What would settle it
An LLM whose safety training explicitly includes fanfiction-subgenre passages would show no ASR increase under this attack if the register-gap premise is correct.
Figures
read the original abstract
Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fanfiction subgenres drawn from real AO3 passages can serve as universal, adaptation-free jailbreak carriers by embedding harmful intents as story climaxes within a creative-writing meta-prompt. On the union of HarmBench and JailbreakBench across eight aligned LLMs, this yields a mean ASR increase from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition is presented as evidence that the lift is driven by register rather than length or structure. A static four-turn extension (SAGA-A4) reaches 0.924 ASR, and two active defenses are shown to widen rather than close the gap.
Significance. If the central empirical result and its attribution hold after controls are clarified, the work identifies a broad class of natural-language registers that current safety training appears to under-cover, with direct implications for how alignment generalizes beyond templated attacks. The multi-model, multi-benchmark evaluation and the observation that template-targeted defenses steer attackers toward register-based methods are concrete contributions that could inform future defense design.
major comments (1)
- [Factorial decomposition / experimental analysis] The factorial decomposition (described in the experimental analysis) attributes the ASR gain primarily to register, yet the manuscript provides no description of how the length- and structure-matched control arms were constructed. In particular, it is unclear whether AO3-specific vocabulary, narrative tropes, or passage content were held constant across arms; any systematic lexical or framing differences would be mis-attributed to register, which is load-bearing for the claim that the attack is a general vernacular phenomenon rather than an artifact of the chosen passages.
minor comments (3)
- [Abstract and results tables] Abstract and results tables report mean ASR lifts without error bars, per-model breakdowns, or judge-agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates), making it difficult to assess the stability of the 0.278 → 0.731 difference.
- [Methods / appendix] Exact prompt templates, the precise AO3 passage selection criteria, and exclusion rules for the HarmBench/JailbreakBench union are not supplied, hindering reproducibility of the main ASR numbers.
- [Evaluation protocol] The four-judge ensemble is referenced but the individual judge models, their prompting, and aggregation rule are not detailed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the factorial decomposition. We address the concern point by point below.
read point-by-point responses
-
Referee: [Factorial decomposition / experimental analysis] The factorial decomposition (described in the experimental analysis) attributes the ASR gain primarily to register, yet the manuscript provides no description of how the length- and structure-matched control arms were constructed. In particular, it is unclear whether AO3-specific vocabulary, narrative tropes, or passage content were held constant across arms; any systematic lexical or framing differences would be mis-attributed to register, which is load-bearing for the claim that the attack is a general vernacular phenomenon rather than an artifact of the chosen passages.
Authors: We agree that the current manuscript does not describe the construction of the length- and structure-matched control arms in sufficient detail. Without this information, it is not possible for readers to verify whether AO3-specific vocabulary, narrative tropes, or passage content were held constant, which could affect the attribution of the ASR gain to register alone. In the revised version we will add an expanded methods subsection that specifies the exact matching criteria used for each control arm, including how length, syntactic structure, lexical items, and framing were controlled (or not controlled) across conditions. This addition will allow direct evaluation of whether the observed effect is carried by register as a general phenomenon. revision: yes
Circularity Check
No circularity: empirical measurement with post-hoc factorial decomposition on experimental conditions.
full rationale
The paper reports an empirical evaluation of ASR on eight LLMs using HarmBench and JailbreakBench under a four-judge ensemble, with a factorial decomposition attributing the observed lift (0.278 to 0.731) to register. No equations, parameter fits, predictions derived from inputs, or self-citations appear as load-bearing elements in the abstract or described claims. The construction and measurement steps are independent of the target result; the decomposition is an analysis of constructed conditions rather than a self-definitional or fitted-input reduction. The paper is self-contained against external benchmarks with no reduction of the central claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Current safety training leaves entire natural writing registers such as fanfiction subgenres under-covered.
Reference graph
Works this paper leans on
-
[1]
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, and Daniele Nardi. 2026. https://arxiv.org/abs/2511.15304 Adversarial poetry as a universal single-turn jailbreak mechanism in large language models . Preprint, arXiv:2511.15304
arXiv 2026
-
[2]
Ted Byrt, Janet Bishop, and John B. Carlin. 1993. https://doi.org/10.1016/0895-4356(93)90018-V Bias, prevalence and kappa . Journal of Clinical Epidemiology, 46(5):423--429
-
[3]
Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, and Wanlei Zhou. 2026. https://arxiv.org/abs/2505.17519 Chain-of-lure: A universal jailbreak attack framework using unconstrained synthetic narratives . Preprint, arXiv:2505.17519
arXiv 2026
-
[4]
Pappas, Florian Tram\` e r, Hamed Hassani, and Eric Wong
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram\` e r, Hamed Hassani, and Eric Wong. 2024. https://doi.org/10.52202/079017-1745 Jailbreakbench: An open robustness benchmark for jailbreaking large language models . In Advances in Ne...
-
[5]
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2025. https://doi.org/10.1109/SaTML64287.2025.00010 Jailbreaking black box large language models in twenty queries . In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. https://arxiv.org/abs/2507.06261 Gemini 2.5: Pus...
Pith/arXiv arXiv 2025
-
[7]
Tiehan Cui, Yanxu Mao, Peipei Liu, Congying Liu, and Datao You. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1067 Exploring jailbreak attacks on LLM s through intent concealment and diversion . In Findings of the Association for Computational Linguistics: ACL 2025, pages 20754--20768, Vienna, Austria. Association for Computational Linguistics
-
[8]
DeepSeek-AI. 2026. Deepseek-v4: Towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro. Model card; released April 2026
2026
-
[9]
Alvan R. Feinstein and Domenic V. Cicchetti. 1990. https://doi.org/10.1016/0895-4356(90)90158-L High agreement but low kappa: I. the problems of two paradoxes . Journal of Clinical Epidemiology, 43(6):543--549
-
[10]
Gemma Team, Google DeepMind . 2026. Gemma 4 . https://ai.google.dev/gemma/docs/core/model_card_4. Model card; released April 2026
2026
-
[11]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...
Pith/arXiv arXiv 2024
-
[12]
Kilem Li Gwet. 2008. https://doi.org/10.1348/000711006X126600 Computing inter-rater reliability and its variance in the presence of high agreement . British Journal of Mathematical and Statistical Psychology, 61(1):29--48
-
[13]
Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, and Suhyun Kim. 2025. https://doi.org/10.18653/v1/2025.acl-long.805 M2S : Multi-turn to single-turn jailbreak in red teaming for LLM s . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16489--16507, Vienna...
-
[14]
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. https://doi.org/10.52202/079017-0261 Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms . In Advances in Neural Information Processing Systems, volume 37, pages 8093--8131. Curran Associates, Inc
-
[15]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. https://arxiv.org/abs/2312.06674 Llama guard: Llm-based input-output safeguard for human-ai conversations . Preprint, arXiv:2312.06674
Pith/arXiv arXiv 2023
-
[16]
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2024. https://arxiv.org/abs/2311.03191 Deepinception: Hypnotize large language model to be jailbreaker . Preprint, arXiv:2311.03191
Pith/arXiv arXiv 2024
-
[17]
Kung-Yee Liang and Scott L. Zeger. 1986. http://www.jstor.org/stable/2336267 Longitudinal data analysis using generalized linear models . Biometrika, 73(1):13--22
arXiv 1986
-
[18]
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/f83cb637e159e789f5576ff6848874de-Paper-Conference.pdf Autodan: Generating stealthy jailbreak prompts on aligned large language models . In International Conference on Learning Representations, volume 2024, pages 56174--56194
2024
-
[19]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. https://proceedings.mlr.press/v235/mazeika24a.html H arm B ench: A standardized evaluation framework for automated red teaming and robust refusal . In Proceedings of the 41st International Confe...
2024
-
[20]
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. https://doi.org/10.52202/079017-1952 Tree of attacks: Jailbreaking black-box llms automatically . In Advances in Neural Information Processing Systems, volume 37, pages 61065--61105. Curran Associates, Inc
-
[21]
Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. https://doi.org/10.18653/v1/2025.acl-long.1207 LLM s know their vulnerabilities: Uncover safety gaps through natural distribution shifts . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...
-
[22]
Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2025. https://openreview.net/forum?id=laPAh2hRFC Smooth LLM : Defending large language models against jailbreaking attacks . Transactions on Machine Learning Research
2025
-
[23]
Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/russinovich Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack . In 34th USENIX Security Symposium (USENIX Security 25), pages 2421--2440, Seattle, WA. USENIX Association
2025
-
[24]
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024 a . https://doi.org/10.1145/3658644.3670388 "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models . In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS '24, page 1671–1685, New York, NY, US...
-
[25]
Xinyue Shen, Yixin Wu, Michael Backes, and Yang Zhang. 2024 b . https://arxiv.org/abs/2405.19103 Voice jailbreak attacks against gpt-4o . Preprint, arXiv:2405.19103
arXiv 2024
-
[26]
Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, and Jun Luo. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.63 Dagger behind smile: Fool LLM s with a happy ending story . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1197--1229, Suzhou, China. Association for Computational Linguistics
-
[27]
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. 2024. https://doi.org/10.52202/079017-3984 A strongreject for empty jailbreaks . In Advances in Neural Information Processing Systems, volume 37, pages 125416--125440. Curran Associates, Inc
-
[28]
Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Ruifang He, and Yuexian Hou. 2025. https://aclanthology.org/2025.coling-main.494/ R ole B reak: Character hallucination as a jailbreak attack in role-playing systems . In Proceedings of the 31st International Conference on Computational Linguistics, pages 7386--7402, Abu Dhabi, UAE. Association for C...
2025
-
[29]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/fd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079--80110. Curran Associates, Inc
2023
-
[30]
Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.100 Foot-in-the-door: A multi-turn jailbreak for LLM s . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939--1950, Suzhou, China. Association for Computational Linguistics
-
[31]
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/9622163c87b67fd5a4a0ec3247cf356e-Paper-Conference.pdf Sorry-bench: Systematically evaluating large...
arXiv 2025
-
[32]
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. https://doi.org/10.1038/s42256-023-00765-8 Defending chatgpt against jailbreak attack via self-reminders . Nature Machine Intelligence, 5(12):1486--1496
-
[33]
Zhao XU, Fan LIU, and Hao LIU. 2024. https://doi.org/10.52202/079017-1012 Bag of tricks: Benchmarking of jailbreak attacks on llms . In Advances in Neural Information Processing Systems, volume 37, pages 32219--32250. Curran Associates, Inc
-
[34]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388
Pith/arXiv arXiv 2025
-
[35]
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. https://doi.org/10.18653/v1/2024.acl-long.773 How johnny can persuade LLM s to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLM s . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...
-
[36]
Zico Kolter, and Matt Fredrikson
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. https://arxiv.org/abs/2307.15043 Universal and transferable adversarial attacks on aligned language models . Preprint, arXiv:2307.15043
Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.