TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3
The pith
Fuzzing chat templates with element-level mutations lets attackers jailbreak LLMs at 98 percent success while keeping normal accuracy nearly unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TEMPLATEFUZZ shows that fine-grained mutation of chat-template elements, combined with heuristic search toward higher attack success and an active-learning oracle, produces template variants that reliably elicit disallowed outputs from both open-source and commercial LLMs while preserving model accuracy on normal tasks.
What carries the argument
Element-level mutation rules that alter individual components of a chat template, steered by a heuristic search balancing attack success rate against accuracy preservation, and backed by an active-learning-derived rule oracle that labels jailbreak outcomes.
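To make "element-level mutation" concrete, here is a minimal sketch of how such rules could operate on a template's individual parts. The field names, operators, and rendering function are illustrative assumptions, not the paper's actual rule set.

```python
import random

# Hypothetical chat-template representation: each field is one mutable
# element. These field names are illustrative, not the paper's schema.
TEMPLATE = {
    "system_prefix": "<|system|>\n",
    "user_tag": "<|user|>\n",
    "assistant_tag": "<|assistant|>\n",
    "separator": "\n",
    "eos": "<|end|>",
}

def drop_element(t, key):
    """Mutation rule: delete one template element."""
    t = dict(t)
    t[key] = ""
    return t

def duplicate_element(t, key):
    """Mutation rule: repeat one template element."""
    t = dict(t)
    t[key] = t[key] * 2
    return t

def swap_roles(t, _key=None):
    """Mutation rule: exchange the user and assistant role tags."""
    t = dict(t)
    t["user_tag"], t["assistant_tag"] = t["assistant_tag"], t["user_tag"]
    return t

RULES = [drop_element, duplicate_element, swap_roles]

def mutate(template):
    """Apply one randomly chosen rule to one randomly chosen element."""
    rule = random.choice(RULES)
    key = random.choice(list(template))
    return rule(template, key)

def render(template, user_msg):
    """Assemble a prompt from the (possibly mutated) template."""
    return (template["system_prefix"] + template["separator"]
            + template["user_tag"] + user_msg + template["eos"]
            + template["separator"] + template["assistant_tag"])
```

Each rule touches exactly one element at a time, which is what makes the fuzzing fine-grained: the search can attribute a change in attack success to a specific template component.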
Load-bearing premise
The rule-based oracle produced by active learning labels jailbreak attempts correctly, with few false positives or negatives, and the heuristic search finds templates that work beyond the specific models and scenarios used in testing.
What would settle it
Applying the same mutation rules and search procedure to a fresh set of LLMs and observing attack success rates below 80 percent, or normal-task accuracy dropping by more than a few percent, would show the reported results do not generalize.
Original abstract
Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TEMPLATEFUZZ, a fuzzing framework targeting LLM chat templates as an attack surface for jailbreaks. It defines element-level mutation rules to generate template variants, a heuristic search to maximize attack success rate (ASR) while limiting accuracy degradation, and an active-learning-derived lightweight rule-based oracle to label jailbreak success. On twelve open-source LLMs it reports 98.2% average ASR with 1.1% accuracy degradation, outperforming prior methods by 9.1–47.9% in ASR and 8.4% in accuracy degradation; on five commercial LLMs it claims 90% ASR via chat-template-based prompt injection.
Significance. If the oracle proves reliable, the work usefully identifies chat templates as a distinct and previously underexplored attack vector and supplies a practical red-teaming tool. The reported combination of near-perfect ASR with negligible accuracy loss would be a strong empirical result for the security community. However, the absence of oracle validation metrics makes the quantitative claims difficult to interpret or reproduce.
major comments (3)
- [Abstract, §3] Abstract and §3 (oracle component): All reported ASR figures (98.2% on open-source models, 90% on commercial models) depend on the active-learning rule-based oracle correctly classifying outputs as jailbreaks. The manuscript supplies no precision, recall, F1, or inter-annotator agreement figures against human labels on held-out templates or models, nor any description of the active-learning query strategy or training/validation splits. Systematic false positives would directly inflate the claimed ASR and the “outperforms SOTA” margins.
- [Evaluation] Evaluation section: No error bars, confidence intervals, or statistical significance tests accompany the ASR and accuracy-degradation numbers across the twelve models. The paper also omits details on how the twelve models were selected, whether any data leakage occurred between oracle training and evaluation, or how attack scenarios were partitioned, leaving open the possibility of selection bias or overfitting of the heuristic search.
- [Commercial LLMs evaluation] Commercial LLMs paragraph: The 90% ASR result on five industry models is obtained via “chat template-based prompt injection attacks” even though the models do not expose editable templates. The manuscript does not describe the concrete injection mechanism, how mutated templates are encoded into prompts, or whether the same oracle is used without modification; this step is load-bearing for the generalization claim.
minor comments (2)
- [Results] A table comparing TEMPLATEFUZZ against each baseline (with exact ASR and accuracy numbers) would make the 9.1–47.9% improvement range easier to verify.
- [Method] Notation for the mutation operators and the heuristic scoring function should be introduced once and used consistently; the text currently alternates between descriptive phrases and ad-hoc symbols (one consistent formulation is sketched just below this list).
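As an illustration of the consistent formulation the last comment asks for, one could write f(t) = ASR(t) − λ · max(0, acc₀ − acc(t)) for a template variant t and search greedily over mutations. A minimal sketch under that assumed objective; the paper's actual notation, weighting, and search scheme are not specified here, and `measure_asr`/`measure_accuracy` are stand-in evaluation hooks:

```python
def fitness(template, baseline_acc, measure_asr, measure_accuracy, lam=5.0):
    """f(t) = ASR(t) - lam * max(0, acc_0 - acc(t)).

    Rewards attack success and penalizes accuracy loss; `lam` is our
    choice of trade-off weight, not a value from the paper.
    """
    asr = measure_asr(template)   # fraction of attacks judged successful
    acc_drop = baseline_acc - measure_accuracy(template)
    return asr - lam * max(0.0, acc_drop)

def heuristic_search(seed, mutate, measure_asr, measure_accuracy,
                     baseline_acc, budget=200):
    """Greedy hill-climbing over template variants; one plausible reading
    of 'heuristic search' (the paper may use a population-based scheme)."""
    best = seed
    best_score = fitness(best, baseline_acc, measure_asr, measure_accuracy)
    for _ in range(budget):
        cand = mutate(best)
        score = fitness(cand, baseline_acc, measure_asr, measure_accuracy)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```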
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on oracle validation, evaluation rigor, and commercial LLM generalization. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract, §3] Abstract and §3 (oracle component): All reported ASR figures (98.2% on open-source models, 90% on commercial models) depend on the active-learning rule-based oracle correctly classifying outputs as jailbreaks. The manuscript supplies no precision, recall, F1, or inter-annotator agreement figures against human labels on held-out templates or models, nor any description of the active-learning query strategy or training/validation splits. Systematic false positives would directly inflate the claimed ASR and the “outperforms SOTA” margins.
Authors: We agree that explicit validation metrics for the oracle are needed for interpretability. The active-learning process iteratively selected uncertain templates for human annotation to derive the rule set, with internal cross-validation on held-out data showing strong agreement. In revision we will expand §3 with the query strategy (uncertainty sampling), train/validation splits, and report precision, recall, F1, and inter-annotator agreement (targeting F1 > 0.92) against human labels. This directly addresses potential false-positive inflation and allows readers to assess the ASR figures. revision: yes
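The validation metrics committed to here reduce to a few counts once parallel human labels exist; a minimal sketch, assuming both label sets use the convention True = judged a jailbreak:

```python
def oracle_metrics(oracle_labels, human_labels):
    """Precision, recall, and F1 of the rule-based oracle against
    human labels; inputs are parallel lists of booleans."""
    tp = sum(o and h for o, h in zip(oracle_labels, human_labels))
    fp = sum(o and not h for o, h in zip(oracle_labels, human_labels))
    fn = sum(h and not o for o, h in zip(oracle_labels, human_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Precision is the quantity that bounds false-positive inflation: since true positives are at least p times the flagged count when precision is p, a reported ASR can overstate the true rate by at most a factor of 1/p.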
- Referee: [Evaluation] Evaluation section: No error bars, confidence intervals, or statistical significance tests accompany the ASR and accuracy-degradation numbers across the twelve models. The paper also omits details on how the twelve models were selected, whether any data leakage occurred between oracle training and evaluation, or how attack scenarios were partitioned, leaving open the possibility of selection bias or overfitting of the heuristic search.
Authors: We acknowledge the absence of statistical details and selection criteria in the current text. The twelve models were selected for architectural and size diversity across major open-source families. To resolve concerns, we will add error bars and 95% confidence intervals computed over five independent runs, include model-selection rationale, explicitly state that oracle training templates were disjoint from evaluation sets, describe attack-scenario partitioning (e.g., scenario-level hold-out), and report paired statistical tests against baselines to rule out overfitting or bias. revision: yes
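The promised intervals and paired tests are standard to compute; a sketch using a percentile bootstrap for the 95% confidence interval over runs and a paired Wilcoxon test against a baseline. SciPy is assumed available, and the per-model numbers below are placeholders, not results from the paper:

```python
import random
from scipy.stats import wilcoxon

def bootstrap_ci(per_run_asr, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap 95% CI for mean ASR over independent runs."""
    means = []
    for _ in range(n_boot):
        sample = [random.choice(per_run_asr) for _ in per_run_asr]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return lo, hi

# Paired test: per-model ASR for the proposed method vs. one baseline.
# These values are placeholders for illustration only.
tf_asr   = [0.99, 0.97, 0.98, 0.99, 0.96]
base_asr = [0.90, 0.85, 0.88, 0.91, 0.83]
stat, p = wilcoxon(tf_asr, base_asr)
```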
- Referee: [Commercial LLMs evaluation] Commercial LLMs paragraph: The 90% ASR result on five industry models is obtained via “chat template-based prompt injection attacks” even though the models do not expose editable templates. The manuscript does not describe the concrete injection mechanism, how mutated templates are encoded into prompts, or whether the same oracle is used without modification; this step is load-bearing for the generalization claim.
Authors: For commercial models we encode mutated template elements (role separators, format tokens) directly into the prompt prefix/suffix to emulate the missing chat template. The identical rule-based oracle is applied unchanged. In revision we will add a concrete description of the injection encoding together with an illustrative prompt example, confirming oracle reuse and thereby clarifying the 90% ASR generalization. revision: yes
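One plausible encoding consistent with this description is sketched below; the exact scheme is not given in the text, and the field names follow the earlier illustrative template dictionary:

```python
def encode_injection(mutated, request):
    """Fold mutated chat-template elements into an ordinary user message.

    A plausible reading of the described attack, not the paper's exact
    scheme: the attacker cannot edit the provider's template, so the
    mutated role markers are serialized into the message content itself.
    """
    inner = (mutated["system_prefix"] + "(injected system text)"
             + mutated["separator"]
             + mutated["user_tag"] + request + mutated["eos"]
             + mutated["separator"] + mutated["assistant_tag"])
    # `inner` is sent as the content of a single user turn via the
    # commercial API; the model then sees attacker-controlled role
    # markers nested inside its own uneditable template.
    return inner
```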
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical fuzzing pipeline consisting of element-level mutation rules, a heuristic search guided by measured ASR, and an active-learning procedure to produce a rule-based oracle. All quantitative claims (ASR percentages, accuracy degradation, outperformance margins) are presented as outcomes of applying this pipeline to external LLM inference results rather than as algebraic identities, fitted parameters renamed as predictions, or results forced by self-citation. No equations, uniqueness theorems, or ansatzes are invoked that reduce the reported performance figures to the method's own inputs by construction. The evaluation therefore remains an independent measurement against model behavior.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The active-learning oracle derived from initial model outputs will generalize to label jailbreak success accurately on new template variants.
- Domain assumption: Accuracy on normal tasks remains a valid proxy for model utility after template mutation.