pith. machine review for the scientific record.

arxiv: 2604.12232 · v1 · submitted 2026-04-14 · 💻 cs.CR · cs.AI · cs.SE

Recognition: unknown

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Enwei Hu, Junjie Chen, Lili Huang, Qingchao Shen, Yongqiang Tian, Zibo Xiao

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords jailbreak attacks · chat templates · LLM security · fuzzing · red teaming · prompt injection · safety evaluation

The pith

Fuzzing chat templates with element-level mutations lets attackers jailbreak LLMs at a 98.2 percent average success rate while keeping normal-task accuracy nearly unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TEMPLATEFUZZ as a method to generate many variants of the fixed chat templates that LLMs use to format conversations. It applies targeted mutation rules to each part of the template, then uses a guided search to find variants that raise the chance an adversarial prompt will produce harmful output. An active-learning step creates a simple rule set that labels each attempt as successful or not. The resulting templates expose safety bypasses on twelve open-source models at an average 98.2 percent attack success rate and only 1.1 percent drop in ordinary task performance, and they also work against five commercial models through prompt injection even when the template itself cannot be changed. This matters because chat templates are a universal but rarely examined layer between user input and model behavior; if they can be systematically weakened, current safety training may rest on a narrower foundation than assumed.
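
To make the element-level idea concrete, here is a minimal Python sketch of this kind of template mutation, assuming a chat template decomposed into named role-marker fields; the field names, mutation rules, and control tokens are illustrative, not the paper's actual rule set.

```python
import random

# A chat template decomposed into named elements (illustrative field names;
# real templates are Jinja strings with model-specific control tokens).
TEMPLATE = {
    "system_open":    "<|im_start|>system\n",
    "system_close":   "<|im_end|>\n",
    "user_open":      "<|im_start|>user\n",
    "user_close":     "<|im_end|>\n",
    "assistant_open": "<|im_start|>assistant\n",
}

# Element-level mutation rules: each rule rewrites exactly one element.
MUTATIONS = [
    lambda s: "",                                 # delete the element
    lambda s: s + s,                              # duplicate it
    lambda s: s.swapcase(),                       # perturb token casing
    lambda s: s.replace("system", "developer"),   # rename a role tag
    lambda s: s.rstrip("\n"),                     # drop a trailing newline
]

def mutate(template: dict) -> dict:
    """Return a variant that differs from `template` in exactly one element."""
    variant = dict(template)
    element = random.choice(list(variant))
    rule = random.choice(MUTATIONS)
    variant[element] = rule(variant[element])
    return variant

def render(template: dict, system: str, user: str) -> str:
    """Assemble a full prompt from the (possibly mutated) template elements."""
    return (template["system_open"] + system + template["system_close"]
            + template["user_open"] + user + template["user_close"]
            + template["assistant_open"])

if __name__ == "__main__":
    random.seed(0)
    print(render(mutate(TEMPLATE), "You are a helpful assistant.", "Hello"))
```

Because each mutation touches a single element, any change in model behavior can be attributed to that element, which is what makes the fuzzing fine-grained rather than whole-prompt rewriting.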

Core claim

TEMPLATEFUZZ shows that fine-grained mutation of chat-template elements, combined with heuristic search toward higher attack success and an active-learning oracle, produces template variants that reliably elicit disallowed outputs from both open-source and commercial LLMs while preserving model accuracy on normal tasks.

What carries the argument

Element-level mutation rules that alter individual components of a chat template, steered by a heuristic search balancing attack success rate against accuracy preservation, and backed by an active-learning-derived rule oracle that labels jailbreak outcomes.
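
Read as pseudocode, the search component could look like the following hedged sketch; the greedy loop, the accuracy floor, and the `measure_asr`/`measure_accuracy` callbacks are assumptions standing in for the paper's sampling-learning and dynamic-feedback strategies, not its specification.

```python
def heuristic_search(seed_template, mutate, measure_asr, measure_accuracy,
                     rounds=50, population=20, accuracy_floor=0.95):
    """Greedy fuzzing loop: keep the variant with the best attack success
    rate whose normal-task accuracy stays above a floor relative to the
    unmutated seed template."""
    base_acc = measure_accuracy(seed_template)
    best, best_asr = seed_template, measure_asr(seed_template)
    for _ in range(rounds):
        candidates = [mutate(best) for _ in range(population)]
        for cand in candidates:
            # Cheap filter first: discard variants that break normal tasks.
            if measure_accuracy(cand) < accuracy_floor * base_acc:
                continue
            asr = measure_asr(cand)
            if asr > best_asr:
                best, best_asr = cand, asr
    return best, best_asr
```

The two-sided objective (raise ASR, hold accuracy) is what distinguishes this from plain jailbreak search: a template that breaks safety by breaking the model entirely would be filtered out.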

Load-bearing premise

The rule-based oracle produced by active learning labels jailbreak attempts correctly without large numbers of false positives or negatives, and the heuristic search finds templates that work beyond the specific models and scenarios used in testing.
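
For intuition, a lightweight rule-based oracle of the kind this premise depends on might reduce to surface-pattern checks like the sketch below; the marker strings are invented here for illustration and are far cruder than an actively learned rule set, which is precisely why its false-positive and false-negative rates matter.

```python
# Minimal sketch of a rule-based jailbreak oracle over a model response.
# These patterns are illustrative assumptions, not the paper's learned rules.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")
COMPLIANCE_MARKERS = ("step 1", "here is how", "first,")

def is_jailbreak(response: str) -> bool:
    """Label a response as a successful jailbreak if it shows compliance
    cues and no refusal cues."""
    text = response.lower()
    refused = any(m in text for m in REFUSAL_MARKERS)
    complied = any(m in text for m in COMPLIANCE_MARKERS)
    return complied and not refused
```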

What would settle it

Applying the same mutation rules and search process to a fresh collection of LLMs and observing that attack success rates fall below 80 percent or that normal-task accuracy drops more than a few percent would show the reported results do not generalize.

Figures

Figures reproduced from arXiv:2604.12232 by Enwei Hu, Junjie Chen, Lili Huang, Qingchao Shen, Yongqiang Tian, and Zibo Xiao.

Figure 1: This figure shows the responses of Meta-Llama-3-8B-Instruct to a harmful question under different … [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2: This figure shows an example of a chat template with prompts in Qwen2.5-7B-Instruct and highlights … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3: Overview of TemplateFuzz. TemplateFuzz enables fuzz-based jailbreak attacks through three core components. The Chat Template Mutation module defines a set of fine-grained mutation rules for diverse chat template generation. The Heuristic Search Strategy module leverages the sampling learning and dynamic feedback strategies to efficiently guide chat template generation. The Learning-based Judge module adopts … view at source ↗
Figure 5: This figure presents the overall performance comparison of … [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Original abstract

Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TEMPLATEFUZZ, a fuzzing framework targeting LLM chat templates as an attack surface for jailbreaks. It defines element-level mutation rules to generate template variants, a heuristic search to maximize attack success rate (ASR) while limiting accuracy degradation, and an active-learning-derived lightweight rule-based oracle to label jailbreak success. On twelve open-source LLMs it reports 98.2% average ASR with 1.1% accuracy degradation, outperforming prior methods by 9.1–47.9% in ASR and 8.4% in accuracy degradation; on five commercial LLMs it claims 90% ASR via chat-template-based prompt injection.

Significance. If the oracle proves reliable, the work usefully identifies chat templates as a distinct and previously underexplored attack vector and supplies a practical red-teaming tool. The reported combination of near-perfect ASR with negligible accuracy loss would be a strong empirical result for the security community. However, the absence of oracle validation metrics makes the quantitative claims difficult to interpret or reproduce.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (oracle component): All reported ASR figures (98.2% on open-source models, 90% on commercial models) depend on the active-learning rule-based oracle correctly classifying outputs as jailbreaks. The manuscript supplies no precision, recall, F1, or inter-annotator agreement figures against human labels on held-out templates or models, nor any description of the active-learning query strategy or training/validation splits. Systematic false positives would directly inflate the claimed ASR and the “outperforms SOTA” margins.
  2. [Evaluation] Evaluation section: No error bars, confidence intervals, or statistical significance tests accompany the ASR and accuracy-degradation numbers across the twelve models. The paper also omits details on how the twelve models were selected, whether any data leakage occurred between oracle training and evaluation, or how attack scenarios were partitioned, leaving open the possibility of selection bias or overfitting of the heuristic search.
  3. [Commercial LLMs evaluation] Commercial LLMs paragraph: The 90% ASR result on five industry models is obtained via “chat template-based prompt injection attacks” even though the models do not expose editable templates. The manuscript does not describe the concrete injection mechanism, how mutated templates are encoded into prompts, or whether the same oracle is used without modification; this step is load-bearing for the generalization claim.
minor comments (2)
  1. [Results] A table comparing TEMPLATEFUZZ against each baseline (with exact ASR and accuracy numbers) would make the 9.1–47.9% improvement range easier to verify.
  2. [Method] Notation for the mutation operators and the heuristic scoring function should be introduced once and used consistently; currently the text alternates between descriptive phrases and ad-hoc symbols.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on oracle validation, evaluation rigor, and commercial LLM generalization. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (oracle component): All reported ASR figures (98.2% on open-source models, 90% on commercial models) depend on the active-learning rule-based oracle correctly classifying outputs as jailbreaks. The manuscript supplies no precision, recall, F1, or inter-annotator agreement figures against human labels on held-out templates or models, nor any description of the active-learning query strategy or training/validation splits. Systematic false positives would directly inflate the claimed ASR and the “outperforms SOTA” margins.

    Authors: We agree that explicit validation metrics for the oracle are needed for interpretability. The active-learning process iteratively selected uncertain templates for human annotation to derive the rule set, with internal cross-validation on held-out data showing strong agreement. In revision we will expand §3 with the query strategy (uncertainty sampling), train/validation splits, and report precision, recall, F1, and inter-annotator agreement (targeting F1 > 0.92) against human labels. This directly addresses potential false-positive inflation and allows readers to assess the ASR figures. revision: yes
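
    As an illustration of the promised validation step, a minimal sketch using scikit-learn (the metric choices follow the response above; the function and variable names are ours, not the authors'):

```python
from sklearn.metrics import precision_recall_fscore_support, cohen_kappa_score

def validate_oracle(oracle_labels, human_labels):
    """Score binary oracle verdicts against human annotations on a
    held-out set of model responses."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        human_labels, oracle_labels, average="binary")
    # Cohen's kappa doubles as a chance-corrected agreement measure.
    kappa = cohen_kappa_score(human_labels, oracle_labels)
    return {"precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}
```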

  2. Referee: [Evaluation] Evaluation section: No error bars, confidence intervals, or statistical significance tests accompany the ASR and accuracy-degradation numbers across the twelve models. The paper also omits details on how the twelve models were selected, whether any data leakage occurred between oracle training and evaluation, or how attack scenarios were partitioned, leaving open the possibility of selection bias or overfitting of the heuristic search.

    Authors: We acknowledge the absence of statistical details and selection criteria in the current text. The twelve models were selected for architectural and size diversity across major open-source families. To resolve concerns, we will add error bars and 95% confidence intervals computed over five independent runs, include model-selection rationale, explicitly state that oracle training templates were disjoint from evaluation sets, describe attack-scenario partitioning (e.g., scenario-level hold-out), and report paired statistical tests against baselines to rule out overfitting or bias. revision: yes
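
    A hedged sketch of the promised interval computation, assuming five per-run ASR measurements per model and a Student-t interval (SciPy; the names are ours):

```python
import numpy as np
from scipy import stats

def asr_mean_ci(per_run_asr, confidence=0.95):
    """Mean ASR and t-based confidence interval over independent runs."""
    runs = np.asarray(per_run_asr, dtype=float)
    half = stats.sem(runs) * stats.t.ppf((1 + confidence) / 2,
                                         df=len(runs) - 1)
    return runs.mean(), (runs.mean() - half, runs.mean() + half)

# e.g. asr_mean_ci([0.98, 0.97, 0.99, 0.98, 0.99]) over five runs
```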

  3. Referee: [Commercial LLMs evaluation] Commercial LLMs paragraph: The 90% ASR result on five industry models is obtained via “chat template-based prompt injection attacks” even though the models do not expose editable templates. The manuscript does not describe the concrete injection mechanism, how mutated templates are encoded into prompts, or whether the same oracle is used without modification; this step is load-bearing for the generalization claim.

    Authors: For commercial models we encode mutated template elements (role separators, format tokens) directly into the prompt prefix/suffix to emulate the missing chat template. The identical rule-based oracle is applied unchanged. In revision we will add a concrete description of the injection encoding together with an illustrative prompt example, confirming oracle reuse and thereby clarifying the 90% ASR generalization. revision: yes
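
    A minimal sketch of such an injection encoding, assuming mutated template elements are spelled out as literal text inside a single user message (the element names and the benign system string are placeholders, not the paper's encoding):

```python
def inject_template(elements: dict, user_prompt: str) -> str:
    """Emulate a mutated chat template inside a plain prompt by writing
    the role separators and format tokens out as literal prefix/suffix
    text around the user's message."""
    return (
        elements["system_open"] + "You are a helpful assistant."
        + elements["system_close"]
        + elements["user_open"] + user_prompt + elements["user_close"]
        + elements["assistant_open"]  # leave the assistant turn open
    )
```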

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper describes an empirical fuzzing pipeline consisting of element-level mutation rules, a heuristic search guided by measured ASR, and an active-learning procedure to produce a rule-based oracle. All quantitative claims (ASR percentages, accuracy degradation, outperformance margins) are presented as outcomes of applying this pipeline to external LLM inference results rather than as algebraic identities, fitted parameters renamed as predictions, or results forced by self-citation. No equations, uniqueness theorems, or ansatzes are invoked that reduce the reported performance figures to the method's own inputs by construction. The evaluation therefore remains an independent measurement against model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions of fuzzing and machine-learning evaluation plus the unstated premise that the chosen mutation rules and oracle construction are sufficient to expose real vulnerabilities. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption The active-learning oracle derived from initial model outputs will generalize to label jailbreak success accurately on new template variants.
    Required for the efficiency claim and the reported ASR numbers.
  • domain assumption Accuracy on normal tasks remains a valid proxy for model utility after template mutation.
    Used to constrain the search and report the 1.1% degradation figure.

pith-pipeline@v0.9.0 · 5576 in / 1518 out tokens · 44697 ms · 2026-05-10T15:58:49.975665+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 27 canonical work pages · 11 internal anchors

  1. [1] ACM Code of Ethics and Professional Conduct. Accessed 2026. https://www.acm.org/code-of-ethics
  2. [2] American Fuzzy Lop. Accessed 2026. https://lcamtuf.coredump.cx/afl/
  3. [3] Qwen-Plus Model. Accessed 2026. https://bailian.console.aliyun.com/cn-beijing/?spm=5176.29597918.J_tAwMEW-mKC1CPxlfy227s.1.7b1e7b08PWbb7d&tab=model#/model-market/detail/qwen-plus?modelGroup=qwen-plus
  4. [4] TemplateFuzz. Accessed 2026. https://anonymous.4open.science/r/TemplateFuzz-2CC6
  5. [5] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024)
  6. [6] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)
  7. [7] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42
  8. [8] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2024. Comprehensive assessment of jailbreak attacks against LLMs. arXiv e-prints (2024), arXiv–2402
  9. [9] DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437
  10. [10] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 423–435
  11. [11] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023. Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. arXiv preprint arXiv:2304.02014 (2023)
  12. [12] Hugging Face. 2023. Chat Templating in Transformers. https://huggingface.co/docs/transformers/en/chat_templating Accessed: 2026-08-22
  13. [13] Yansong Gao, Bao Gia Doan, Zhi Zhang, Siqi Ma, Jiliang Zhang, Anmin Fu, Surya Nepal, and Hyoungshick Kim. 2020. Backdoor attacks and countermeasures on deep learning: A comprehensive review. arXiv preprint arXiv:2007.10760 (2020)
  14. [14] Google Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv abs/2507.06261 (2025). https://api.semanticscholar.org/CorpusID:280151524
  15. [15] Patrice Godefroid, Michael Y Levin, David A Molnar, et al. 2008. Automated whitebox fuzz testing. In NDSS, Vol. 8. 151–166
  16. [16] Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, and Yanjun Qi. 2025. TURBOFUZZLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice. arXiv preprint arXiv:2502.18504 (2025)
  17. [17] David E. Goldberg. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley
  18. [18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
  19. [19] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  20. [20] Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. 2024. COLD-Attack: Jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679 (2024)
  21. [21] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. [n. d.]. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations
  22. [22] Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh Jha, and Thomas Reps. 2022. Semantic robustness of models of source code. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 526–537
  23. [23] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024)
  24. [24] Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. 2025. ChatBug: A common vulnerability of aligned LLMs induced by chat templates. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 27347–27355
  25. [25–26] Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. 2024. ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15157–15173
  26. [27] Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler validation via equivalence modulo inputs. ACM SIGPLAN Notices 49, 6 (2014), 216–226
  27. [28] Caroline Lemieux and Koushik Sen. 2018. FairFuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 475–485
  28. [29] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097
  29. [30] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. 2021. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16463–16472
  30. [31] Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. 2024. AutoDAN-Turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295 (2024)
  31. [32] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499 (2023)
  32. [33] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023)
  33. [34] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. 2020. Reflection backdoor: A natural backdoor attack on deep neural networks. In European Conference on Computer Vision. Springer, 182–199
  34. [35] Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, and Pan Zhou. 2024. AutoJailbreak: Exploring jailbreak attacks and defenses through a dependency lens. arXiv preprint arXiv:2406.03805 (2024)
  35. [36] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems 37 (2024), 61065–61105
  36. [37] Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. 2024. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309 (2024)
  37. [39] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744
  38. [40] Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. 2024. AdvPrompter: Fast adaptive adversarial prompting for LLMs. arXiv preprint arXiv:2404.16873 (2024)
  39. [41] Hui Peng, Yan Shoshitaishvili, and Mathias Payer. 2018. T-Fuzz: Fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 697–710
  40. [42] Qingchao Shen, Junjie Chen, Jie M Zhang, Haoyu Wang, Shuang Liu, and Menghan Tian. 2022. Natural test generation for precise testing of question answering software. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12
  41. [43] Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 968–980
  42. [44] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. 1671–1685
  43. [45] Neetu Singh and Sandeep Kumar Singh. 2024. A systematic literature review of solutions for cold start problem. International Journal of System Assurance Engineering and Management 15, 7 (2024), 2818–2852
  44. [46] Irene Solaiman and Christy Dennison. 2021. Process for adapting language models to society (PALMS) with values-targeted datasets. Advances in Neural Information Processing Systems 34 (2021), 5861–5873
  45. [47] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. 2017. Certified defenses for data poisoning attacks. Advances in Neural Information Processing Systems 30 (2017)
  46. [48] Gemma Team. 2025. Gemma 3 Technical Report. arXiv abs/2503.19786 (2025). https://api.semanticscholar.org/CorpusID:277313563
  47. [49] OpenAI Team. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. https://api.semanticscholar.org/CorpusID:257532815
  48. [50] Qwen Team. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.18188 (2023). https://arxiv.org/abs/2309.18188
  49. [51] Vale Tolpegin, Stacey Truex, Mehmet Emre Gursoy, and Ling Liu. 2020. Data poisoning attacks against federated learning systems. In European Symposium on Research in Computer Security. Springer, 480–501
  50. [52] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  51. [53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  52. [54] Susana M Vieira, Uzay Kaymak, and João MC Sousa. 2010. Cohen's kappa coefficient as a performance measure for feature selection. In International Conference on Fuzzy Systems. IEEE, 1–8
  53. [55] Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387 (2023)
  54. [56] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence 5, 12 (2023), 1486–1496
  55. [57] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 283–294
  56. [58] Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 (2024)
  57. [59] Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446 (2023)
  58. [60] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023)
  59. [61] Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don't listen to me: Understanding and exploring jailbreak prompts of large language models. In 33rd USENIX Security Symposium (USENIX Security 24). 4675–4692
  60. [62] Xuezhou Zhang, Xiaojin Zhu, and Laurent Lessard. 2020. Online data poisoning attacks. In Learning for Dynamics and Control. PMLR, 201–210
  61. [63] Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G Parker, and Munmun De Choudhury. 2023. Synthetic lies: Understanding AI-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20
  62. [64] Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. 2022. Fuzzing: A survey for roadmap. ACM Computing Surveys (CSUR) 54, 11s (2022), 1–36
  63. [65] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)