pith. machine review for the scientific record. sign in

arxiv: 2605.01462 · v1 · submitted 2026-05-02 · 💻 cs.CR

Recognition: unknown

LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

Jiawei Liu, Xiaofeng Wang, Yuyang Gong, Zihao Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-09 14:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords prompt injectionadversarial examplesLLM defensealignment trainingmargin-aware weightingrobustness boundarygeneralization
0
0 comments X

The pith

LocalAlign generates near-but-wrong adversarial examples via prompting to enforce tighter boundaries against prompt injection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing prompt injection defenses, which rely on fixed hand-crafted attacks, create loose boundaries around correct responses and fail to generalize when injected commands produce answers close to but not equal to the correct ones. LocalAlign instead automatically creates such near-target adversarial examples with a single inference step and then applies margin-aware alignment training that assigns higher weights to examples nearer the correct response. This approach penalizes even small response shifts from malicious commands embedded in untrusted data, leading to more robust prioritization of trusted instructions. A reader would care because LLMs are increasingly used with external content where subtle injections can cause unintended behavior without obvious errors.

Core claim

LocalAlign addresses poor generalization in prompt injection defenses by automatically generating adversarial examples in which the command in the data portion induces a response near the correct one but still incorrect, produced through prompting and a single inference step. It further introduces a margin-aware alignment algorithm that quantifies each sample's distance to the correct response and assigns larger training weights to nearer examples, thereby enforcing a tighter robustness boundary around the correct response.

What carries the argument

Near-target adversarial example generator using prompting plus margin-aware alignment that weights samples by proximity to the correct response during training.

If this is right

  • The model explicitly penalizes small deviations from correct responses caused by untrusted data.
  • Defense generalizes to real-world injections without depending on a fixed set of attack targets.
  • Training focuses computational effort on harder, nearer examples rather than obvious attacks.
  • Quality variation across generated examples is mitigated by proximity-based weighting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distance metric used for weighting could be adapted to other LLM alignment problems such as reducing subtle hallucinations.
  • If the generation method scales, it may lower the frequency of retraining needed as new injection techniques appear.
  • Combining proximity weighting with other robustness techniques might produce additive gains in safety.
  • Testing on diverse domains like tool-using agents would reveal whether the tighter boundary transfers beyond text tasks.

Load-bearing premise

Near-but-wrong adversarial examples can be reliably produced with only prompting and one inference step, and distance-based weighting will yield a generalizable tighter boundary without creating new failure modes.

What would settle it

A test set of prompt injections that produce responses semantically close to correct answers yet harmful, applied to a LocalAlign-trained model to check if attacks still succeed or if generation quality varies too widely across inputs.

Figures

Figures reproduced from arXiv: 2605.01462 by Jiawei Liu, Xiaofeng Wang, Yuyang Gong, Zihao Wang.

Figure 1
Figure 1. Figure 1: Overview of the motivation and design of LocalAlign. Existing SecAlign-style training relies on fixed attack targets that are often semantically distant from the correct response; for example, it may train the model to reject an injected command targeting “98” when the correct response is “True.” In contrast, LocalAlign constructs near-but-wrong targets, such as “False,” that remain incorrect but are conte… view at source ↗
Figure 2
Figure 2. Figure 2: Relationship between response distance and attacker target like￾lihood on AlpacaFarm [12] under SecAlign [4, 5]. Each point represents one data sample after filtering by mean token probability ≥ 0.1 (𝑁 = 6133). The 𝑥-axis is the embedding cosine distance between the benign user prompt con￾catenated with the correct response and the same prompt concatenated with the attacker target response. The 𝑦-axis is t… view at source ↗
Figure 3
Figure 3. Figure 3: Adaptive GCG loss over all tested samples on Llama3.1-8B￾Instruct. The solid line denotes the average loss, and the shaded region denotes the standard deviation across samples. LocalAlign is substantially harder to attack: even after adaptive optimization, its final attack loss remains higher than the initial attack loss of SecAlign [4, 5]. 5.2 LocalAlign Substantially Reduces ASR In this section, we evalu… view at source ↗
Figure 4
Figure 4. Figure 4: Prompt template for near-target adversarial com view at source ↗
read the original abstract

Large language models are increasingly embedded into systems that interact with user data, retrieved web content, and external tools, creating a new attack surface: prompt injection, where malicious commands embedded in untrusted data override the trusted command and induce unintended behavior. Existing defenses mainly rely on fine-tuning the model to preserve an explicit boundary between trusted commands and the untrusted data portion, so that the model learns to prioritize the trusted field and ignore malicious commands in data. However, we observe that while these defenses can block obviously malicious responses caused by injected commands, they generalize poorly to real-world scenarios where the model's response to the injected command is much nearer to the correct response. This is because existing methods typically train against only a fixed set of hand-crafted attack targets, which yields a loose boundary around the correct response and leaves it easier to bypass. To address this challenge, we propose LocalAlign, a more generalizable prompt injection defense inspired by adversarial training. LocalAlign automatically and efficiently generates adversarial examples in which the command embedded in the data portion induces a response that stays near to the correct response while still being wrong. We generate such near-but-wrong adversarial examples using prompting and a single inference step. This design enforces a tighter robustness boundary around the correct response: even small response shifts induced by commands in untrusted data are explicitly penalized. Moreover, the resulting adversarial examples can vary substantially in quality across samples. To address this issue, we further introduce a margin-aware alignment algorithm that quantifies each sample's distance to the correct response and assigns larger training weight to nearer ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes LocalAlign, a prompt-injection defense for LLMs that automatically generates near-target adversarial examples (responses close to but not equal to the correct answer) by prompting the model once, then applies a margin-aware alignment procedure that weights training samples inversely to their distance from the correct response in order to enforce a tighter robustness boundary around trusted commands.

Significance. If the single-step prompting procedure reliably produces responses that are verifiably near yet incorrect and the margin weighting demonstrably improves generalization without new failure modes, the approach would address a documented weakness of existing fine-tuning defenses that only block obvious attacks and leave subtle near-miss injections unaddressed.

major comments (2)
  1. [Abstract] Abstract and method description: the claim that 'prompting and a single inference step' produces responses 'near to the correct response while still being wrong' is load-bearing for the tighter-boundary argument, yet no prompt template, distance metric, or verification procedure is supplied to ensure the generated response is both incorrect and sufficiently close; without this, the margin-aware weighting cannot be guaranteed to penalize the intended small shifts.
  2. [Abstract] Abstract: the manuscript states that existing defenses 'generalize poorly' to near-correct responses and that LocalAlign improves this, but supplies no experimental results, baselines, attack success rates, or generalization metrics to support the claim; the soundness of the central contribution therefore cannot be assessed from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying the method details and indicating the revisions we will incorporate to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the claim that 'prompting and a single inference step' produces responses 'near to the correct response while still being wrong' is load-bearing for the tighter-boundary argument, yet no prompt template, distance metric, or verification procedure is supplied to ensure the generated response is both incorrect and sufficiently close; without this, the margin-aware weighting cannot be guaranteed to penalize the intended small shifts.

    Authors: We agree that the abstract is necessarily concise and does not include the implementation specifics. The full manuscript describes the generation process in Section 3, but to strengthen the presentation we will add the exact prompt template, define the distance metric used to quantify proximity to the correct response, and outline the verification steps confirming that generated examples are incorrect yet near-target. These additions will make explicit how the margin-aware weighting targets small shifts. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript states that existing defenses 'generalize poorly' to near-correct responses and that LocalAlign improves this, but supplies no experimental results, baselines, attack success rates, or generalization metrics to support the claim; the soundness of the central contribution therefore cannot be assessed from the provided text.

    Authors: The abstract summarizes the high-level contribution while the full manuscript presents the supporting experiments in Section 4, including comparisons to baselines, attack success rates on both obvious and near-correct injections, and generalization metrics across models and datasets. To address the concern about the provided text, we will revise the abstract to include a brief reference to these key empirical outcomes without exceeding length constraints. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes LocalAlign by using an external prompting mechanism plus one inference step to generate near-but-wrong adversarial examples, followed by a margin-aware weighting scheme during alignment training. No derivation step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation. The generation process is described as independent of the training objective, and the claimed tighter boundary follows from penalizing small shifts in the generated examples rather than from any tautological redefinition or renaming of inputs. The method is self-contained against external benchmarks via the prompting procedure.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach depends on the assumption that single-step prompting produces useful near-target examples and that distance quantification improves training, without independent verification of these steps.

free parameters (1)
  • margin threshold
    Parameter used to quantify distance to correct response and assign training weights in the alignment algorithm.
axioms (1)
  • domain assumption Prompting an LLM in a single inference step can generate responses that are near but not equal to the correct response for use as adversarial examples.
    Invoked in the description of the generation process.

pith-pipeline@v0.9.0 · 5595 in / 1217 out tokens · 76327 ms · 2026-05-09T14:13:16.134514+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    Mark Breitenbach, Adrian Wood, Win Suen, and Po-Ning Tseng. 2023. Don’t You (Forget NLP): Prompt Injection with Control Characters in Chat- GPT. https://dropbox.tech/machine-learning/prompt-injection-with-control- characters-openai-chatgpt-llm

  2. [2]

    Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In2017 ieee symposium on security and privacy (sp). Ieee, 39–57

  3. [3]

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. {StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Security 25). 2383–2400

  4. [4]

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. 2025. Secalign: Defending against prompt injection with preference optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security. 2833–2847

  5. [5]

    Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. 2025. Meta secalign: A secure foundation llm against prompt injection attacks.arXiv preprint arXiv:2507.02735(2025)

  6. [6]

    Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, and Bryan Hooi. 2025. Can indirect prompt injection attacks be detected and removed?. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 18189–18206

  7. [7]

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gard- ner. 2021. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies. 4599–4610

  8. [8]

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. Agentdojo: A dynamic environment to eval- uate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems37 (2024), 82895–82920

  9. [9]

    Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei Huang. 2024. Building guardrails for large language models.arXiv preprint arXiv:2402.01822(2024)

  10. [10]

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. 2024. Workarena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718(2024)

  11. [11]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475(2024)

  12. [12]

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. 2023. Al- pacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems36 (2023), 30039–30069

  13. [13]

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples.arXiv preprint arXiv:1412.6572(2014)

  14. [14]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, et al . 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  15. [15]

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real- world llm-integrated applications with indirect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security. 79–90

  16. [16]

    2023.Securing LLM Systems Against Prompt Injection

    Rich Harang. 2023.Securing LLM Systems Against Prompt Injection. https: //developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection/

  17. [17]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language under- standing.arXiv preprint arXiv:2009.03300(2020)

  18. [18]

    Kai Hu, Weichen Yu, Yining Li, Tianjun Yao, Xiang Li, Wenhe Liu, Lijun Yu, Zhiqiang Shen, Kai Chen, and Matt Fredrikson. 2024. Efficient llm jailbreak via adaptive dense-to-sparse constrained optimization.Advances in Neural Information Processing Systems37 (2024), 23224–23245

  19. [19]

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674(2023)

  20. [20]

    Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. 2025. Promptlocate: Localizing prompt injection attacks.arXiv preprint arXiv:2510.12252(2025)

  21. [21]

    Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. 2025. A critical evaluation of defenses against prompt injection attacks. arXiv preprint arXiv:2505.18333(2025)

  22. [22]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classifi- cation with deep convolutional neural networks.Advances in neural information processing systems25 (2012)

  23. [23]

    2023.Sandwich Defense

    Learn Prompting. 2023.Sandwich Defense. https://learnprompting.org/docs/ prompt_hacking/defensive_measures/sandwich_defense

  24. [24]

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injec- tion attack against llm-integrated applications.arXiv preprint arXiv:2306.05499 (2023)

  25. [25]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24). 1831–1847

  26. [26]

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. 2025. Datasentinel: A game-theoretic detection of prompt injection attacks. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 2190–2208

  27. [27]

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083(2017)

  28. [28]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  29. [29]

    Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527(2022)

  30. [30]

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693(2023)

  31. [31]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741

  32. [32]

    Gene Ruebsamen. 2024. Cleaned Alpaca Dataset. https://github.com/gururise/ AlpacaDataCleaned

  33. [33]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

  34. [34]

    Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. 2025. Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219 (2025)

  35. [35]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al . 2024. A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems37 (2024), 125416–125440

  36. [36]

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023. 13003–13051

  37. [37]

    Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, and David Wag- ner. 2025. Defending against prompt injection with datafilter.arXiv preprint arXiv:2510.19207(2025)

  38. [38]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. Mmlu-pro: A more robust and challenging multi-task language understand- ing benchmark.Advances in Neural Information Processing Systems37 (2024), 95266–95290

  39. [39]

    Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. 2026. Jailbreak and guard aligned language models with only few in-context demonstrations. IEEE Transactions on Pattern Analysis and Machine Intelligence(2026)

  40. [40]

    2022.Prompt Injection Attacks against GPT-3

    Simon Willison. 2022.Prompt Injection Attacks against GPT-3. https:// simonwillison.net/2022/Sep/12/prompt-injection/

  41. [41]

    2023.Delimiters Won’t Save You from Prompt Injection

    Simon Willison. 2023.Delimiters Won’t Save You from Prompt Injection. https: //simonwillison.net/2023/May/11/delimiters-wont-save-you/

  42. [42]

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. 2024. Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148(2024)

  43. [43]

    2024.𝑏𝑒𝑡𝑎 -DPO: Direct Preference Optimization with Dynamic 𝑏𝑒𝑡𝑎 .Advances in Neural Information Processing Systems37 (2024), 129944–129966

    Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. 2024.𝑏𝑒𝑡𝑎 -DPO: Direct Preference Optimization with Dynamic 𝑏𝑒𝑡𝑎 .Advances in Neural Information Processing Systems37 (2024), 129944–129966

  44. [44]

    Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song

  45. [45]

    Generating adversarial examples with adversarial networks.arXiv preprint arXiv:1801.02610(2018)

  46. [46]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  47. [47]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Yuyang Gong, Zihao Wang, Jiawei Liu, and XiaoFeng Wang conference on empirical methods in natural language processing. 2369–2380

  48. [48]

    Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pan, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, et al. 2025. A survey on trustworthy llm agents: Threats and countermeasures. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 6216–6226

  49. [49]

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024. 10471–10506

  50. [50]

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911(2023)

  51. [51]

    Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, and Yang Zhang. 2024. Large language models are involuntary truth-tellers: Exploiting fallacy failure for jailbreak attacks. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 13293–13304

  52. [52]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023)

  53. [53]

    Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H Lampert. 2024. Can llms separate instructions from data? and what do we even mean by that?arXiv preprint arXiv:2403.06833(2024). Ethical Considerations We conducted this research in accordance with established ethical guidelines and best practices. All experiments were performed in...

  54. [54]

    B must be in the same broad task family as A

  55. [55]

    B must be executable on Anchor input C

  56. [56]

    B must NOT be a paraphrase, tone rewrite, or formatting rewrite of A

  57. [57]

    B must require a meaningfully different response from A

  58. [58]

    B should be natural, concise, and a single instruction sentence

  59. [59]

    instruction

    Do not mention A, C, or any meta explanation. Anchor instruction A: {anchor_instruction} Anchor input C: {anchor_input} Return JSON only: {{"instruction": "..."}} Figure 4: Prompt template for near-target adversarial com- mand generation. Anchor instruction A corresponds to the trusted command 𝑥cmd, Anchor input C corresponds to the untrusted data 𝑥data, ...