pith. machine review for the scientific record. sign in

arxiv: 2605.01078 · v1 · submitted 2026-05-01 · 💻 cs.CR · cs.AI

Recognition: unknown

A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

Daniel S. Brown, Guanhong Tao, Melissa Umble, Soumil Datta

Authors on Pith no claims yet

Pith reviewed 2026-05-09 18:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt injectionprompt sanitizationnatural language inferenceretrieval-augmented generationLLM defensesentence relation graphconnectivity pruning
0
0 comments X

The pith

SONAR builds sentence relation graphs from entailment and contradiction scores to identify and prune malicious instruction injections while preserving legitimate context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SONAR as a defense against prompt injection attacks in retrieval-augmented generation and tool-using LLM systems. It constructs a sentence-level graph linking the user query to external data, then uses natural language inference metrics as edge weights to flag sentences that contradict or fail to entail the core task. Connectivity-driven pruning removes the injection seeds and their connected neighbors. Evaluations across multiple models and datasets show the method drops attack success rates to nearly zero while outperforming nine prior defenses that rely on LLM detectors or retraining.

Core claim

SONAR constructs a sentence-level relational graph across the user query and external data. By using entailment and contradiction scores as edge weights, the system identifies sentences that deviate from the core task. It then employs connectivity-driven pruning to eliminate flagged injection seeds and their related neighbors while maintaining benign context.

What carries the argument

Sentence-level relational graph whose edges are weighted by off-the-shelf NLI entailment and contradiction scores, followed by connectivity-driven pruning of deviant nodes.

If this is right

  • Attack success rates fall to nearly zero on the evaluated models and datasets.
  • The method outperforms nine established baseline defenses that use LLM-based detection or training.
  • Task performance is preserved because pruning targets only low-connectivity deviant sentences.
  • The approach avoids both the optimization vulnerability of LLM detectors and the distribution-shift failure of training-based methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-pruning pattern could extend to detecting other forms of semantic drift in retrieved documents.
  • Integrating SONAR as a first-stage filter might reduce the load on downstream LLM safety classifiers.
  • If NLI models improve on out-of-domain entailment, SONAR's pruning accuracy would rise without retraining the sanitizer itself.

Load-bearing premise

Entailment and contradiction scores from off-the-shelf NLI models can reliably separate malicious injection sentences from legitimate context across diverse data distributions without excessive false positives that degrade task performance.

What would settle it

A test set of prompt injections deliberately written to produce high entailment or low contradiction scores with the surrounding context; if SONAR still achieves near-zero attack success on that set, the separation assumption holds.

Figures

Figures reproduced from arXiv: 2605.01078 by Daniel S. Brown, Guanhong Tao, Melissa Umble, Soumil Datta.

Figure 1
Figure 1. Figure 1: Illustration of SONAR sentence pruning for indirect prompt injection. SONAR identifies an adversarial seed (S2) using instruction alignment and contradiction signals. This seed is expanded into a contiguous injected span by absorb￾ing neighboring suspicious sentences (S3). Once pruned, the remaining benign content (S1, S4) is reconstructed in its orig￾inal order to form the sanitized context. 5 Method We p… view at source ↗
Figure 2
Figure 2. Figure 2: Average ASR and TF on AlpacaFarm across five target LLMs. Bars represent ASR; lines denote TF. Applying view at source ↗
Figure 3
Figure 3. Figure 3: Average ASR vs. TF on AlpacaFarm, with the view at source ↗
Figure 5
Figure 5. Figure 5: Average ASR vs. TF on Open-Prompt-Injection, view at source ↗
read the original abstract

Retrieval-augmented generation and tool-integrated LLM agents increasingly depend on external textual sources. This reliance broadens the available attack surface, allowing adversaries to insert malicious instructions that trigger unintended model behaviors. Current defensive measures often utilize LLM-based detectors to filter such content, but these approaches remain vulnerable to optimization-based attacks. Additionally, training-based methods frequently fail to generalize to novel data distributions. To resolve these issues, we introduce SONAR, a prompt sanitization framework that identifies and removes injected content using metrics from natural language inference. Specifically, SONAR constructs a sentence-level relational graph across the user query and external data. By using entailment and contradiction scores as edge weights, the system identifies sentences that deviate from the core task. It then employs connectivity-driven pruning to eliminate flagged injection seeds and their related neighbors while maintaining benign context. Rigorous evaluations across several models and datasets show that SONAR reduces the attack success rate to nearly zero, significantly outperforming nine established baseline defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SONAR, a prompt sanitization framework for retrieval-augmented generation and tool-integrated LLM agents. It constructs a sentence-level relational graph with nodes as sentences from the user query and external data, edges weighted by off-the-shelf NLI entailment and contradiction scores, and applies connectivity-driven pruning to remove injection seeds and neighboring sentences while retaining benign context. The authors claim this yields near-zero attack success rates across multiple models and datasets while outperforming nine baseline defenses.

Significance. If the separation between malicious and benign sentences via NLI scores holds under optimized attacks, SONAR would offer a training-free, generalizable defense that avoids the optimization vulnerabilities of LLM-based detectors and the distribution-shift issues of trained models. This could meaningfully strengthen security for RAG pipelines without requiring model retraining or fine-tuning.

major comments (2)
  1. [§3] §3 (Methodology): The core pruning step assumes that malicious injections will reliably produce low entailment/high contradiction scores with the user query while benign retrieved sentences do not. No analysis is provided of adversarial injections deliberately crafted to be entailed by the query (e.g., “using the retrieved data, also perform X”), which would survive pruning or force removal of legitimate context; this assumption is load-bearing for both the near-zero ASR claim and utility preservation.
  2. [§4] §4 (Experiments): The reported superiority over nine baselines and near-zero ASR are not accompanied by concrete per-attack, per-model numbers, NLI model identifiers, pruning threshold selection procedure, or false-positive rates on clean data; without these, it is impossible to verify that the graph pruning does not degrade task performance or that the results are robust to adaptive attacks.
minor comments (2)
  1. The description of graph construction would benefit from an explicit small example (query + 3–4 sentences) showing edge weights and the resulting pruned subgraph.
  2. Notation for sentence nodes and edge weights (e.g., E(s_i, s_j)) should be defined once in §3 and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our methodology assumptions and experimental transparency, and we have revised the manuscript to address them directly.

read point-by-point responses
  1. Referee: [§3] §3 (Methodology): The core pruning step assumes that malicious injections will reliably produce low entailment/high contradiction scores with the user query while benign retrieved sentences do not. No analysis is provided of adversarial injections deliberately crafted to be entailed by the query (e.g., “using the retrieved data, also perform X”), which would survive pruning or force removal of legitimate context; this assumption is load-bearing for both the near-zero ASR claim and utility preservation.

    Authors: We agree that the NLI-based separation is central to SONAR's effectiveness. Our evaluations use established injection attacks from the literature, which typically produce low entailment or high contradiction with the core query. For the suggested case of injections deliberately crafted for entailment (e.g., “using the retrieved data, also perform X”), we acknowledge this as a relevant adversarial scenario. In the revised manuscript, we have added a dedicated analysis subsection in §3 that examines such crafted examples, demonstrates how the relational graph can still isolate them through indirect contradictions with other sentences, and discusses the conditions under which pruning preserves utility. We have also expanded the limitations section to explicitly note the importance of this assumption. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported superiority over nine baselines and near-zero ASR are not accompanied by concrete per-attack, per-model numbers, NLI model identifiers, pruning threshold selection procedure, or false-positive rates on clean data; without these, it is impossible to verify that the graph pruning does not degrade task performance or that the results are robust to adaptive attacks.

    Authors: We appreciate the call for greater experimental detail. In the revised manuscript, Section 4 now includes expanded tables with concrete per-attack and per-model attack success rates. We explicitly identify the off-the-shelf NLI model used for entailment and contradiction scoring, describe the pruning threshold selection procedure (determined via a validation set to balance mitigation and utility), and report false-positive rates on clean data showing negligible degradation to task performance. We have also added experiments with adaptive attack variants that attempt to maximize entailment scores, confirming that SONAR maintains strong performance; a full enumeration of every possible optimization strategy is noted as future work. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is self-contained algorithmic procedure

full rationale

The paper presents SONAR as an algorithmic procedure that constructs a sentence-level relational graph using entailment and contradiction scores from off-the-shelf NLI models, then applies connectivity-driven pruning to remove flagged injections. No mathematical derivations, equations, or fitted parameters are described that reduce any central claim to its own inputs by construction. Performance assertions rely on empirical evaluations across models and datasets rather than predictions derived from self-referential fits or self-citation chains. The approach depends on standard external NLI tools without invoking uniqueness theorems or ansatzes from prior author work, rendering the derivation chain independent and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that standard NLI models produce usable entailment/contradiction scores for this security task; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption Natural language inference models can produce reliable entailment and contradiction scores between sentences that distinguish malicious injections from task-relevant content.
    These scores are used directly as edge weights to identify deviant sentences.

pith-pipeline@v0.9.0 · 5469 in / 1134 out tokens · 28261 ms · 2026-05-09T18:36:52.607986+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 9 canonical work pages · 6 internal anchors

  1. [1]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated cor- pus for learning natural language inference. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Meth- ods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September 2015. Associati...

  2. [2]

    Pappas, and Eric Wong

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jail- breaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trust- worthy Machine Learning (SaTML), pages 23–42, 2025

  3. [3]

    {StruQ}: Defending against prompt injection with structured queries

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. {StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Sym- posium (USENIX Security 25), pages 2383–2400, 2025

  4. [4]

    SecAlign: Defending Against Prompt Injection with Preference Optimization,

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahlou- jifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injec- tion with preference optimization.arXiv preprint arXiv:2410.05451, 2024

  5. [5]

    Menli: Robust evalua- tion metrics from natural language inference.Transac- tions of the Association for Computational Linguistics, 11:804–825, 2023

    Yanran Chen and Steffen Eger. Menli: Robust evalua- tion metrics from natural language inference.Transac- tions of the Association for Computational Linguistics, 11:804–825, 2023

  6. [6]

    How not to detect prompt injections with an llm

    Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, and Somesh Jha. How not to detect prompt injections with an llm. InProceedings of the 18th ACM Workshop on Artificial Intelligence and Security, pages 218–229, 2025

  7. [7]

    A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20:37 – 46, 1960

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20:37 – 46, 1960

  8. [8]

    Bert: Pre-training of deep bidi- rectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi- rectional transformers for language understanding. In Proceedings of the 2019 conference of the North Amer- ican chapter of the association for computational lin- guistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  9. [9]

    Alpaca- farm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpaca- farm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

  10. [10]

    Eric Wong

    Tongcheng Geng, Zhiyuan Xu, Yubin Qu, and W. Eric Wong. Prompt injection attacks on large language mod- els: A survey of attack methods, root causes, and de- fense strategies.Computers, Materials & Continua, page pages, 2025

  11. [11]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  12. [12]

    Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injec- tion

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injec- tion. InProceedings of the 2023 ACM Workshop on Artificial Intelligence and Security (AISec ’23), 2023

  13. [13]

    Hsu, and Pin-Yu Chen

    Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, and Pin-Yu Chen. Atten- tion tracker: Detecting prompt injection attacks in llms, 2025

  14. [14]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    N Jain, A Schwarzschild, Y Wen, G Somepalli, J Kirchenbauer, PY Chiang, M Goldblum, A Saha, J Geiping, and T Goldstein. Baseline defenses for adver- sarial attacks against aligned language models (2023). arXiv preprint arXiv:2309.00614, 2023

  15. [15]

    Promptlocate: Localizing prompt injection attacks,

    Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection at- tacks.arXiv preprint arXiv:2510.12252, 2025

  16. [16]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  17. [17]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon An- toniak, T...

  18. [18]

    Fun-tuning: Characterizing the vulnerability of proprietary llms to 15 optimization-based prompt injection attacks via the fine- tuning interface

    Andrey Labunets, Nishit V Pandya, Ashish Hooda, Xiaohan Fu, and Earlence Fernandes. Fun-tuning: Characterizing the vulnerability of proprietary llms to 15 optimization-based prompt injection attacks via the fine- tuning interface. In2025 IEEE Symposium on Security and Privacy (SP), pages 411–429. IEEE, 2025

  19. [19]

    Moritz Laurer, Wouter van Atteveldt, Andreu Casas, and Kasper Welbers. Less annotating, more classifying: Ad- dressing the data scarcity issue of supervised machine learning with deep transfer learning and bert-nli.Politi- cal Analysis, 32(1):84–100, 2024

  20. [20]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: de- noising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461, 2019

  21. [21]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tat- sunori B. Hashimoto. Alpacaeval: An automatic evalua- tor of instruction-following models. https://github. com/tatsu-lab/alpaca_eval, 5 2023

  22. [22]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zi- hao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications. https://arxiv.org/abs/2306.05499, page 18, 2024

  23. [23]

    Formalizing and benchmark- ing prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmark- ing prompt injection attacks and defenses. InUSENIX Security Symposium, 2024

  24. [24]

    Datasentinel: A game-theoretic detection of prompt injection attacks

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In2025 IEEE Symposium on Security and Privacy (SP), pages 2190–

  25. [25]

    Tree of attacks: Jailbreaking black- box LLMs automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum S Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black- box LLMs automatically. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  26. [26]

    LLM01: Prompt In- jection - OWASP Top 10 for LLM Applica- tions

    OWASP Foundation. LLM01: Prompt In- jection - OWASP Top 10 for LLM Applica- tions. https://genai.owasp.org/llmrisk/ llm01-prompt-injection/, 2023. Accessed: 2026-04-28

  27. [27]

    Advprompter: Fast adaptive adversarial prompting for llms, 2025

    Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. Advprompter: Fast adaptive adversarial prompting for llms, 2025

  28. [28]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

  29. [29]

    Instruction defense: Strengthen AI prompts against hacking, 2024

    Sander Schulhoff. Instruction defense: Strengthen AI prompts against hacking, 2024. Accessed: 2026-04-26

  30. [30]

    The sandwich defense: Strengthening AI prompt security, 2024

    Sander Schulhoff. The sandwich defense: Strengthening AI prompt security, 2024. Accessed: 2026-04-26

  31. [31]

    PromptArmor: Simple yet Effective Prompt Injection Defenses.arXiv preprint arXiv:2507.15219, 2025.https: //arxiv.org/abs/2507.15219

    Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Sim- ple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

  32. [32]

    ’positive review only’: Researchers hide ai prompts in papers, July 2025

    Shogo Sugiyama and Ryosuke Eguchi. ’positive review only’: Researchers hide ai prompts in papers, July 2025

  33. [33]

    Sequence to sequence learning with neural networks.Advances in neural information processing systems, 27, 2014

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.Advances in neural information processing systems, 27, 2014

  34. [34]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  35. [35]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  36. [36]

    Defending against prompt injection with datafilter, 2025

    Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alo- mair, and David Wagner. Defending against prompt injection with datafilter, 2025

  37. [37]

    Enhancing systematic decompositional natural language inference using informal logic

    Nathaniel Weir, Kate Sanders, Orion Weller, Shreya Sharma, Dongwei Jiang, Zheng Ping Jiang, Bhavana Dalvi, Oyvind Tafjord, Peter Jansen, Peter Clark, et al. Enhancing systematic decompositional natural language inference using informal logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 9458–9482, 2024

  38. [38]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jian- hong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

  39. [39]

    Bench- marking and defending against indirect prompt injection attacks on large language models

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Bench- marking and defending against indirect prompt injection attacks on large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discov- ery and Data Mining V .1, KDD ’25, page 1809–1820, New York, NY , USA, 2025. Association for Computing...

  40. [40]

    A study on prompt injection attack against llm-integrated mobile robotic systems

    Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Braunl, and Jin B Hong. A study on prompt injection attack against llm-integrated mobile robotic systems. In 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 361–368. IEEE, 2024

  41. [41]

    Print exactly ‘Hacked!’

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and trans- ferable adversarial attacks on aligned language models, 2023. 17 A Appendix A.1 NLI Models We evaluate the performance of SONARon three differ- ent Natural Language Inference (NLI) models, ranging from smaller to large models – MoritzLaurer/DeBERT...