The Surface You Test Is Not the Surface That Breaks

Nafiul Haque; Shahrear Bin Amin; Shifat E Arman; Syed Nazmus Sakib

arxiv: 2605.30454 · v1 · pith:AMVLOTCEnew · submitted 2026-05-28 · 💻 cs.CR · cs.AI

The Surface You Test Is Not the Surface That Breaks

Shifat E Arman , Syed Nazmus Sakib , Nafiul Haque , Shahrear Bin Amin This is my paper

Pith reviewed 2026-06-29 06:35 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionLLM agentstool-augmented modelsattack surfacesvulnerability evaluationcontext channelsadaptive attack rate

0 comments

The pith

The same prompt-injection bytes succeed or fail depending on which channel delivers them to the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that tool-augmented LLM agents face two distinct injection surfaces: the tool output the model receives after a call and the tool description it reads before every call. When identical payloads are sent through both surfaces on thirteen models, attack success rates invert across models, with some models highly vulnerable on outputs and nearly immune on descriptions, and vice versa. A variance breakdown of thousands of trials shows the surface alone explains none of the outcome differences while the model-surface pairing accounts for a sizable share. Standard defenses that appear effective on one surface leave the other exposed. Evaluations that report only a single surface therefore understate the actual risk to agents.

Core claim

Holding the injection payload byte-identical and testing it on both the tool-output channel and the tool-description channel across thirteen models from six families reveals that success rates can reverse completely between models. GPT-4.1 reaches 96 percent success on outputs but only 4 percent on descriptions, while GEMINI-3-FLASH shows the opposite pattern at 20 percent and 98 percent. Variance decomposition attributes zero percent of attack-outcome variation to surface alone and 16.7 percent to the model-surface interaction. The per-cell maximum over surfaces, termed the Adaptive Attack Rate, exceeds the strongest single-surface baseline by 9.1 percentage points on average. Prompt-level

What carries the argument

The model-surface interaction that determines attack success when the identical payload is delivered through either the tool-output channel or the tool-description channel.

If this is right

The Adaptive Attack Rate, defined as the maximum success rate over the two surfaces for each model, is the relevant security metric rather than any single-channel rate.
Prompt-level defenses must be evaluated separately on each surface or they will leave at least one channel open above 50 percent success.
Benchmarking protocols that test only the tool-output channel systematically underestimate agent vulnerability.
Attack and defense papers should report per-surface vulnerability numbers for every model examined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Security testing of agents will need to enumerate every context channel an attacker can write into, not just the most obvious one.
Model providers may need to apply different input sanitization or context-separation rules to tool descriptions versus tool outputs.

Load-bearing premise

That success-rate differences between the two channels can be attributed to the delivery surface rather than to how each model internally processes the same bytes.

What would settle it

A replication in which attack success rates on tool outputs and tool descriptions remain statistically indistinguishable across a new set of models and tasks would falsify the claim that vulnerability is a property of the pairing.

Figures

Figures reproduced from arXiv: 2605.30454 by Nafiul Haque, Shahrear Bin Amin, Shifat E Arman, Syed Nazmus Sakib.

**Figure 2.** Figure 2: Per-model decomposition of the adaptive lift across 13 LLMs. For each model, the data-surface and [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Cell-level surface preference across the 52-cell [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Cost-effectiveness of surface fingerprinting. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Leave-one-out and suite-restriction robustness [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Fine-grained surface-gap heatmap. SOMsigned resolved to the (suite, injection-task) level for each model; blue indicates a schema-surface preference, red a data-surface preference. The per-cell preferences of [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Two-dimensional embedding of the 26 (model, surface) behavior vectors, colored by surface. Same-surface [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Silent-execution rate by model and surface, [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

read the original abstract

Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. We hold the injection payload byte-identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT-4.1 is 96 percent vulnerable on tool outputs but only 4 percent on tool descriptions, while GEMINI-3-FLASH shows the mirror pattern at 20 percent and 98 percent. A variance decomposition over 6,830 attempts attributes 0 percent of the variation in attack outcomes to the surface alone, while the model-surface interaction accounts for 16.7 percent. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per-cell maximum over surfaces, exceeds the strongest fixed-surface baseline by +9.1 percentage points on average. Standard prompt-level defenses inherit the same blindspot, reducing tool-output ASR to 10-18 percent while leaving the description channel above 54 percent. Both attack and defense evaluation must report per-surface vulnerability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows attack success rates can invert across models when the same payload hits tool descriptions versus outputs, but prompt structure differences may drive the effect more than the surface label.

read the letter

The core finding here is that the same injection payload produces opposite vulnerability patterns depending on the delivery surface and the model. GPT-4.1 hits 96% on tool outputs but only 4% on descriptions, while Gemini-3-Flash reverses that. Across 13 models and 6830 attempts the variance breakdown shows zero main effect from surface alone and 16.7% from the model-surface interaction. They define an Adaptive Attack Rate as the per-model maximum across surfaces and report it beats the best single-surface baseline by 9.1 points on average. Standard defenses also leave the description channel largely open.

What the work does cleanly is run byte-identical payloads on both channels across multiple families and task suites, then quantify the interaction term. That moves the conversation past single-channel reporting.

The soft spot is whether the two surfaces are isolated enough to attribute the difference to the channel itself. Tool descriptions live in the initial schema or system prompt and get read every turn in structured form. Tool outputs arrive later as function responses. Position, token neighborhood, and instruction stage all differ, so the observed flips could trace to those structural details rather than the surface label. The abstract does not spell out the exact prompt templates or attention patterns, which leaves the isolation claim open to the concern raised in the stress test.

This is for people who run or review agent security evaluations. It flags a concrete blind spot in current practice. The empirical scale and the interaction result are worth referee time even if the controls need tightening. I would send it to review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The paper claims that prompt injection vulnerability in tool-augmented LLM agents is a property of the model-surface pairing rather than the channel alone. Holding payloads byte-identical, the authors test tool outputs versus tool descriptions across 13 LLMs from six families and four task suites. They report model-specific inversions (GPT-4.1: 96% outputs vs 4% descriptions; GEMINI-3-FLASH: 20% vs 98%), a variance decomposition over 6,830 attempts attributing 0% variation to surface main effect and 16.7% to model-surface interaction, an Adaptive Attack Rate (per-cell max) exceeding the strongest fixed-surface baseline by +9.1 pp on average, and standard defenses reducing output ASR to 10-18% while leaving descriptions above 54%. The conclusion is that both attack and defense evaluations must report per-surface vulnerability.

Significance. If the surfaces are shown to be comparable after addressing structural confounds, the result would meaningfully shift evaluation practices in LLM agent security by demonstrating that single-channel ASR reporting is insufficient and that adaptive, multi-surface testing is required. The scale of the experiment (13 models, multiple families and task suites, 6,830 attempts) and the concrete quantification of interaction effects and defense blind spots are strengths that provide falsifiable, actionable findings for the field.

major comments (1)

[Abstract / Experimental Setup] Abstract and experimental design: the central claim that ASR differences can be attributed to the surface (rather than delivery differences) is load-bearing but rests on an assumption that byte-identical payloads through tool outputs versus tool descriptions constitute comparable attack surfaces. Tool descriptions are embedded in the initial system/tool schema (read every turn, structured XML/JSON), while tool outputs arrive later as function responses; these differ in position, surrounding tokens, attention patterns, and instruction-following stage. The reported variance decomposition (0% surface main effect, 16.7% interaction) does not isolate a true surface effect from these structural confounds, which could explain the observed model-specific inversions (e.g., GPT-4.1 96% vs 4%).

minor comments (2)

The abstract states 'four task suites' but does not name them; the main text should explicitly list the suites and their characteristics to support reproducibility.
Clarify in the methods how exactly the byte-identical payloads are embedded (e.g., exact prompt templates for each surface) so readers can assess the structural differences.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the distinction between payload content and structural delivery. Our response addresses the concern directly while preserving the core empirical finding that model-specific inversions occur even under byte-identical payloads.

read point-by-point responses

Referee: [Abstract / Experimental Setup] Abstract and experimental design: the central claim that ASR differences can be attributed to the surface (rather than delivery differences) is load-bearing but rests on an assumption that byte-identical payloads through tool outputs versus tool descriptions constitute comparable attack surfaces. Tool descriptions are embedded in the initial system/tool schema (read every turn, structured XML/JSON), while tool outputs arrive later as function responses; these differ in position, surrounding tokens, attention patterns, and instruction-following stage. The reported variance decomposition (0% surface main effect, 16.7% interaction) does not isolate a true surface effect from these structural confounds, which could explain the observed model-specific inversions (e.g., GPT-4.1 96% vs 4%).

Authors: We agree that the two surfaces differ in structural embedding, position, surrounding tokens, and processing stage; these differences are intrinsic to the surfaces rather than extraneous confounds. Our operational definition of 'surface' encompasses the full delivery mechanism (including schema embedding for descriptions and function-response formatting for outputs). The byte-identical payload controls for content while allowing the structural and positional differences to vary naturally. The variance decomposition is consistent with this view: the 0% surface main effect indicates neither surface is universally stronger, while the 16.7% interaction term captures the model-specific sensitivity to each delivery structure. The striking inversions (e.g., GPT-4.1 vs. GEMINI-3-FLASH) are difficult to attribute solely to unmeasured confounds because the same payload produces opposite outcomes across models. We will revise the abstract, methods, and discussion to explicitly define 'surface' as the complete delivery channel (including structural properties) and to note that the experiment does not attempt to factor out those properties from the surface itself. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of observed attack rates

full rationale

The paper reports experimental attack success rates (ASR) for byte-identical payloads delivered via two surfaces (tool outputs vs. tool descriptions) across 13 LLMs. Central results are the observed inversion patterns, variance decomposition (0% surface main effect, 16.7% interaction), and the definition of Adaptive Attack Rate as the per-cell maximum. These are direct aggregations and statistical summaries of measured outcomes, not derivations that reduce to fitted parameters or self-referential quantities. No equations, predictions, or uniqueness theorems appear; no self-citations are load-bearing for the claims. The evaluation is self-contained against external benchmarks (multiple models, task suites, and defense baselines).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security evaluation paper; the central claims rest on experimental observations across models and surfaces rather than on mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5804 in / 1005 out tokens · 28417 ms · 2026-06-29T06:35:25.483193+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 12 canonical work pages · 8 internal anchors

[1]

Maksym Andriushchenko, Nicolas Flammarion, and 1 others. 2025. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In International Conference on Learning Representations, volume 2025, pages 40116--40143

2025
[2]

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, and 1 others. 2024. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932

work page arXiv 2024
[3]

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274--283. PMLR

2018
[4]

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. 2023. Abusing images and sounds for indirect instruction injection in multi-modal llms. arXiv preprint arXiv:2307.10490

work page arXiv 2023
[5]

Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2024. Defending against alignment-breaking attacks via robustly aligned llm. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10542--10560

2024
[6]

Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705

work page internal anchor Pith review Pith/arXiv arXiv 2019
[7]

Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, and 1 others. 2024. Stealing part of a production language model. arXiv preprint arXiv:2403.06634

work page arXiv 2024
[8]

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42. IEEE

2025
[9]

Francesco Croce and Matthias Hein. 2020. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206--2216. PMLR

2020
[10]

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. 2024. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895--82920

2024
[11]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pages 79--90

2023
[12]

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Zongze Li, Jiawei Guo, and Haipeng Cai. 2025. System prompt poisoning: Persistent attacks on large language models beyond user injection. arXiv preprint arXiv:2505.06493

work page arXiv 2025
[14]

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, and 1 others. 2025. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections. arXiv preprint arXiv:2510.09023

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2024. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37:126544--126565

2024
[16]

F \'a bio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2024. Toolllm: Facilitating large language models to master 16000+ real-world apis. In International Conference on Learning Representations, volume 2024, pages 9695--9717

2024
[18]

Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. 2023. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539--68551

2023
[20]

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. 2025. Prompt injection attack to tool selection in llm agents. arXiv preprint arXiv:2504.19793

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. 2020. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633--1645

2020
[22]

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. 2026. Mcptox: A benchmark for tool poisoning on real-world mcp servers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35811--35819

2026
[24]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079--80110

2023
[25]

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1809--1820

2025
[26]

Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. 2025. Adaptive attacks break defenses against indirect prompt injection attacks on llm agents. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7101--7117

2025
[27]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471--10506

2024
[28]

Rupeng Zhang, Haowei Wang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, and Qing Wang. 2025. From allies to adversaries: Manipulating llm tool-calling through adversarial injection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

2025
[29]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[31]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

Maksym Andriushchenko, Nicolas Flammarion, and 1 others. 2025. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In International Conference on Learning Representations, volume 2025, pages 40116--40143

2025

[2] [2]

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, and 1 others. 2024. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932

work page arXiv 2024

[3] [3]

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274--283. PMLR

2018

[4] [4]

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. 2023. Abusing images and sounds for indirect instruction injection in multi-modal llms. arXiv preprint arXiv:2307.10490

work page arXiv 2023

[5] [5]

Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2024. Defending against alignment-breaking attacks via robustly aligned llm. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10542--10560

2024

[6] [6]

Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705

work page internal anchor Pith review Pith/arXiv arXiv 2019

[7] [7]

Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, and 1 others. 2024. Stealing part of a production language model. arXiv preprint arXiv:2403.06634

work page arXiv 2024

[8] [8]

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42. IEEE

2025

[9] [9]

Francesco Croce and Matthias Hein. 2020. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206--2216. PMLR

2020

[10] [10]

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. 2024. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895--82920

2024

[11] [11]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pages 79--90

2023

[12] [12]

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Zongze Li, Jiawei Guo, and Haipeng Cai. 2025. System prompt poisoning: Persistent attacks on large language models beyond user injection. arXiv preprint arXiv:2505.06493

work page arXiv 2025

[14] [14]

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, and 1 others. 2025. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections. arXiv preprint arXiv:2510.09023

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2024. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37:126544--126565

2024

[16] [16]

F \'a bio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2024. Toolllm: Facilitating large language models to master 16000+ real-world apis. In International Conference on Learning Representations, volume 2024, pages 9695--9717

2024

[18] [18]

Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. 2023. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539--68551

2023

[20] [20]

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. 2025. Prompt injection attack to tool selection in llm agents. arXiv preprint arXiv:2504.19793

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. 2020. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633--1645

2020

[22] [22]

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. 2026. Mcptox: A benchmark for tool poisoning on real-world mcp servers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35811--35819

2026

[24] [24]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079--80110

2023

[25] [25]

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1809--1820

2025

[26] [26]

Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. 2025. Adaptive attacks break defenses against indirect prompt injection attacks on llm agents. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7101--7117

2025

[27] [27]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471--10506

2024

[28] [28]

Rupeng Zhang, Haowei Wang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, and Qing Wang. 2025. From allies to adversaries: Manipulating llm tool-calling through adversarial injection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

2025

[29] [29]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[31] [31]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...