Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Aaditya Pai

arxiv: 2606.18530 · v1 · pith:7KUKBBJGnew · submitted 2026-06-16 · 💻 cs.CR · cs.CL· cs.LG

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Aaditya Pai This is my paper

Pith reviewed 2026-06-26 23:43 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG

keywords prompt injectiondomain camouflageAI agent securityprompting defensesLLM securityinjection attacks

0 comments

The pith

Paraphrasing retrieved content before agent processing most effectively defends against domain-camouflaged injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks five prompting-based defenses against attacks that embed malicious instructions in retrieved content using domain-appropriate vocabulary. Across three model families, three domains, and 3510 trials, paraphrasing the content before processing reduces attack success rates by 55-84 percent and outperforms a Llama Guard 4 baseline on every model. Defense performance varies sharply by model, with spotlighting effective on Claude Haiku but useless on Llama 3.1 8B, and financial domains retain the highest residual risk. The evaluation relies on synthetically generated professional documents.

Core claim

Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested.

What carries the argument

The benchmark comparing five prompting defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) on synthetically constructed professional documents in financial, legal, and general domains.

Load-bearing premise

Synthetically constructed professional documents capture the relevant properties of real enterprise documents for measuring domain-camouflaged attack success.

What would settle it

Running the same attacks and defenses on a corpus of actual enterprise documents from the three domains and checking whether paraphrasing retains the top ranking and the reported reduction percentages.

Figures

Figures reproduced from arXiv: 2606.18530 by Aaditya Pai.

**Figure 2.** Figure 2: Baseline attack success rate with no defense: static (override-directive) vs. domain [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Financial-domain camouflage example (Gemini 2.0 Flash, task fin [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Camouflage ASR (%) across nine model–domain combinations (rows) and six [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84\% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33\% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

First benchmark of prompting defenses vs domain-camouflaged injections shows paraphrasing works best but leaves 26-33% residual risk in finance; synthetic docs are the main limit.

read the letter

The paper runs the first head-to-head test of five prompting defenses against domain-camouflaged injection attacks. It covers Claude Haiku, Llama 3.1 8B, and Gemini 2.0 Flash across financial, legal, and general domains with 3510 trials. Paraphrasing retrieved content before agent processing cuts attack success rates the most (55-84% depending on model) and beats their Llama Guard 4 baseline on every model. Spotlighting helps on Claude but does nothing on Llama, and financial deployments show the highest leftover risk at 26-33% even with defenses.

The work is straightforward empirical benchmarking and reports model- and domain-specific patterns that were not previously available. The scale of trials and the clear comparative numbers are the useful parts; practitioners who need to pick a defense for retrieval-augmented agents now have concrete relative rankings instead of just the abstract idea that prompting might help.

The soft spot is the synthetic documents. The abstract states outright that whether the rankings transfer to real enterprise files is unknown, and that matters because the attack succeeds by matching domain vocabulary and structure. If real documents differ in length, entity density, or retrieval noise, the ordering of paraphrasing versus sandwiching or spotlighting could shift and the reported reductions could shrink. The abstract also gives no error bars, success-rate definitions, or statistical tests, so the exact percentages are difficult to assess for robustness.

This is for teams deploying LLM agents that pull external content, especially in regulated areas. It supplies actionable comparisons even with the synthetic-data caveat. The paper is honest about its scope and the claims rest on measured outcomes rather than circular definitions, so it shows clear thinking. It deserves peer review; the main fixes needed are more discussion of the synthetic-to-real gap and basic statistical reporting.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection attacks. Using 3,510 trials on Claude Haiku, Llama 3.1 8B, and Gemini 2.0 Flash across financial, legal, and general domains with synthetically constructed professional documents, it concludes that paraphrasing retrieved content is the most consistently effective defense, reducing attack success rate by 55-84% depending on model and outperforming the authors' Llama Guard 4 configuration on every model. Effectiveness is model-dependent, financial domain shows highest residual risk (26-33%), and the paper supplies benchmark-based practitioner recommendations while explicitly noting that generalization from synthetic to real enterprise documents remains untested.

Significance. If the synthetic-document results transfer, the work supplies the first systematic benchmark focused on camouflage-class attacks and concrete guidance on defense selection. The scale (3,510 trials), multi-model/multi-domain design, and direct comparison to a guardrail baseline are strengths. The manuscript's explicit statement that generalization remains an open question is a positive transparency feature.

major comments (1)

[Abstract] Abstract: The central claim that paraphrasing is the most consistently effective defense and the provision of benchmark-based practitioner recommendations rest on the assumption that synthetically constructed documents reproduce the lexical, structural, and retrieval-context properties that determine attack success and defense performance in real enterprise documents. The abstract itself states that whether the reported rankings and residual risks (e.g., 26-33% in finance) generalize remains an open question; if real documents differ in entity density, length, or vocabulary distribution, the relative ordering of paraphrasing, spotlighting, and sandwiching could change.

minor comments (1)

[Abstract] Abstract: No definition of attack success rate, no mention of statistical tests or error bars, and no description of how the 3,510 trials are distributed across models/domains/defenses are supplied, even though these details are needed to interpret the reported percentage reductions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that paraphrasing is the most consistently effective defense and the provision of benchmark-based practitioner recommendations rest on the assumption that synthetically constructed documents reproduce the lexical, structural, and retrieval-context properties that determine attack success and defense performance in real enterprise documents. The abstract itself states that whether the reported rankings and residual risks (e.g., 26-33% in finance) generalize remains an open question; if real documents differ in entity density, length, or vocabulary distribution, the relative ordering of paraphrasing, spotlighting, and sandwiching could change.

Authors: We agree that the reported effectiveness rankings and residual risks are valid only to the extent that the synthetic documents capture the relevant lexical, structural, and retrieval properties of real enterprise documents. The abstract already states this limitation verbatim: 'All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.' Our central claims and practitioner recommendations are therefore explicitly scoped to the synthetic benchmark; we do not assert that paraphrasing will remain the most effective defense, or that the 26-33% residual risk in finance will hold, once real documents are substituted. The manuscript presents the results as the first systematic benchmark on camouflage-class attacks under controlled conditions, with the generalization question left open precisely because differences in entity density, length, or vocabulary could alter the relative ordering of the defenses. revision: no

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark with measured outcomes

full rationale

The paper reports results from 3,510 trials evaluating prompting defenses on synthetic documents across models and domains. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided abstract or description. The central claim (paraphrasing reduces ASR by 55-84%) is a measured empirical outcome, not a quantity defined in terms of itself. The abstract explicitly flags the synthetic-to-real generalization as an open question rather than assuming it. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical benchmarking study. No free parameters or invented entities are introduced. The sole load-bearing domain assumption is that synthetic documents are representative enough to rank defenses, which the abstract itself marks as unverified.

axioms (1)

domain assumption Synthetically constructed professional documents are sufficiently representative of real enterprise documents for ranking defense effectiveness against domain-camouflaged attacks.
All 3,510 trials use these documents; the abstract explicitly states generalization remains an open question.

pith-pipeline@v0.9.1-grok · 5779 in / 1481 out tokens · 45482 ms · 2026-06-26T23:43:19.358108+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 9 canonical work pages · 7 internal anchors

[1]

AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi \'c , Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems, 2024

2024
[2]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Nicholas Carlini, Milad Nasr, and Florian Tram \`e r. CaMeL : Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

A survey on prompt injection attacks and defenses in large language models

Jiahao Geng, Fengyang Deng, Minghao Li, Juncheng Liu, and Chenghe Fu. A survey on prompt injection attacks and defenses in large language models. arXiv preprint arXiv:2410.01234, 2024

work page arXiv 2024
[4]

Benchmarking and defending against indirect prompt injection attacks on large language models,

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against prompt injection with hierarchical instruction following. arXiv preprint arXiv:2312.14197, 2024

work page arXiv 2024
[5]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based input-output safeguard for human- AI conversations. arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Security Symposium, 2024

2024
[8]

Llama Guard 3 : Meta llama 3 instruct-based LLM safety model

Meta AI . Llama Guard 3 : Meta llama 3 instruct-based LLM safety model. https://ai.meta.com/research/publications/llama-guard-3/, 2024

2024
[9]

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai. Blind spots in the guard: How domain-camouflaged injection attacks evade detection in multi-agent LLM systems, 2026. URL https://arxiv.org/abs/2605.22001

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Ignore Previous Prompt: Attack Techniques For Language Models

F \'a bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, Sevien Schulhoff, Pratyush Maini, Joan Nanda, Bharat Kambhampati, and Rodrigo Morales. The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

[1] [1]

AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi \'c , Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems, 2024

2024

[2] [2]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Nicholas Carlini, Milad Nasr, and Florian Tram \`e r. CaMeL : Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

A survey on prompt injection attacks and defenses in large language models

Jiahao Geng, Fengyang Deng, Minghao Li, Juncheng Liu, and Chenghe Fu. A survey on prompt injection attacks and defenses in large language models. arXiv preprint arXiv:2410.01234, 2024

work page arXiv 2024

[4] [4]

Benchmarking and defending against indirect prompt injection attacks on large language models,

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against prompt injection with hierarchical instruction following. arXiv preprint arXiv:2312.14197, 2024

work page arXiv 2024

[5] [5]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based input-output safeguard for human- AI conversations. arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Security Symposium, 2024

2024

[8] [8]

Llama Guard 3 : Meta llama 3 instruct-based LLM safety model

Meta AI . Llama Guard 3 : Meta llama 3 instruct-based LLM safety model. https://ai.meta.com/research/publications/llama-guard-3/, 2024

2024

[9] [9]

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai. Blind spots in the guard: How domain-camouflaged injection attacks evade detection in multi-agent LLM systems, 2026. URL https://arxiv.org/abs/2605.22001

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Ignore Previous Prompt: Attack Techniques For Language Models

F \'a bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, Sevien Schulhoff, Pratyush Maini, Joan Nanda, Bharat Kambhampati, and Rodrigo Morales. The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =