pith. sign in

arxiv: 2606.19660 · v1 · pith:FHN3RHVUnew · submitted 2026-06-17 · 💻 cs.CR · cs.CL

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Pith reviewed 2026-06-26 19:58 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords prompt injectionRAG chatbotslayered securityLLM defensesemantic anomaly detectionprovenance hierarchyattack success rate
0
0 comments X

The pith

A three-layer framework reduces prompt injection success rate in RAG chatbots from 71.4% to 11.3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a three-layer defense intercepts prompt injection at input screening, context assembly, and output auditing to protect RAG chatbots. Existing isolated defenses leave gaps because input filters miss poisoned documents and output monitors cannot stop payloads from reaching the model. The approach combines rule-based and semantic checks on user input, a provenance hierarchy that keeps retrieved content from overriding policy, and post-generation policy enforcement plus drift detection. Evaluation across 5080 samples on three different models shows the combined layers cut attack success while holding false positives to 4.8% and adding only 61 ms median latency. A continuous logging loop allows the system to adapt without retraining the underlying model.

Core claim

The paper claims that a model-agnostic middleware framework using three complementary layers reduces Attack Success Rate from 71.4% to 11.3% on 5080 samples spanning GPT-4o, Llama 3, and Mistral 7B. Layer 1 applies rule patterns and a fine-tuned semantic anomaly classifier to user input. Layer 2 enforces a provenance-based instruction hierarchy during context assembly so retrieved documents cannot override operator policy. Layer 3 runs a policy rule engine and semantic drift detector on model output before delivery. Ablation results show the three layers together exceed the protection of any single layer or published guardrail system.

What carries the argument

The three-layer interception pipeline that screens input, enforces provenance hierarchy in context assembly, and audits output before delivery.

If this is right

  • All three layers supply complementary protection whose combined effect exceeds the sum of the parts.
  • The framework outperforms the best single-layer baseline by 27.3 percentage points and a published guardrail by 23.8 percentage points.
  • False positive rate stays at 4.8% and median added latency is 61.2 ms across the tested models.
  • The continuous audit loop aggregates logs to support retraining and adaptation to new attack patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deploying the middleware in existing RAG pipelines would require only configuration of the policy rules and periodic classifier updates rather than model changes.
  • If semantic drift detection thresholds are set too loosely, the system could miss subtle output manipulations that still achieve injection goals.
  • The same layered structure could be tested against related threats such as retrieval poisoning or context window overflow attacks.
  • Long-term maintenance cost depends on how quickly new attack patterns appear relative to the retraining cycle described in the audit loop.

Load-bearing premise

The 5080 evaluation samples and ablation studies accurately represent realistic direct and indirect prompt-injection attacks that would occur in production RAG deployments.

What would settle it

A production RAG chatbot protected by the framework that still experiences an attack success rate above 11.3% on realistic direct or indirect injections would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.19660 by Ali Hassan, Gulshan Saleem, Muhammad Imran Zaman, Nisar Ahmed.

Figure 1
Figure 1. Figure 1: Architecture of the target RAG-based chatbot, depicting two attack types marked with warning signs indicating direct injection via the user input channel and indirect injection via the external knowledge base. The architectural property that enables prompt injection is the absence of an instruction-plane and data-plane separation. The LLM processes the system prompt, retrieved documents, and the user messa… view at source ↗
Figure 2
Figure 2. Figure 2: Taxonomy of prompt injection attack vectors considered in this work. Greyed attacks are explicitly out of scope (see Section 3.5). reasons. First, it is invisible to defenses that inspect only the user turn. Second, retrieved content is often trusted implicitly by the model because it arrives via the system-controlled retrieval path [8]. Third, a single poisoned document can affect all users whose queries … view at source ↗
Figure 3
Figure 3. Figure 3: Mapping of attack vectors and sub-classes to attacker objectives (IO = Instruction Override; DE = Data Exfiltration; BM = Behavioral Manipulation). 3.5 Scope and Exclusions The following threat classes are explicitly out of scope [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: illustrates the position of each layer within the RAG inference pipeline [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Input screening process in Layer 1 combining signature-based detection and semantic anomaly analysis for prompt injection mitigation. mitigates this by enforcing an explicit privilege hierarchy at context assembly time, before the concatenated context is passed to the inference engine. 4.3.1 Instruction Hierarchy We define three privilege tiers based on the provenance of each context segment. 1. Operator T… view at source ↗
Figure 6
Figure 6. Figure 6: Layer 2 context assembly with provenance tagging and privilege-aware context integration prior to LLM inference [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Layer 3 output auditing framework. Rule-based security checks and semantic drift analysis determine whether an LLM response is delivered, redacted, blocked, or escalated for review. 4.5 Continuous Audit Loop The audit loop is not a fourth detection layer but a cross-cutting operational component that collects structured logs from all three layers and from the inference engine itself. Each log entry records… view at source ↗
Figure 8
Figure 8. Figure 8: Complete architecture of the proposed three-layer prompt injection defense framework. temperature = 0 to promote deterministic outputs across repeated evaluations. 2. Llama 3 8B Instruct [2]: deployed locally using the Hugging Face transformers library [32] on a single NVIDIA A100 40 GB GPU. 3. Mistral 7B Instruct [3]: deployed locally using the same Hugging Face transformers-based setup on a single NVIDIA… view at source ↗
Figure 9
Figure 9. Figure 9: Attack Success Rate (ASR) by attacker goal and target model. Results are reported for Instruction Override (IO), Data Exfiltration (DE), and Behavioral Manipulation (BM) attacks under six defense configurations. Lower values indicate stronger resistance to prompt injection attacks. superior individual performance is expected given that it is the only layer with direct visibility into both attack vectors an… view at source ↗
Figure 11
Figure 11. Figure 11: Median per-component latency of the full framework, broken down by model. Each bar segment corresponds to one processing step; segment widths are proportional to latency. The two semantic encoding steps (L1 Semantic Classifier and L3 Semantic Drift) dominate, together accounting for approximately 60 % of total overhead. The red dashed line marks the 100 ms operational budget; all three models remain well … view at source ↗
Figure 10
Figure 10. Figure 10: ROC curves for the Layer 1 semantic anomaly classifier (AUC = 0.942) and the Layer 3 output auditing module (AUC = 0.923). Filled circles mark the selected operating points: Layer 1 at FPR = 0.051, TPR = 0.722 and Layer 3 at FPR = 0.067, TPR = 0.702, consistent with the FPR values reported in [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
read the original abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4\% to 11.3\%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8\% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a three-layer middleware framework for defending RAG-based chatbots against direct and indirect prompt injection. Layer 1 applies rule-based patterns plus a fine-tuned semantic anomaly classifier to user input; Layer 2 enforces a provenance-based instruction hierarchy during context assembly; Layer 3 audits output with a policy engine and semantic drift detector. A continuous audit loop supports retraining. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B reports ASR reduction from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 pp and a published guardrail by 23.8 pp, with 4.8% FPR and 61.2 ms median latency overhead. Ablations claim complementary protection from the three layers.

Significance. If the evaluation methodology and sample distribution are shown to be realistic, the work would supply a practical, model-agnostic deployment path for a top-ranked LLM vulnerability, with quantitative evidence that layered defenses exceed single-stage approaches and incur modest overhead.

major comments (3)
  1. [Evaluation] Evaluation section (and abstract): the central ASR reduction claim (71.4% → 11.3%) rests on an internal collection of 5,080 samples whose construction, attack-prompt distribution, and coverage of realistic indirect/obfuscated injections are not described. Without this information it is impossible to assess whether the reported complementarity of the three layers generalizes beyond the authors' own test distribution.
  2. [Ablation studies] Ablation studies paragraph: all ablations are performed on the same fixed internal set used for the main results; no hold-out, external corpus, or production-traffic validation is reported. This makes the claim that "their combined effect exceeds the sum of individual contributions" dependent on the unverified representativeness of the test attacks.
  3. [Evaluation] Baseline comparison: the manuscript states that the framework outperforms "the best single-layer baseline" and "a published guardrail system" by 27.3 and 23.8 percentage points, yet supplies no information on how those baselines were re-implemented or whether the same attack prompts were used.
minor comments (2)
  1. [Evaluation] The abstract and evaluation section report aggregate metrics without mentioning statistical significance tests or confidence intervals on the ASR differences.
  2. [Layer 1] No description is given of the training data or hyperparameters used for the fine-tuned semantic anomaly classifier in Layer 1.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing evaluation transparency. We address each major comment below and will revise the manuscript to provide the requested details on dataset construction, ablation methodology, and baseline implementations.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (and abstract): the central ASR reduction claim (71.4% → 11.3%) rests on an internal collection of 5,080 samples whose construction, attack-prompt distribution, and coverage of realistic indirect/obfuscated injections are not described. Without this information it is impossible to assess whether the reported complementarity of the three layers generalizes beyond the authors' own test distribution.

    Authors: We agree that additional description of the test set is required. The 5,080 samples were assembled from public prompt-injection repositories, synthetic indirect attacks via poisoned RAG documents, and obfuscated variants drawn from recent literature. In the revised manuscript we will expand Section 4.1 with a breakdown of attack categories and proportions, example prompts, and generation methodology so that readers can evaluate coverage and generalizability. revision: yes

  2. Referee: [Ablation studies] Ablation studies paragraph: all ablations are performed on the same fixed internal set used for the main results; no hold-out, external corpus, or production-traffic validation is reported. This makes the claim that "their combined effect exceeds the sum of individual contributions" dependent on the unverified representativeness of the test attacks.

    Authors: Ablations were run on the full set to isolate each layer's marginal contribution under identical attack conditions. We acknowledge the absence of hold-out or external validation. The revision will add an explicit limitations paragraph explaining this design choice and the consistency observed across three LLMs, while noting plans for future cross-validation studies. revision: partial

  3. Referee: [Evaluation] Baseline comparison: the manuscript states that the framework outperforms "the best single-layer baseline" and "a published guardrail system" by 27.3 and 23.8 percentage points, yet supplies no information on how those baselines were re-implemented or whether the same attack prompts were used.

    Authors: Single-layer baselines were re-implemented from the original publications and adapted to the RAG pipeline; the published guardrail was evaluated on the identical 5,080 prompts. The revision will include a dedicated subsection (or appendix) describing the re-implementation steps, parameter settings, and explicit confirmation that the same attack prompts were used for all comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical metrics on fixed test set with no derivations or self-referential definitions

full rationale

The manuscript presents an empirical security framework evaluated on a fixed collection of 5,080 samples. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the provided text. The central claim (ASR reduction from 71.4% to 11.3%) is a direct measurement on the reported test distribution rather than a quantity defined in terms of itself or obtained by renaming a fitted input. Self-citations are absent from the abstract and evaluation description. The evaluation methodology is therefore self-contained against external benchmarks and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or new postulated entities are described in the abstract; the work is an empirical engineering framework whose assumptions are implicit in the evaluation design.

pith-pipeline@v0.9.1-grok · 5838 in / 1389 out tokens · 22688 ms · 2026-06-26T19:58:19.469252+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 17 canonical work pages · 9 internal anchors

  1. [1]

    GPT-4o system card

    OpenAI, “GPT-4o system card.”https:/ /openai.com/index /gpt-4o-system-card, 2024

  2. [2]

    Llama 3 model card

    Meta AI, “Llama 3 model card.”https:/ /ai.meta.com/blog /meta-llama-3/, 2024

  3. [3]

    Mistral 7B

    A. Q. Jianget al., “Mistral 7B,”arXiv preprint arXiv:2310.06825, 2023

  4. [4]

    Retrieval-augmented generation for knowledge-intensiveNLPtasks,

    P. Lewiset al., “Retrieval-augmented generation for knowledge-intensiveNLPtasks,”inAdvancesinNeural Information Processing Systems (NeurIPS), 2020

  5. [5]

    OWASP top 10 for large language model applications

    OWASP Foundation, “OWASP top 10 for large language model applications.”https:/ /owasp.org/www-pr oject-top-10-for-large-language-model-applications/, 2023

  6. [6]

    Ignore Previous Prompt: Attack Techniques For Language Models

    F. Perez and M. T. Ribeiro, “Ignore previous prompt: Attack techniques for language models.” arXiv preprint arXiv:2211.09527, 2022

  7. [7]

    Prompt injection attacks against GPT-3

    S. Willison, “Prompt injection attacks against GPT-3.” https:/ /simonwillison.net/2022/Sep/12/prompt-injection/, 2022

  8. [8]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    K. Greshakeet al., “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,”arXiv preprint arXiv:2302.12173, 2023

  9. [9]

    Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,

    H. Branchet al., “Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,”arXivpreprintarXiv:2209.02128, 2022

  10. [10]

    Formalizing and Benchmarking Prompt Injection Attacks and Defenses

    Y. Liuet al., “Prompt injection attacks and defenses in LLM-integrated applications,”arXiv preprint arXiv:2310.12815, 2023

  11. [11]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    E. Wallaceet al., “The instruction hierarchy: Training LLMs to prioritise privileged instructions,”arXiv preprint arXiv:2404.13208, 2024

  12. [12]

    NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

    T. Rebedeaet al., “NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails.” arXiv preprint arXiv:2310.10501, 2023

  13. [13]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H.Inanetal.,“LlamaGuard: LLM-basedinput-output safeguard for human-AI conversations.” arXiv preprint arXiv:2312.06674, 2023

  14. [14]

    Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,

    N. Ahmed, M. I. Zaman, G. Saleem, and A. Hassan, “Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,”arXiv preprint arXiv:2510.08624, 2025

  15. [15]

    Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,

    Y. Xu, M. A. Aslam, W. Jun, N. Ahmed, M. I. Zaman, M. Hamza, and S. Aslam, “Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,”IEEE Access, 2025

  16. [16]

    Benchmarking and defending against indirect prompt injection attacks on large language models,

    J. Yiet al., “Benchmarking and defending against indirect prompt injection attacks on large language models,”arXiv preprint arXiv:2312.14197, 2023

  17. [17]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    X. Liuet al., “Jailbreaking ChatGPT via prompt engineering: An empirical study,”arXiv preprint arXiv:2305.13860, 2023

  18. [18]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zouet al., “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

  19. [19]

    Baselinedefensesforadversarialattacks against aligned language models,

    N.Jainetal.,“Baselinedefensesforadversarialattacks against aligned language models,” 2023

  20. [20]

    SmoothLLM: Defending large language models against jailbreaking attacks,

    A. Robeyet al., “SmoothLLM: Defending large language models against jailbreaking attacks,” 2023

  21. [21]

    StruQ: Defending against prompt injection with structured queries,

    S. Chenet al., “StruQ: Defending against prompt injection with structured queries,” 2024

  22. [22]

    Jailbreak and guard aligned language models with only few in-context demonstrations, 2024

    Z. Wuet al., “Defending ChatGPT against jailbreak attack via self-reminder.” arXiv preprint arXiv:2310.06387, 2023

  23. [23]

    PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,

    K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,” 2023

  24. [24]

    IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition

    S.Schulhoffetal.,“IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition.” arXiv preprint arXiv:2311.16119, 2023

  25. [25]

    Shostack,Threat Modeling: Designing for Security

    A. Shostack,Threat Modeling: Designing for Security. Wiley, 2014

  26. [26]

    Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success

    Y.Zhangetal., “Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success.” arXiv preprint arXiv:2307.06865, 2023

  27. [27]

    Poisoning language models during instruction tuning,

    A. Wanet al., “Poisoning language models during instruction tuning,” 2023

  28. [28]

    Abusing images and sounds for indirect instruction injection in multi-modal LLMs,

    E. Bagdasaryan and V. Shmatikov, “Abusing images and sounds for indirect instruction injection in multi-modal LLMs,” 2023

  29. [29]

    Constitutional AI: Harmlessness from AI Feedback

    Anthropic, “Constitutional AI: Harmlessness from AI feedback.” arXiv preprint arXiv:2212.08073, 2023

  30. [30]

    Dense passage retrieval for open-domain question answering,

    V. Karpukhinet al., “Dense passage retrieval for open-domain question answering,” inProceedings of EMNLP, 2020

  31. [31]

    Billion-scale similarity search with GPUs,

    J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, 2019

  32. [32]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    T. Wolfet al., “HuggingFace transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771, 2020

  33. [33]

    Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,

    N. Reimers and I. Gurevych, “Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,” inProceedings of EMNLP, 2019

  34. [34]

    Decoupled weight decay regularisation,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularisation,” inProceedings of ICLR, 2019

  35. [35]

    Llama Guard 2: Safeguarding human-AI conversations

    Meta AI, “Llama Guard 2: Safeguarding human-AI conversations.”https:/ /ai.meta.com/research/publications/ll ama-guard-2/, 2024

  36. [36]

    MS MARCO: A human generated machine reading comprehension dataset,

    T. Nguyenet al., “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of NeurIPS Workshop on Cognitive Computation, 2016

  37. [37]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena,

    L. Zhenget al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  38. [38]

    BERTScore: Evaluatingtextgeneration with BERT,

    T.Zhangetal.,“BERTScore: Evaluatingtextgeneration with BERT,” inProceedings of ICLR, 2020

  39. [39]

    Towards evaluating the robustnessofneuralnetworks,

    N. Carlini and D. Wagner, “Towards evaluating the robustnessofneuralnetworks,”in2017ieeesymposium on security and privacy (sp), pp. 39–57, Ieee, 2017. Data and Code AvailabilityThe datasets, code, and related resources used in the implementation and experiments of this study are available at: GitHub Repository. Funding This study did not receive any speci...