A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Ali Hassan; Gulshan Saleem; Muhammad Imran Zaman; Nisar Ahmed

arxiv: 2606.19660 · v1 · pith:FHN3RHVUnew · submitted 2026-06-17 · 💻 cs.CR · cs.CL

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Gulshan Saleem , Nisar Ahmed , Muhammad Imran Zaman , Ali Hassan This is my paper

Pith reviewed 2026-06-26 19:58 UTC · model grok-4.3

classification 💻 cs.CR cs.CL

keywords prompt injectionRAG chatbotslayered securityLLM defensesemantic anomaly detectionprovenance hierarchyattack success rate

0 comments

The pith

A three-layer framework reduces prompt injection success rate in RAG chatbots from 71.4% to 11.3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a three-layer defense intercepts prompt injection at input screening, context assembly, and output auditing to protect RAG chatbots. Existing isolated defenses leave gaps because input filters miss poisoned documents and output monitors cannot stop payloads from reaching the model. The approach combines rule-based and semantic checks on user input, a provenance hierarchy that keeps retrieved content from overriding policy, and post-generation policy enforcement plus drift detection. Evaluation across 5080 samples on three different models shows the combined layers cut attack success while holding false positives to 4.8% and adding only 61 ms median latency. A continuous logging loop allows the system to adapt without retraining the underlying model.

Core claim

The paper claims that a model-agnostic middleware framework using three complementary layers reduces Attack Success Rate from 71.4% to 11.3% on 5080 samples spanning GPT-4o, Llama 3, and Mistral 7B. Layer 1 applies rule patterns and a fine-tuned semantic anomaly classifier to user input. Layer 2 enforces a provenance-based instruction hierarchy during context assembly so retrieved documents cannot override operator policy. Layer 3 runs a policy rule engine and semantic drift detector on model output before delivery. Ablation results show the three layers together exceed the protection of any single layer or published guardrail system.

What carries the argument

The three-layer interception pipeline that screens input, enforces provenance hierarchy in context assembly, and audits output before delivery.

If this is right

All three layers supply complementary protection whose combined effect exceeds the sum of the parts.
The framework outperforms the best single-layer baseline by 27.3 percentage points and a published guardrail by 23.8 percentage points.
False positive rate stays at 4.8% and median added latency is 61.2 ms across the tested models.
The continuous audit loop aggregates logs to support retraining and adaptation to new attack patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deploying the middleware in existing RAG pipelines would require only configuration of the policy rules and periodic classifier updates rather than model changes.
If semantic drift detection thresholds are set too loosely, the system could miss subtle output manipulations that still achieve injection goals.
The same layered structure could be tested against related threats such as retrieval poisoning or context window overflow attacks.
Long-term maintenance cost depends on how quickly new attack patterns appear relative to the retraining cycle described in the audit loop.

Load-bearing premise

The 5080 evaluation samples and ablation studies accurately represent realistic direct and indirect prompt-injection attacks that would occur in production RAG deployments.

What would settle it

A production RAG chatbot protected by the framework that still experiences an attack success rate above 11.3% on realistic direct or indirect injections would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.19660 by Ali Hassan, Gulshan Saleem, Muhammad Imran Zaman, Nisar Ahmed.

**Figure 1.** Figure 1: Architecture of the target RAG-based chatbot, depicting two attack types marked with warning signs indicating direct injection via the user input channel and indirect injection via the external knowledge base. The architectural property that enables prompt injection is the absence of an instruction-plane and data-plane separation. The LLM processes the system prompt, retrieved documents, and the user messa… view at source ↗

**Figure 2.** Figure 2: Taxonomy of prompt injection attack vectors considered in this work. Greyed attacks are explicitly out of scope (see Section 3.5). reasons. First, it is invisible to defenses that inspect only the user turn. Second, retrieved content is often trusted implicitly by the model because it arrives via the system-controlled retrieval path [8]. Third, a single poisoned document can affect all users whose queries … view at source ↗

**Figure 3.** Figure 3: Mapping of attack vectors and sub-classes to attacker objectives (IO = Instruction Override; DE = Data Exfiltration; BM = Behavioral Manipulation). 3.5 Scope and Exclusions The following threat classes are explicitly out of scope [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: illustrates the position of each layer within the RAG inference pipeline [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Input screening process in Layer 1 combining signature-based detection and semantic anomaly analysis for prompt injection mitigation. mitigates this by enforcing an explicit privilege hierarchy at context assembly time, before the concatenated context is passed to the inference engine. 4.3.1 Instruction Hierarchy We define three privilege tiers based on the provenance of each context segment. 1. Operator T… view at source ↗

**Figure 6.** Figure 6: Layer 2 context assembly with provenance tagging and privilege-aware context integration prior to LLM inference [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Layer 3 output auditing framework. Rule-based security checks and semantic drift analysis determine whether an LLM response is delivered, redacted, blocked, or escalated for review. 4.5 Continuous Audit Loop The audit loop is not a fourth detection layer but a cross-cutting operational component that collects structured logs from all three layers and from the inference engine itself. Each log entry records… view at source ↗

**Figure 8.** Figure 8: Complete architecture of the proposed three-layer prompt injection defense framework. temperature = 0 to promote deterministic outputs across repeated evaluations. 2. Llama 3 8B Instruct [2]: deployed locally using the Hugging Face transformers library [32] on a single NVIDIA A100 40 GB GPU. 3. Mistral 7B Instruct [3]: deployed locally using the same Hugging Face transformers-based setup on a single NVIDIA… view at source ↗

**Figure 9.** Figure 9: Attack Success Rate (ASR) by attacker goal and target model. Results are reported for Instruction Override (IO), Data Exfiltration (DE), and Behavioral Manipulation (BM) attacks under six defense configurations. Lower values indicate stronger resistance to prompt injection attacks. superior individual performance is expected given that it is the only layer with direct visibility into both attack vectors an… view at source ↗

**Figure 11.** Figure 11: Median per-component latency of the full framework, broken down by model. Each bar segment corresponds to one processing step; segment widths are proportional to latency. The two semantic encoding steps (L1 Semantic Classifier and L3 Semantic Drift) dominate, together accounting for approximately 60 % of total overhead. The red dashed line marks the 100 ms operational budget; all three models remain well … view at source ↗

**Figure 10.** Figure 10: ROC curves for the Layer 1 semantic anomaly classifier (AUC = 0.942) and the Layer 3 output auditing module (AUC = 0.923). Filled circles mark the selected operating points: Layer 1 at FPR = 0.051, TPR = 0.722 and Layer 3 at FPR = 0.067, TPR = 0.702, consistent with the FPR values reported in [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4\% to 11.3\%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8\% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The three-layer RAG defense combines familiar pieces in a workable middleware setup, but the evaluation leaves the big ASR drop hard to trust without more on the test samples.

read the letter

The paper's main idea is a three-layer system that screens user input, enforces provenance-based hierarchy so retrieved documents cannot override policy, audits output, and runs a retraining loop. The provenance step and the continuous audit loop are the parts that feel like they could address indirect injection more directly than single-stage tools.

The work does a reasonable job describing how the layers fit together as deployable middleware that stays model-agnostic. The reported numbers show attack success rate falling from 71.4% to 11.3% across GPT-4o, Llama 3, and Mistral 7B, with 4.8% false positives and 61 ms median overhead, and the ablations are presented as evidence that the layers add value beyond any one of them.

The soft spot is the evaluation. The abstract gives no information on how the 5,080 samples were built, what the indirect injection examples actually contained, or whether they include adaptive or obfuscated variants that would appear in real RAG corpora. Because the ablations run on the same internal set, the complementarity claim rests on unexamined ground. The stress-test concern about whether these samples represent production indirect attacks looks like it applies.

This is for teams that run RAG chatbots and need something they can add without rewriting the model. A practitioner would find the architecture description useful to think about. The claims are concrete enough that the paper deserves a serious referee to check the dataset and attack construction details in the full text.

Referee Report

3 major / 2 minor

Summary. The paper proposes a three-layer middleware framework for defending RAG-based chatbots against direct and indirect prompt injection. Layer 1 applies rule-based patterns plus a fine-tuned semantic anomaly classifier to user input; Layer 2 enforces a provenance-based instruction hierarchy during context assembly; Layer 3 audits output with a policy engine and semantic drift detector. A continuous audit loop supports retraining. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B reports ASR reduction from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 pp and a published guardrail by 23.8 pp, with 4.8% FPR and 61.2 ms median latency overhead. Ablations claim complementary protection from the three layers.

Significance. If the evaluation methodology and sample distribution are shown to be realistic, the work would supply a practical, model-agnostic deployment path for a top-ranked LLM vulnerability, with quantitative evidence that layered defenses exceed single-stage approaches and incur modest overhead.

major comments (3)

[Evaluation] Evaluation section (and abstract): the central ASR reduction claim (71.4% → 11.3%) rests on an internal collection of 5,080 samples whose construction, attack-prompt distribution, and coverage of realistic indirect/obfuscated injections are not described. Without this information it is impossible to assess whether the reported complementarity of the three layers generalizes beyond the authors' own test distribution.
[Ablation studies] Ablation studies paragraph: all ablations are performed on the same fixed internal set used for the main results; no hold-out, external corpus, or production-traffic validation is reported. This makes the claim that "their combined effect exceeds the sum of individual contributions" dependent on the unverified representativeness of the test attacks.
[Evaluation] Baseline comparison: the manuscript states that the framework outperforms "the best single-layer baseline" and "a published guardrail system" by 27.3 and 23.8 percentage points, yet supplies no information on how those baselines were re-implemented or whether the same attack prompts were used.

minor comments (2)

[Evaluation] The abstract and evaluation section report aggregate metrics without mentioning statistical significance tests or confidence intervals on the ASR differences.
[Layer 1] No description is given of the training data or hyperparameters used for the fine-tuned semantic anomaly classifier in Layer 1.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing evaluation transparency. We address each major comment below and will revise the manuscript to provide the requested details on dataset construction, ablation methodology, and baseline implementations.

read point-by-point responses

Referee: [Evaluation] Evaluation section (and abstract): the central ASR reduction claim (71.4% → 11.3%) rests on an internal collection of 5,080 samples whose construction, attack-prompt distribution, and coverage of realistic indirect/obfuscated injections are not described. Without this information it is impossible to assess whether the reported complementarity of the three layers generalizes beyond the authors' own test distribution.

Authors: We agree that additional description of the test set is required. The 5,080 samples were assembled from public prompt-injection repositories, synthetic indirect attacks via poisoned RAG documents, and obfuscated variants drawn from recent literature. In the revised manuscript we will expand Section 4.1 with a breakdown of attack categories and proportions, example prompts, and generation methodology so that readers can evaluate coverage and generalizability. revision: yes
Referee: [Ablation studies] Ablation studies paragraph: all ablations are performed on the same fixed internal set used for the main results; no hold-out, external corpus, or production-traffic validation is reported. This makes the claim that "their combined effect exceeds the sum of individual contributions" dependent on the unverified representativeness of the test attacks.

Authors: Ablations were run on the full set to isolate each layer's marginal contribution under identical attack conditions. We acknowledge the absence of hold-out or external validation. The revision will add an explicit limitations paragraph explaining this design choice and the consistency observed across three LLMs, while noting plans for future cross-validation studies. revision: partial
Referee: [Evaluation] Baseline comparison: the manuscript states that the framework outperforms "the best single-layer baseline" and "a published guardrail system" by 27.3 and 23.8 percentage points, yet supplies no information on how those baselines were re-implemented or whether the same attack prompts were used.

Authors: Single-layer baselines were re-implemented from the original publications and adapted to the RAG pipeline; the published guardrail was evaluated on the identical 5,080 prompts. The revision will include a dedicated subsection (or appendix) describing the re-implementation steps, parameter settings, and explicit confirmation that the same attack prompts were used for all comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical metrics on fixed test set with no derivations or self-referential definitions

full rationale

The manuscript presents an empirical security framework evaluated on a fixed collection of 5,080 samples. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the provided text. The central claim (ASR reduction from 71.4% to 11.3%) is a direct measurement on the reported test distribution rather than a quantity defined in terms of itself or obtained by renaming a fitted input. Self-citations are absent from the abstract and evaluation description. The evaluation methodology is therefore self-contained against external benchmarks and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or new postulated entities are described in the abstract; the work is an empirical engineering framework whose assumptions are implicit in the evaluation design.

pith-pipeline@v0.9.1-grok · 5838 in / 1389 out tokens · 22688 ms · 2026-06-26T19:58:19.469252+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 17 canonical work pages · 9 internal anchors

[1]

GPT-4o system card

OpenAI, “GPT-4o system card.”https:/ /openai.com/index /gpt-4o-system-card, 2024

2024
[2]

Llama 3 model card

Meta AI, “Llama 3 model card.”https:/ /ai.meta.com/blog /meta-llama-3/, 2024

2024
[3]

Mistral 7B

A. Q. Jianget al., “Mistral 7B,”arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Retrieval-augmented generation for knowledge-intensiveNLPtasks,

P. Lewiset al., “Retrieval-augmented generation for knowledge-intensiveNLPtasks,”inAdvancesinNeural Information Processing Systems (NeurIPS), 2020

2020
[5]

OWASP top 10 for large language model applications

OWASP Foundation, “OWASP top 10 for large language model applications.”https:/ /owasp.org/www-pr oject-top-10-for-large-language-model-applications/, 2023

2023
[6]

Ignore Previous Prompt: Attack Techniques For Language Models

F. Perez and M. T. Ribeiro, “Ignore previous prompt: Attack techniques for language models.” arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Prompt injection attacks against GPT-3

S. Willison, “Prompt injection attacks against GPT-3.” https:/ /simonwillison.net/2022/Sep/12/prompt-injection/, 2022

2022
[8]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

K. Greshakeet al., “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,”arXiv preprint arXiv:2302.12173, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,

H. Branchet al., “Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,”arXivpreprintarXiv:2209.02128, 2022

work page arXiv 2022
[10]

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Y. Liuet al., “Prompt injection attacks and defenses in LLM-integrated applications,”arXiv preprint arXiv:2310.12815, 2023

work page arXiv 2023
[11]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

E. Wallaceet al., “The instruction hierarchy: Training LLMs to prioritise privileged instructions,”arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

T. Rebedeaet al., “NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails.” arXiv preprint arXiv:2310.10501, 2023

work page arXiv 2023
[13]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H.Inanetal.,“LlamaGuard: LLM-basedinput-output safeguard for human-AI conversations.” arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,

N. Ahmed, M. I. Zaman, G. Saleem, and A. Hassan, “Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,”arXiv preprint arXiv:2510.08624, 2025

work page arXiv 2025
[15]

Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,

Y. Xu, M. A. Aslam, W. Jun, N. Ahmed, M. I. Zaman, M. Hamza, and S. Aslam, “Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,”IEEE Access, 2025

2025
[16]

Benchmarking and defending against indirect prompt injection attacks on large language models,

J. Yiet al., “Benchmarking and defending against indirect prompt injection attacks on large language models,”arXiv preprint arXiv:2312.14197, 2023

work page arXiv 2023
[17]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

X. Liuet al., “Jailbreaking ChatGPT via prompt engineering: An empirical study,”arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zouet al., “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Baselinedefensesforadversarialattacks against aligned language models,

N.Jainetal.,“Baselinedefensesforadversarialattacks against aligned language models,” 2023

2023
[20]

SmoothLLM: Defending large language models against jailbreaking attacks,

A. Robeyet al., “SmoothLLM: Defending large language models against jailbreaking attacks,” 2023

2023
[21]

StruQ: Defending against prompt injection with structured queries,

S. Chenet al., “StruQ: Defending against prompt injection with structured queries,” 2024

2024
[22]

Jailbreak and guard aligned language models with only few in-context demonstrations, 2024

Z. Wuet al., “Defending ChatGPT against jailbreak attack via self-reminder.” arXiv preprint arXiv:2310.06387, 2023

work page arXiv 2023
[23]

PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,

K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,” 2023

2023
[24]

IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition

S.Schulhoffetal.,“IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition.” arXiv preprint arXiv:2311.16119, 2023

work page arXiv 2023
[25]

Shostack,Threat Modeling: Designing for Security

A. Shostack,Threat Modeling: Designing for Security. Wiley, 2014

2014
[26]

Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success

Y.Zhangetal., “Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success.” arXiv preprint arXiv:2307.06865, 2023

work page arXiv 2023
[27]

Poisoning language models during instruction tuning,

A. Wanet al., “Poisoning language models during instruction tuning,” 2023

2023
[28]

Abusing images and sounds for indirect instruction injection in multi-modal LLMs,

E. Bagdasaryan and V. Shmatikov, “Abusing images and sounds for indirect instruction injection in multi-modal LLMs,” 2023

2023
[29]

Constitutional AI: Harmlessness from AI Feedback

Anthropic, “Constitutional AI: Harmlessness from AI feedback.” arXiv preprint arXiv:2212.08073, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Dense passage retrieval for open-domain question answering,

V. Karpukhinet al., “Dense passage retrieval for open-domain question answering,” inProceedings of EMNLP, 2020

2020
[31]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, 2019

2019
[32]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

T. Wolfet al., “HuggingFace transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1910
[33]

Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,

N. Reimers and I. Gurevych, “Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,” inProceedings of EMNLP, 2019

2019
[34]

Decoupled weight decay regularisation,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularisation,” inProceedings of ICLR, 2019

2019
[35]

Llama Guard 2: Safeguarding human-AI conversations

Meta AI, “Llama Guard 2: Safeguarding human-AI conversations.”https:/ /ai.meta.com/research/publications/ll ama-guard-2/, 2024

2024
[36]

MS MARCO: A human generated machine reading comprehension dataset,

T. Nguyenet al., “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of NeurIPS Workshop on Cognitive Computation, 2016

2016
[37]

Judging LLM-as-a-judge with MT-Bench and chatbot arena,

L. Zhenget al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[38]

BERTScore: Evaluatingtextgeneration with BERT,

T.Zhangetal.,“BERTScore: Evaluatingtextgeneration with BERT,” inProceedings of ICLR, 2020

2020
[39]

Towards evaluating the robustnessofneuralnetworks,

N. Carlini and D. Wagner, “Towards evaluating the robustnessofneuralnetworks,”in2017ieeesymposium on security and privacy (sp), pp. 39–57, Ieee, 2017. Data and Code AvailabilityThe datasets, code, and related resources used in the implementation and experiments of this study are available at: GitHub Repository. Funding This study did not receive any speci...

2017

[1] [1]

GPT-4o system card

OpenAI, “GPT-4o system card.”https:/ /openai.com/index /gpt-4o-system-card, 2024

2024

[2] [2]

Llama 3 model card

Meta AI, “Llama 3 model card.”https:/ /ai.meta.com/blog /meta-llama-3/, 2024

2024

[3] [3]

Mistral 7B

A. Q. Jianget al., “Mistral 7B,”arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Retrieval-augmented generation for knowledge-intensiveNLPtasks,

P. Lewiset al., “Retrieval-augmented generation for knowledge-intensiveNLPtasks,”inAdvancesinNeural Information Processing Systems (NeurIPS), 2020

2020

[5] [5]

OWASP top 10 for large language model applications

OWASP Foundation, “OWASP top 10 for large language model applications.”https:/ /owasp.org/www-pr oject-top-10-for-large-language-model-applications/, 2023

2023

[6] [6]

Ignore Previous Prompt: Attack Techniques For Language Models

F. Perez and M. T. Ribeiro, “Ignore previous prompt: Attack techniques for language models.” arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

Prompt injection attacks against GPT-3

S. Willison, “Prompt injection attacks against GPT-3.” https:/ /simonwillison.net/2022/Sep/12/prompt-injection/, 2022

2022

[8] [8]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

K. Greshakeet al., “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,”arXiv preprint arXiv:2302.12173, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,

H. Branchet al., “Evaluating the susceptibility of pre-trained language models via handcrafted adversarialexamples,”arXivpreprintarXiv:2209.02128, 2022

work page arXiv 2022

[10] [10]

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Y. Liuet al., “Prompt injection attacks and defenses in LLM-integrated applications,”arXiv preprint arXiv:2310.12815, 2023

work page arXiv 2023

[11] [11]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

E. Wallaceet al., “The instruction hierarchy: Training LLMs to prioritise privileged instructions,”arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

T. Rebedeaet al., “NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails.” arXiv preprint arXiv:2310.10501, 2023

work page arXiv 2023

[13] [13]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H.Inanetal.,“LlamaGuard: LLM-basedinput-output safeguard for human-AI conversations.” arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,

N. Ahmed, M. I. Zaman, G. Saleem, and A. Hassan, “Do llms know they are being tested? evaluation awareness and incentive-sensitive failures in gpt-oss-20b,”arXiv preprint arXiv:2510.08624, 2025

work page arXiv 2025

[15] [15]

Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,

Y. Xu, M. A. Aslam, W. Jun, N. Ahmed, M. I. Zaman, M. Hamza, and S. Aslam, “Improving arabicmulti-label emotion classificationusing stacked embeddings and hybrid loss function,”IEEE Access, 2025

2025

[16] [16]

Benchmarking and defending against indirect prompt injection attacks on large language models,

J. Yiet al., “Benchmarking and defending against indirect prompt injection attacks on large language models,”arXiv preprint arXiv:2312.14197, 2023

work page arXiv 2023

[17] [17]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

X. Liuet al., “Jailbreaking ChatGPT via prompt engineering: An empirical study,”arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zouet al., “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Baselinedefensesforadversarialattacks against aligned language models,

N.Jainetal.,“Baselinedefensesforadversarialattacks against aligned language models,” 2023

2023

[20] [20]

SmoothLLM: Defending large language models against jailbreaking attacks,

A. Robeyet al., “SmoothLLM: Defending large language models against jailbreaking attacks,” 2023

2023

[21] [21]

StruQ: Defending against prompt injection with structured queries,

S. Chenet al., “StruQ: Defending against prompt injection with structured queries,” 2024

2024

[22] [22]

Jailbreak and guard aligned language models with only few in-context demonstrations, 2024

Z. Wuet al., “Defending ChatGPT against jailbreak attack via self-reminder.” arXiv preprint arXiv:2310.06387, 2023

work page arXiv 2023

[23] [23]

PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,

K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,” 2023

2023

[24] [24]

IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition

S.Schulhoffetal.,“IgnorethistitleandHackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition.” arXiv preprint arXiv:2311.16119, 2023

work page arXiv 2023

[25] [25]

Shostack,Threat Modeling: Designing for Security

A. Shostack,Threat Modeling: Designing for Security. Wiley, 2014

2014

[26] [26]

Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success

Y.Zhangetal., “Promptsshouldnotbeseenassecrets: Systematically measuring prompt extraction attack success.” arXiv preprint arXiv:2307.06865, 2023

work page arXiv 2023

[27] [27]

Poisoning language models during instruction tuning,

A. Wanet al., “Poisoning language models during instruction tuning,” 2023

2023

[28] [28]

Abusing images and sounds for indirect instruction injection in multi-modal LLMs,

E. Bagdasaryan and V. Shmatikov, “Abusing images and sounds for indirect instruction injection in multi-modal LLMs,” 2023

2023

[29] [29]

Constitutional AI: Harmlessness from AI Feedback

Anthropic, “Constitutional AI: Harmlessness from AI feedback.” arXiv preprint arXiv:2212.08073, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Dense passage retrieval for open-domain question answering,

V. Karpukhinet al., “Dense passage retrieval for open-domain question answering,” inProceedings of EMNLP, 2020

2020

[31] [31]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, 2019

2019

[32] [32]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

T. Wolfet al., “HuggingFace transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1910

[33] [33]

Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,

N. Reimers and I. Gurevych, “Sentence-BERT: 19 ICCK T ransactions on Information Security and Cryptography Sentence embeddings using siamese BERT-networks,” inProceedings of EMNLP, 2019

2019

[34] [34]

Decoupled weight decay regularisation,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularisation,” inProceedings of ICLR, 2019

2019

[35] [35]

Llama Guard 2: Safeguarding human-AI conversations

Meta AI, “Llama Guard 2: Safeguarding human-AI conversations.”https:/ /ai.meta.com/research/publications/ll ama-guard-2/, 2024

2024

[36] [36]

MS MARCO: A human generated machine reading comprehension dataset,

T. Nguyenet al., “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of NeurIPS Workshop on Cognitive Computation, 2016

2016

[37] [37]

Judging LLM-as-a-judge with MT-Bench and chatbot arena,

L. Zhenget al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[38] [38]

BERTScore: Evaluatingtextgeneration with BERT,

T.Zhangetal.,“BERTScore: Evaluatingtextgeneration with BERT,” inProceedings of ICLR, 2020

2020

[39] [39]

Towards evaluating the robustnessofneuralnetworks,

N. Carlini and D. Wagner, “Towards evaluating the robustnessofneuralnetworks,”in2017ieeesymposium on security and privacy (sp), pp. 39–57, Ieee, 2017. Data and Code AvailabilityThe datasets, code, and related resources used in the implementation and experiments of this study are available at: GitHub Repository. Funding This study did not receive any speci...

2017