Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

Amin Nikanjam; Arghavan Moradi Dakhel; Foutse Khomh; Kawser Wazed Nafi; Saeid Jamshidi

arxiv: 2606.10322 · v1 · pith:H2HLLXR6new · submitted 2026-06-09 · 💻 cs.CR · cs.MA

Saeid Jamshidi, Amin Nikanjam, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3

keywords: LLM security, prompt injection, context poisoning, multi-agent control, game-theoretic protocol, contextual reasoning, adversarial robustness, self-healing systems

The pith

GT-MCP coordinates three LLM agents with a trust function to bound contextual drift in 99.6 percent of turns under adaptive adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that treating context management in LLMs as a closed-loop dynamical process can prevent gradual distortion from prompt-injection and context-poisoning attacks. It does so by coordinating three heterogeneous agents whose outputs pass through a trust function that checks causal consistency against a validated context graph, semantic agreement, and distributional drift, then rolls back when instability appears. A sympathetic reader would care because existing output filters ignore how context evolves across turns, leaving long-horizon reasoning exposed even when individual responses look plausible. The reported results show drift bounded in nearly all turns with no controller-level injections succeeding.

Core claim

GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating.
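
The select-then-rollback loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's controller: the `Candidate` fields, the equal-weight `trust` score, the `threshold` value, and the `Controller` class are all hypothetical, since the abstract gives no equations or thresholds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    agent: str
    text: str
    causal: float     # assumed: consistency with the validated context, in [0, 1]
    agreement: float  # assumed: semantic agreement with the other agents, in [0, 1]
    drift: float      # assumed: distributional drift, in [0, 1]; lower is better


def trust(c: Candidate) -> float:
    """Hypothetical trust score: equal-weight mean of the three signals."""
    return (c.causal + c.agreement + (1.0 - c.drift)) / 3.0


@dataclass
class Controller:
    threshold: float = 0.5  # assumed instability threshold
    validated: List[str] = field(default_factory=list)
    rollbacks: int = 0

    def step(self, candidates: List[Candidate]) -> Optional[str]:
        if not candidates:
            raise ValueError("need at least one candidate output")
        checkpoint = list(self.validated)   # snapshot of the validated context
        best = max(candidates, key=trust)   # select the most trusted output
        self.validated.append(best.text)    # tentative commit
        if trust(best) < self.threshold:
            self.validated = checkpoint     # rollback: restore the snapshot
            self.rollbacks += 1
            return None
        return best.text
```

In this sketch a turn either extends the validated context with the most trusted output or leaves it exactly as it was, which is the property the core claim relies on. How the three input signals are actually computed is the part the abstract leaves open.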

What carries the argument

The trust function that combines causal consistency, semantic agreement, and distributional drift checks to select agent outputs and trigger rollbacks.

If this is right

Contextual drift remains bounded in 99.6 percent of interaction turns.
Recovery via rollback is required in only 0.4 percent of turns.
Per-turn utility stays tightly concentrated with severe degradation in only 0.4 percent of cases.
No injection attempt succeeds at the controller level.
Selected outputs maintain stable win rates above 98 percent.
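
Because the abstract reports these percentages over 500 interaction turns, they translate into small absolute counts. The helper below is only illustrative arithmetic; `turns_for` is a hypothetical name.

```python
TURNS = 500  # evaluation length reported in the abstract


def turns_for(percent: float, total: int = TURNS) -> int:
    """Convert a reported percentage into a whole number of turns."""
    return round(total * percent / 100)


bounded = turns_for(99.6)       # 498 turns with bounded drift
recovered = turns_for(0.4)      # 2 turns that needed rollback
win_floor = turns_for(98.0)     # win rate above 98% means more than 490 turns

print(bounded, recovered, win_floor)
```

The 0.4 percent figures therefore correspond to two turns out of 500, which is worth keeping in mind when reading the tail statistics.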

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same closed-loop controller structure could be tested on sequential reasoning tasks outside language models, such as tool-use chains or planning agents.
The validated context graph would need an independent verification method if the initial context itself contains errors.
The low reported latency overhead suggests the method could run continuously without changing user-perceived response times.

Load-bearing premise

The trust function jointly evaluating causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time accurately detects instability and enables effective rollback without allowing adversarial fragments to propagate.

What would settle it

An adaptive adversary that causes contextual drift to exceed bounds in more than 1 percent of turns or succeeds in propagating an injection at the controller level would falsify the central claim.
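
The falsification criterion above can be written as a simple check over a logged run. This is a sketch under stated assumptions: the per-turn drift values, the drift `bound`, and the count of successful controller-level injections are hypothetical inputs that an evaluator would have to supply, and the 1 percent limit is the one stated in the paragraph above.

```python
from typing import Sequence


def falsifies_claim(
    drifts: Sequence[float],
    bound: float,
    injections_succeeded: int,
    max_violation_rate: float = 0.01,
) -> bool:
    """Return True if a run contradicts the central claim as stated above."""
    if not drifts:
        raise ValueError("need at least one turn")
    violations = sum(1 for d in drifts if d > bound)
    rate = violations / len(drifts)
    # Either condition alone is enough to falsify the claim.
    return rate > max_violation_rate or injections_succeeded > 0
```

For example, a 500-turn run with two out-of-bound turns and no successful injection does not falsify the claim, while six out-of-bound turns (1.2 percent) or a single controller-level injection would.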

Figures

Figures reproduced from arXiv: 2606.10322 by Amin Nikanjam, Arghavan Moradi Dakhel, Foutse Khomh, Kawser Wazed Nafi, Saeid Jamshidi.

**Figure 1.** Layered GT-MCP architecture for closed-loop context stabilization. The controller separates validated context from untrusted…

**Figure 2.** Distribution of selected-output contextual drift across…

**Figure 4.** Utility distribution of selected outputs by agent.

**Figure 5.** Trust-score distributions by agent. Differences in dis…

**Figure 6.** Normalized multi-objective profile of the three LLM…

**Figure 7.** Distribution of per-turn utility across the 500-turn GT…

**Figure 8.** Latency–utility relationship across 500 interaction turns.

**Figure 9.** Token usage versus median utility.

Text adjacent to Figure 9 in the source: …latency values. Table XVIII summarizes latency-normalized indicators. Latency per token remains tightly concentrated, with a mean of 0.00163 s and a standard deviation of 0.00021 s. Utility per second remains close to zero, consistent with the earlier bounded utility distribution. Stability per second remains positive, suggesting that additional computation is primarily d…

**Figure 10.** Empirical reasoning space defined by causal consistency…

**Figure 11.** Manifold alignment score versus utility.

Original abstract

Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long-horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method that treats context management as a closed-loop dynamical process. GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per-turn utility remains tightly concentrated, with median = -0.19, P05 = -0.72, and P95 = 0.30; severe degradation below -1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e-3 s.
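
The summary statistics quoted in the abstract (median, P05, P95, share of turns below -1) are standard order statistics. The sketch below shows one way to compute them from a list of per-turn utilities using only the Python standard library. The function names are hypothetical, the percentile interpolation method is an assumption (the paper does not say which one it uses), and the latency estimate assumes cost scales linearly with token count.

```python
import statistics
from typing import Dict, Sequence

LATENCY_PER_TOKEN_S = 1.63e-3  # value reported in the abstract


def summarize(utilities: Sequence[float]) -> Dict[str, float]:
    """Median, 5th/95th percentiles, and share of turns with utility below -1."""
    # quantiles(n=100) returns 99 cut points; index 4 is P05, index 94 is P95.
    cuts = statistics.quantiles(utilities, n=100, method="inclusive")
    return {
        "median": statistics.median(utilities),
        "p05": cuts[4],
        "p95": cuts[94],
        "severe_fraction": sum(1 for u in utilities if u < -1) / len(utilities),
    }


def estimated_latency(tokens: int) -> float:
    """Rough latency in seconds, assuming a constant per-token cost."""
    return tokens * LATENCY_PER_TOKEN_S
```

Under the linear assumption, a 1,000-token response would cost roughly 1.63 s. Whether the real system behaves linearly is not something the abstract establishes.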

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GT-MCP describes a plausible multi-agent controller for LLM context stability, but the abstract gives no equations, thresholds, or setup details, leaving the 99.6% claim unverifiable.

The paper puts forward GT-MCP as a closed-loop controller that runs three LLM agents, builds a context graph, and uses a trust function to decide when to roll back. The headline result is that drift stays bounded in 99.6% of 500 turns against an adaptive adversary, with no controller-level injections succeeding.

What is actually new is the combination of game-theoretic selection, the joint trust check on causal consistency, semantic agreement, and distributional drift, plus the explicit rollback step. That moves past the passive MCP routing layer the authors mention. The problem framing is clear and the reported utility and latency numbers are at least internally consistent.

The soft spot is that none of the core machinery is shown. There are no equations for the trust function, no description of how the context graph is built or validated, no thresholds, and no account of how the adaptive attacks were implemented. The 99.6% figure therefore rests entirely on an unexamined component. Without those details or any ablation, the numbers function as an existence claim rather than evidence.

This is the sort of paper that might interest engineers working on production conversational systems, but only after the methods section appears. Most readers looking for reproducible techniques or formal grounding will find it too thin. I would not bring it to a reading group in its current form.

I would not send it to peer review until the trust function and experimental protocol are written out in enough detail to be checked.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method coordinating three heterogeneous LLM agents for context management in multi-turn interactions. It uses a trust function jointly evaluating causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time; instability triggers rollback self-healing. The central empirical claim is that over 500 interaction turns under an adaptive adversarial threat model, contextual drift remains bounded in 99.6% of turns (recovery in 0.4%), per-turn utility is stable (median -0.19, P05 -0.72, P95 0.30), severe degradation occurs in only 0.4% of cases, selected outputs maintain >98% win rates, and no injection succeeds at the controller level.

Significance. If the trust function and reported results hold with full verification, the work could provide a meaningful advance in robust LLM context handling by treating context evolution as a closed-loop dynamical process with active self-healing, going beyond passive filtering in existing protocols like MCP. The multi-agent coordination under adversarial conditions would be of interest for secure multi-turn LLM applications.

major comments (2)

[Abstract] Abstract (trust function description): The trust function is the load-bearing component for the self-healing claim and all reported performance numbers (99.6% bounded drift, 0% controller-level injections), yet it is described only qualitatively with no equations, thresholds, context-graph construction procedure, causal-consistency metric, semantic-agreement measure, or distributional-drift definition supplied. An adaptive adversary could in principle craft fragments satisfying the three criteria while still distorting downstream reasoning; nothing in the reported numbers rules this out.
[Abstract] Abstract (empirical evaluation): The abstract reports quantitative results over 500 turns (bounded drift 99.6%, utility percentiles, win rates >98%) but supplies no experimental setup, threat-model implementation details, baselines, agent role definitions, context-graph validation method, or statistical tests. This makes it impossible to verify whether the data support the stated performance.
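
To make the missing-statistics point concrete, the sketch below computes a Wilson score interval for the headline proportion (498 of 500 turns, i.e. 99.6%). This is an illustration added here, not an analysis from the paper. It assumes turns are independent trials, which is doubtful for consecutive turns of a single interaction, so the interval is best read as an optimistic lower bound on the uncertainty.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(498, 500)
print(f"{low:.4f} .. {high:.4f}")
```

Even under the independence assumption the lower end of the interval sits noticeably below 99.6%, because only two failure events were observed.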

minor comments (1)

[Abstract] The title and abstract invoke 'game-theoretic' control, but no game formulation, payoff structure, or equilibrium analysis appears; this could be clarified if the multi-agent coordination is intended to rest on such concepts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that additional detail is needed for verifiability and will revise the manuscript to address both points.

Referee: [Abstract] Abstract (trust function description): The trust function is the load-bearing component for the self-healing claim and all reported performance numbers (99.6% bounded drift, 0% controller-level injections), yet it is described only qualitatively with no equations, thresholds, context-graph construction procedure, causal-consistency metric, semantic-agreement measure, or distributional-drift definition supplied. An adaptive adversary could in principle craft fragments satisfying the three criteria while still distorting downstream reasoning; nothing in the reported numbers rules this out.

Authors: We agree that the abstract describes the trust function only qualitatively. In the revision we will incorporate concise definitions and key equations for the trust function (including context-graph construction via causal inference on conversation history, causal-consistency metric via graph-edit distance, semantic-agreement measure via embedding cosine similarity, and distributional-drift via KL divergence on token distributions), along with the chosen thresholds. We will also add a brief discussion of how the joint evaluation is intended to counter adaptive adversaries that attempt to satisfy the criteria while distorting reasoning, supported by the reported results under the adaptive threat model. revision: yes
Referee: [Abstract] Abstract (empirical evaluation): The abstract reports quantitative results over 500 turns (bounded drift 99.6%, utility percentiles, win rates >98%) but supplies no experimental setup, threat-model implementation details, baselines, agent role definitions, context-graph validation method, or statistical tests. This makes it impossible to verify whether the data support the stated performance.

Authors: We agree that the abstract omits the experimental details. In the revision we will add a concise summary of the experimental setup to the abstract (adaptive adversarial threat model, three agent roles, context-graph validation procedure, baselines, and statistical tests with bootstrap intervals) or provide explicit cross-references to the methods section so that the performance claims can be verified. revision: yes
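
Two of the measures named in the first simulated response, embedding cosine similarity and KL divergence between token distributions, are standard and can be sketched directly. The code below is a generic illustration rather than the paper's definitions: the inputs are plain lists of floats, the `eps` floor is an assumption to avoid division by zero, and the graph-edit-distance component is omitted because the text gives no graph representation to base it on.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps (assumption).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

Identical distributions give a divergence of zero and orthogonal embeddings give a similarity of zero. Any thresholds applied to these values would have to come from the revised manuscript.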

Circularity Check

0 steps flagged

No circularity; empirical measurements only

The paper reports direct empirical outcomes (99.6% bounded drift over 500 turns, 0% controller-level injections) from running the described GT-MCP controller under an adaptive adversary. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The trust function is introduced as a joint evaluator of three criteria but is not defined in terms of the reported metrics, nor are any results shown to be constructed from it by algebraic identity. The evaluation is therefore an external measurement rather than a self-referential renaming or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or background assumptions, so the ledger cannot be populated with concrete entries.

pith-pipeline@v0.9.1-grok · 5842 in / 928 out tokens · 23070 ms · 2026-06-27T13:03:47.871254+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 5 linked inside Pith

[1]

A comprehensive overview of large language models,

H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 16, no. 5, pp. 1–72, 2025

2025
[2]

When large language models meet personalization: Perspectives of challenges and opportunities,

J. Chen, Z. Liu, X. Huang, C. Wu, Q. Liu, G. Jiang, Y. Pu, Y. Lei, X. Chen, X. Wang et al., “When large language models meet personalization: Perspectives of challenges and opportunities,” World Wide Web, vol. 27, no. 4, p. 42, 2024

2024
[3]

From system 1 to system 2: A survey of reasoning large language models,

Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chen et al., “From system 1 to system 2: A survey of reasoning large language models,” arXiv preprint arXiv:2502.17419, 2025

Pith/arXiv arXiv 2025
[4]

Llm applications: Current paradigms and the next frontier,

X. Hou, Y. Zhao, and H. Wang, “Llm applications: Current paradigms and the next frontier,” arXiv preprint arXiv:2503.04596, 2025

arXiv 2025
[5]

A survey on model context protocol: Architecture, state-of-the-art, challenges and future directions,

P. P. Ray, “A survey on model context protocol: Architecture, state-of-the-art, challenges and future directions,” Authorea Preprints, 2025

2025
[6]

Beyond vulnerabilities: A comprehensive survey of adversarial attacks across domains

D. C. Asimopoulos, P. Radoglou-Grammatikis, G. T. Papadopoulos, and P. Sarigiannidis, “Beyond vulnerabilities: A comprehensive survey of adversarial attacks across domains.”
[7]

Formalizing and benchmarking prompt injection attacks and defenses,

Y. Liu, Y. Jia, R. Geng, J. Jia, and N. Z. Gong, “Formalizing and benchmarking prompt injection attacks and defenses,” in USENIX Security Symposium, 2024

2024
[8]

Prompt injection attacks on large language models: A survey of attack methods and defense strategies,

T. Geng, Z. Xu, and Y. Qu, “Prompt injection attacks on large language models: A survey of attack methods and defense strategies,” Computers, Materials & Continua, vol. 87, no. 1, pp. 1–10, 2025

2025
[9]

From prompt injections to protocol exploits: Threats in retrieval-augmented and llm-agent systems,

M. A. Ferrag et al., “From prompt injections to protocol exploits: Threats in retrieval-augmented and llm-agent systems,” arXiv preprint arXiv:2502.XXXX, 2025

2025
[10]

Agentlab: Benchmarking llm agents against long-horizon attacks,

T. Jiang, Y. Wang, J. Liang, and T. Wang, “Agentlab: Benchmarking llm agents against long-horizon attacks,” Technical Report, 2025

2025
[11]

Struq: Defending against prompt injection with structured queries,

S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “Struq: Defending against prompt injection with structured queries,” in USENIX Security Symposium, 2025

2025
[12]

Model context protocol (mcp): Landscape, security threats, and future research directions,

X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol (mcp): Landscape, security threats, and future research directions,” ACM Transactions on Software Engineering and Methodology, 2025

2025
[13]

Prompt injection attacks in large language models and their defense mechanisms,

S. Sasal et al., “Prompt injection attacks in large language models and their defense mechanisms,” in Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), 2025

2025
[14]

Universal and transferable adversarial attacks on aligned language models,

A. Zou et al., “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023. [Online]. Available: https://arxiv.org/abs/2307.15043

Pith/arXiv arXiv 2023
[15]

Benchmarking and defending against indirect prompt injection attacks on large language models,

Y. Liu et al., “Benchmarking and defending against indirect prompt injection attacks on large language models,” arXiv preprint arXiv:2312.14197, 2023. [Online]. Available: https://arxiv.org/abs/2312.14197

arXiv 2023
[16]

Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models,

W. Shi et al., “Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models,” arXiv preprint arXiv:2402.07867, 2024. [Online]. Available: https://arxiv.org/abs/2402.07867

arXiv 2024
[17]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents,

F. Debenedetti et al., “Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents,” arXiv preprint arXiv:2406.13352, 2024. [Online]. Available: https://arxiv.org/abs/2406.13352

Pith/arXiv arXiv 2024
[18]

Targeting the core: Understanding attacks on retrieval-augmented agents,

X. Liu et al., “Targeting the core: Understanding attacks on retrieval-augmented agents,” arXiv preprint arXiv:2412.04415, 2024. [Online]. Available: https://arxiv.org/abs/2412.04415

arXiv 2024
[20]

Defeating prompt injections by design,

S. Williams et al., “Defeating prompt injections by design,” arXiv preprint arXiv:2503.18813, 2025. [Online]. Available: https://arxiv.org/abs/2503.18813

Pith/arXiv arXiv 2025
[21]

Defending against indirect prompt injection by instruction isolation,

J. Schutera et al., “Defending against indirect prompt injection by instruction isolation,” arXiv preprint arXiv:2505.06311, 2025. [Online]. Available: https://arxiv.org/abs/2505.06311

arXiv 2025
[22]

Design patterns for securing llm applications against prompt injections,

L. Beurer-Kellner et al., “Design patterns for securing llm applications against prompt injections,” arXiv preprint arXiv:2506.08837, 2025. [Online]. Available: https://arxiv.org/abs/2506.08837

arXiv 2025
[23]

Scalability via sparsity in stackelberg security games: An augmented decision space approach,

A. Żychowski, A. Gupta, Y.-S. Ong, and J. Mańdziuk, “Scalability via sparsity in stackelberg security games: An augmented decision space approach,” ACM Transactions on Evolutionary Learning, 2026

2026
[24]

Defending against indirect prompt injection attacks with spotlighting,

K. Hines, G. Lopez, M. Hall, and E. Kiciman, “Defending against indirect prompt injection attacks with spotlighting,” arXiv preprint arXiv:2403.14720, 2024

Pith/arXiv arXiv 2024
[25]

The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents,

F. Jia, T. Wu, X. Qin, and A. Squicciarini, “The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents,” arXiv preprint arXiv:2412.16682, 2024

arXiv 2024
[26]

Secinfer: Preventing prompt injection via inference-time scaling,

Y. Liu, Y. Wang, Y. Jia, and N. Z. Gong, “Secinfer: Preventing prompt injection via inference-time scaling,” arXiv preprint arXiv:2509.24967, 2025

arXiv 2025
[27]

A multi-agent llm defense pipeline against prompt injection attacks,

S. M. A. Hossain, R. K. Shayoni, M. R. Ameen et al., “A multi-agent llm defense pipeline against prompt injection attacks,” arXiv preprint arXiv:2509.14285, 2025

arXiv 2025
[28]

A systematic literature review on llm defenses against prompt injection and jailbreaking,

P. H. B. Correia, R. W. Achjian, D. E. G. Caetano de Oliveira et al., “A systematic literature review on llm defenses against prompt injection and jailbreaking,” arXiv preprint arXiv:2601.22240, 2026

arXiv 2026
