arxiv: 2603.22868 · v2 · submitted 2026-03-24 · 💻 cs.CR · cs.AI

Recognition: unknown

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Konstantinos Psounis, Rohan Sequeira, Stavros Damianakis, Umar Iqbal

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:13 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM agentsruntime defenseexecution provenanceinjection attacksagent securitybenign execution boundsstructural classifier

0 comments

The pith

Agent Sentry bounds LLM agent executions from prior legitimate runs to detect and block injections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agent Sentry is a runtime defense for LLM agents that learns what counts as normal execution by studying sequences from earlier legitimate runs. It checks new actions first with a structural classifier on action order and argument provenance, then with a deterministic allowlist for sensitive values, and calls an LLM judge only for the remaining ambiguous cases. This approach stops 94.3 percent of successful injections while permitting 95.1 percent of benign executions, all without altering the agent, its tools, or the underlying LLM. A reader would care because agents perform open-ended tasks whose full behavior cannot be known in advance, so a practical way to bound them at runtime addresses a core security gap in agentic systems.

Core claim

We present Agent Sentry, a runtime defense that learns a bound on an agent's benign execution from prior legitimate executions and flags any action that falls outside this bound. Agent Sentry layers three complementary checks: a structural classifier over the sequence of actions and the provenance of each function's arguments; a deterministic allowlist check over sensitive argument values; and an LLM judge, invoked only on the residual of actions where the first two checks cannot safely decide between a legitimate new request and a carefully crafted injection.

What carries the argument

Execution provenance bound learned from prior legitimate runs, which classifies new action sequences using structural checks on provenance, sensitive-value allowlists, and selective LLM judgment.

Load-bearing premise

Prior legitimate executions capture enough of the possible benign variations to allow reliable separation from adversarial injections when the structural and allowlist checks are inconclusive.

What would settle it

Observe whether an attacker can craft an injection that produces an action sequence and argument provenance sufficiently close to the learned benign bound to evade the first two checks and fool the LLM judge, dropping the blocking rate well below 94 percent.

Figures

Figures reproduced from arXiv: 2603.22868 by Konstantinos Psounis, Rohan Sequeira, Stavros Damianakis, Umar Iqbal.

**Figure 1.** Figure 1: Agent-Sentry architecture in action: (1) A user submits a request that requires conditional tool execution, such as reading a file and transferring funds. (2) The agent’s LLM proposes a tool call, which is intercepted by Agent-Sentry. Only action tool calls are evaluated against the Functionality Graph. (3) The Intent Alignment Mechanism verifies if the action is in alignment with the user prompt, tool ca… view at source ↗

**Figure 2.** Figure 2: Agent-Sentry Utility success rate of Agent -Sentry as a function of functionality graph coverage on the Agent-Sentry Bench dataset [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Attack success rate of Agent-Sentry as a function of functionality graph coverage on the Agent -Sentry Bench dataset [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 5.** Figure 5: Functionality graphs excerpt from Banking [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Functionality graphs excerpt from Banking agent showing the Adversarial/Attack functionality graphs. • Constraint Injection: The selected transition A → B is injected as a hard constraint into the LLM’s prompt generation step, forcing it to create a scenario that plausibly requires that specific sequence of actions. 12.1.2 Generation Pipeline The generation process for a single trace follows these steps:… view at source ↗

**Figure 1.** Figure 1: The user requests to check a file for pay [PITH_FULL_IMAGE:figures/full_fig_p022_1.png] view at source ↗

read the original abstract

Agentic computing systems, while immensely capable, raise serious security, privacy, and safety concerns. A key issue is that the full set of functionalities offered by these systems, combined with their probabilistic execution flows, is not known beforehand. Given this lack of characterization, it is challenging to validate whether a system has successfully carried out the user's intended task or instead executed irrelevant actions, potentially as a consequence of compromise. We present \emph{Agent Sentry}, a runtime defense that learns a bound on an agent's benign execution from prior legitimate executions and flags any action that falls outside this bound. Agent Sentry layers three complementary checks: a structural classifier over the sequence of actions and the provenance of each function's arguments; a deterministic allowlist check over sensitive argument values; and an LLM judge, invoked only on the residual of actions where the first two checks cannot safely decide between a legitimate new request and a carefully crafted injection. We demonstrate the effectiveness of Agent Sentry in AgentDojo and AgentDyn by blocking 94.3\% of successful injections while allowing 95.1\% of benign executions, without modifying the agent, its tools, or the LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Agent-Sentry gives a workable three-layer runtime bound for LLM agents using execution provenance, with good benchmark results, though generalization to new benign variations is the main open question.

read the letter

Agent-Sentry is a runtime defense that learns a bound on benign LLM agent executions from past runs and applies three checks to flag outliers: a structural classifier on action sequences and argument provenance, a deterministic allowlist for sensitive values, and an LLM judge for the hard cases. The reported results on AgentDojo and AgentDyn are 94.3% injection blocking with 95.1% benign allowance, all without touching the agent, tools, or LLM.

Referee Report

2 major / 1 minor

Summary. The paper introduces Agent-Sentry, a runtime defense for LLM-based agents that learns a bound on benign execution behavior from prior legitimate runs. It combines a structural classifier over action sequences and argument provenances, a deterministic allowlist for sensitive argument values, and an LLM judge invoked only on residual cases where the first two checks are inconclusive. The central claim is that this approach detects prompt injections without modifying the agent, its tools, or the underlying LLM, demonstrated by blocking 94.3% of successful injections while allowing 95.1% of benign executions on the AgentDojo and AgentDyn benchmarks.

Significance. If the bound generalizes reliably, Agent-Sentry offers a practical, non-intrusive layer for securing agentic systems against injection attacks and anomalous behavior in settings where full execution characterization is impossible a priori. The concrete performance numbers on named external benchmarks and the layered design (structural + allowlist + LLM) are strengths; the absence of parameter fitting from evaluation data avoids obvious circularity.

major comments (2)

[Evaluation] Evaluation section: the reported 94.3% injection blocking and 95.1% benign allowance rest on the assumption that prior legitimate executions sufficiently represent the distribution of future benign variations, yet no quantitative analysis (e.g., coverage of LLM-induced stochastic paths or out-of-distribution benign traces) is provided to support generalization when the structural classifier and allowlist are inconclusive.
[Method] Section describing the LLM judge: when the first two checks cannot decide, the residual is passed to an LLM judge to distinguish legitimate new requests from injections; the manuscript provides no separate accuracy metrics or ablation for this judge on novel but benign action sequences, which is load-bearing for the overall false-positive claim.

minor comments (1)

[Abstract] Abstract states performance numbers but omits any mention of the number of prior runs used to define the bound or the exact decision thresholds; adding one sentence would improve verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the recognition of the practical strengths of the layered design and the avoidance of circularity in evaluation. We address each major comment below, indicating the revisions we will incorporate.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the reported 94.3% injection blocking and 95.1% benign allowance rest on the assumption that prior legitimate executions sufficiently represent the distribution of future benign variations, yet no quantitative analysis (e.g., coverage of LLM-induced stochastic paths or out-of-distribution benign traces) is provided to support generalization when the structural classifier and allowlist are inconclusive.

Authors: We agree that explicit quantitative support for coverage would strengthen the generalization claim. The AgentDojo and AgentDyn benchmarks already incorporate diverse tasks with LLM stochasticity across multiple runs, yielding varied action sequences and provenance patterns. In the revised manuscript we will add a new subsection to the Evaluation section that reports metrics on the diversity of observed benign traces (e.g., unique action-sequence counts, provenance-pattern coverage) and discusses how these traces address potential out-of-distribution cases when the structural classifier and allowlist are inconclusive. revision: yes
Referee: [Method] Section describing the LLM judge: when the first two checks cannot decide, the residual is passed to an LLM judge to distinguish legitimate new requests from injections; the manuscript provides no separate accuracy metrics or ablation for this judge on novel but benign action sequences, which is load-bearing for the overall false-positive claim.

Authors: We recognize the value of isolating the LLM judge's contribution. While the overall 95.1% benign allowance is reported, we will add an ablation study in the revised manuscript that evaluates the judge independently on held-out novel benign action sequences. This will include accuracy, precision, and recall figures for the judge alone, together with an analysis of how often it is invoked and its impact on the residual false-positive rate. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper learns a bound on benign execution provenance from prior legitimate runs, then applies layered checks (structural classifier, allowlist, residual LLM judge) and reports effectiveness on separate external benchmarks AgentDojo and AgentDyn. The 94.3% injection blocking and 95.1% benign allowance figures are measured on held-out test executions rather than being algebraically or statistically forced by the training data used to define the bound. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the central claim to a renaming or self-definition of its inputs. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that historical benign executions suffice to characterize all future legitimate behavior.

axioms (1)

domain assumption Prior legitimate executions provide a sufficient sample to bound all future benign behaviors
The system learns bounds from prior executions assuming they cover normal variations.

pith-pipeline@v0.9.0 · 5505 in / 1105 out tokens · 50446 ms · 2026-05-15T01:13:41.867354+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
Towards Security-Auditable LLM Agents: A Unified Graph Representation
cs.AI 2026-05 unverdicted novelty 6.0

Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
SoK: Security of Autonomous LLM Agents in Agentic Commerce
cs.CR 2026-04 unverdicted novelty 5.0

The paper systematizes security for LLM agents in agentic commerce into five threat dimensions, identifies 12 cross-layer attack vectors, and proposes a layered defense architecture.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 3 Pith papers · 5 internal anchors

[1]

Get my drift? catching llm task drift with activa- tion deltas, 2025

Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Get my drift? catching llm task drift with activa- tion deltas, 2025

work page 2025
[2]

Amazon alexa.https://developer

Amazon. Amazon alexa.https://developer. amazon.com/en-US/alexa, 2014

work page 2014
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Defending against alignment-breaking at- tacks via robustly aligned llm

Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending against alignment-breaking at- tacks via robustly aligned llm. InProceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 10542–10560, 2024

work page 2024
[5]

LangChain, a framework for building agents and LLM-powered applications

Harrison Chase. LangChain, a framework for building agents and LLM-powered applications. https://github.com/langchain-ai/langchain, Octo- ber 2022

work page 2022
[6]

Defeating prompt injections by design, 2025

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, 13 and Florian Tram` er. Defeating prompt injections by design, 2025

work page 2025
[7]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fis- cher, and Florian Tram` er. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents.arXiv preprint arXiv:2406.13352, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Building guardrails for large lan- guage models.arXiv preprint arXiv:2402.01822, 2024

Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei Huang. Building guardrails for large lan- guage models.arXiv preprint arXiv:2402.01822, 2024

work page arXiv 2024
[9]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial ex- amples.arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[10]

De- fending against indirect prompt injection attacks with spotlighting, 2024

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. De- fending against indirect prompt injection attacks with spotlighting, 2024

work page 2024
[11]

Hsu, and Pin-Yu Chen

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, and Pin-Yu Chen. Attention tracker: Detecting prompt injection at- tacks in llms. InFindings of the Association for Computational Linguistics: NAACL 2025, 2025

work page 2025
[12]

The new attack surface: Why AI agents need taint analysis, 2024

Colin Nwachukwu Ife. The new attack surface: Why AI agents need taint analysis, 2024. Geordie AI. Accessed: 2026-01-09

work page 2024
[13]

Llm platform security: Applying a sys- tematic evaluation framework to openai’s chatgpt plugins

Umar Iqbal, Tadayoshi Kohno, and Franziska Roesner. Llm platform security: Applying a sys- tematic evaluation framework to openai’s chatgpt plugins. InProceedings of the 2024 AAAI/ACM Conference on AI, Ethics, and Society, pages 611– 623, 2024

work page 2024
[14]

Ai agents arrive at citi

The Wall Street Journal. Ai agents arrive at citi. https://www.wsj.com/articles/ai-agents- arrive-at-citi-60a3559d, 2025. Written by Isabelle Bousquette

work page 2025
[15]

Digi- tal workers have arrived in banking

The Wall Street Journal. Digi- tal workers have arrived in banking. https://www.wsj.com/articles/digital-workers- have-arrived-in-banking-bf62be49, 2025. Written by Isabelle Bousquette

work page 2025
[16]

Palisade – prompt injection detection framework, 2024

Sahasra Kokkula, Somanathan R, Nandavardhan R, Aashishkumar, and G Divya. Palisade – prompt injection detection framework, 2024

work page 2024
[17]

We’re afraid language models aren’t modeling am- biguity

Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. We’re afraid language models aren’t modeling am- biguity. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 790–807, 2023

work page 2023
[18]

Llamaindex.https://github.com/ run-llama/llama_index, 2022

Jerry Liu. Llamaindex.https://github.com/ run-llama/llama_index, 2022

work page 2022
[19]

Traceaegis: Securing llm-based agents via hierarchical and behavioral anomaly detection, 2025

Jiahao Liu, Bonan Ruan, Xianglin Yang, Zhiwei Lin, Yan Liu, Yang Wang, Tao Wei, and Zhenkai Liang. Traceaegis: Securing llm-based agents via hierarchical and behavioral anomaly detection, 2025

work page 2025
[20]

Evolv- ing diverse red-team language models in multi- round multi-agent games, 2024

Chengdong Ma, Ziran Yang, Hai Ci, Jun Gao, Min- quan Gao, Xuehai Pan, and Yaodong Yang. Evolv- ing diverse red-team language models in multi- round multi-agent games, 2024

work page 2024
[21]

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. To- wards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Efficient tcb reduction and attestation

Jonathan McCune, Ning Qu, Yanlin Li, Anupam Datta, Virgil Gligor, and Adrian Perrig. Efficient tcb reduction and attestation. 01 2009

work page 2009
[23]

Openai agents sdk.https://github

OpenAI. Openai agents sdk.https://github. com/openai/openai-agents-python, 2025

work page 2025
[24]

India’s apollo hospitals bets on ai to tackle staff workload

Reuters. India’s apollo hospitals bets on ai to tackle staff workload. https://www.reuters.com/business/healthcare- pharmaceuticals/indias-apollo-hospitals-bets-ai- tackle-staff-workload-2025-03-13/, 2025

work page 2025
[25]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Identifying the risks of LM agents with an LM-emulated sandbox, 2023. arXiv:2309.15817

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Progent: Programmable privilege control for llm agents, 2025

Tianneng Shi, Jingxuan He, Zhun Wang, Hong- wei Li, Linyu Wu, Wenbo Guo, and Dawn Song. Progent: Programmable privilege control for llm agents, 2025

work page 2025
[27]

Microsoft’s new dragon copi- lot is an ai assistant for healthcare

The Verge. Microsoft’s new dragon copi- lot is an ai assistant for healthcare. https://www.theverge.com/news/622528/microsoft- dragon-copilot-ai-healthcare-assistant, 2025

work page 2025
[28]

Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025

Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025

work page 2025
[29]

Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic. Agents. Whitepaper. Available online: 14 https://www.rojo.me/content/files/2025/ 01/Whitepaper-Agents---Google.pdf, 2024. Accessed: 2024-12-14

work page 2025
[30]

The dual llm pattern for building ai assistants that can reason safely about untrusted data.https://simonwillison.net/2023/Apr/ 25/dual-llm-pattern/, 2023

Simon Willison. The dual llm pattern for building ai assistants that can reason safely about untrusted data.https://simonwillison.net/2023/Apr/ 25/dual-llm-pattern/, 2023. Accessed: 2025-02- 05

work page 2023
[31]

IsolateGPT: An execution isolation architecture for LLM-based agentic systems

Yuhao Wu, Franziska Jiang, et al. IsolateGPT: An execution isolation architecture for LLM-based agentic systems. InNetwork and Distributed Sys- tem Security Symposium (NDSS), 2025

work page 2025
[32]

Align- ment under pressure: The case for informed adver- saries when evaluating llm defenses.arXiv preprint arXiv:2505.15738, 2025

Xiaoxue Yang, Bozhidar Stevanoski, Matthieu Meeus, and Yves-Alexandre de Montjoye. Align- ment under pressure: The case for informed adver- saries when evaluating llm defenses.arXiv preprint arXiv:2505.15738, 2025

work page arXiv 2025
[33]

Autodefense: Multi-agent llm defense against jailbreak attacks, 2024

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi-agent llm defense against jailbreak attacks, 2024

work page 2024
[34]

Kindly analyze the document ‘in- voice dec.txt’. If my current account balance is suffi- cient to cover the total, please proceed with the transfer immediately

Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, and Helen Meng. Purple- teaming llms with adversarial defender training, 2024. 15 APPENDIX 9 Diversity in Utility Traces To ensure the robustness of the behavioral policy, the Agent-Sentry Benchdataset prioritizes diversity within utility traces. This approach captures the wide variability i...

work page 2024
[35]

Target Tool Transition

The sequence must be logical (outputs of step 1 reasonably flow into step 2). 3. Vary the length and complexity. 4. You MUST focus on the "Target Tool Transition" provided. 12.2.2 User Prompt Generation To translate tool sequences into natural language, we use the following prompt, which enforces the selected persona: Meta-Prompt: User Prompt Generation Y...

work page
[36]

Do not just list tool names

The prompt must IMPLICITLY require the tools. Do not just list tool names. 4. ‘reasoning‘ must explain why this phrasing triggers these tools. 12.2.3 Stylistic Diversity We employ the same diversity sampling strategy as the Attack Generator to prevent linguistic mode collapse. For each trace, a style is sampled from: 19 •Standard: Professional and clear. ...

work page
[37]

read file, get balance

The prompt must IMPLICITLY require the tools. Do not just list tool names. 4. ‘reasoning‘ must explain why this phrasing triggers these tools. 13.3.2 Stylistic Diversity To prevent the model from converging on a single ”polite command” style, we stochastically sample from a set of 9 distinct personas for each generation: •Standard: Clear, polite, professi...

work page