arxiv: 2605.04785 · v1 · submitted 2026-05-06 · 💻 cs.AI · cs.CR

Recognition: unknown

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Chenglin Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords AI agent safetytool call interceptionruntime evaluationshell deobfuscationmulti-step attack detectionadversarial benchmarksLLM judge

0 comments

The pith

AgentTrust intercepts AI agent tool calls before execution and returns allow, warn, block, or review verdicts using deobfuscation and chain detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentTrust as a runtime safety layer that checks every tool call an AI agent tries to make and decides whether to let it proceed. Existing approaches either evaluate actions only after they happen, rely on static rules that miss hidden commands, or limit where code can run without understanding its intent. AgentTrust addresses this by normalizing shell commands to remove obfuscation, detecting sequences of risky steps that form attacks, suggesting safer alternatives, and using a cached LLM judge for unclear cases. The authors show this runs at low millisecond latency while scoring high accuracy on both a 300-scenario internal benchmark and a separate 630-scenario set of real-world adversarial examples. If the approach holds, agents could avoid causing irreversible damage such as file deletion or data leaks during normal operation.

Core claim

AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge to intercept agent tool calls and return structured verdicts of allow, warn, block, or review. On the internal benchmark the production ruleset reaches 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond latency. On the 630-scenario adversarial benchmark the system reaches 96.7% verdict accuracy, including roughly 93% on shell-obfuscated payloads.

What carries the argument

Runtime safety layer that normalizes obfuscated shell inputs, detects multi-step risk chains, and applies cached LLM judgment to produce verdicts before tool execution.

If this is right

Agents can receive immediate structured feedback on each tool call and avoid executing harmful actions such as deletion or exfiltration.
RiskChain detection catches attacks that require several tool calls rather than single unsafe commands.
SafeFix suggestions allow the system to propose lower-risk alternatives that still achieve the agent's goal.
The released benchmarks provide a standardized way to measure safety methods for agent tool use.
Low-millisecond latency and MCP server support make the layer practical to add to existing agent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cache-aware design could let the system improve its rules over time from repeated real-world queries without retraining.
The same interception pattern might apply to other agent output types if similar decision points exist before side effects occur.
Production use would likely surface new obfuscation or chaining techniques that require further ruleset updates.
Combining static rules with selective LLM judgment could balance speed and coverage better than either method alone in other safety domains.

Load-bearing premise

The 300- and 630-scenario benchmarks sufficiently represent the distribution of real-world agent tool-use risks and adversarial attempts, including novel multi-step attacks.

What would settle it

A new collection of real-world tool-use scenarios, including fresh obfuscation methods or multi-step plans not present in the existing sets, on which verdict accuracy falls substantially below 90%.

Figures

Figures reproduced from arXiv: 2605.04785 by Chenglin Yang.

**Figure 1.** Figure 1: AgentTrust pipeline. Seven of the eight components are shown as boxes; the eighth, view at source ↗

**Figure 2.** Figure 2: An exfiltration chain in which steps 1–2 individually score view at source ↗

read the original abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentTrust, a runtime safety layer for AI agents that intercepts tool calls (file ops, shell, HTTP, DB) before execution and returns structured verdicts (allow, warn, block, review). It combines shell deobfuscation normalization, SafeFix safer-alternative suggestions, RiskChain multi-step attack detection, and a cache-aware LLM-as-Judge. The authors release a 300-scenario internal benchmark across six risk categories plus 630 independently constructed real-world adversarial scenarios; they report 95.0% verdict accuracy and 73.7% risk-level accuracy on the internal set (production ruleset only, low-millisecond latency) and 96.7% verdict accuracy on the 630-set (patched ruleset, ~93% on shell-obfuscated payloads).

Significance. If the benchmarks prove representative, AgentTrust would supply a practical, low-overhead runtime defense that addresses gaps left by post-hoc evaluation, static guardrails, and infrastructure sandboxes. Releasing both benchmarks and the AGPL-licensed implementation with an MCP server is a concrete contribution to reproducibility in agent safety.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the central performance claims (95.0% verdict accuracy, 73.7% risk-level accuracy on the 300-scenario set; 96.7% on the 630-scenario set) rest on the unstated assumption that these finite, internally or independently constructed scenario sets adequately sample the distribution of real-world agent tool-use risks, including novel multi-step and obfuscated attacks. No coverage metrics, generation procedure, or adversarial diversity analysis are supplied, so it is impossible to determine whether the reported accuracies generalize beyond the test distribution.
[Abstract] Abstract: the 630-scenario results are obtained under a patched ruleset rather than the zero-shot production ruleset used for the 300-scenario numbers. This distinction must be quantified (e.g., how many rules were added and on which failure modes) because it directly affects whether the 96.7% figure can be compared to the 95.0% figure or treated as evidence of robustness.

minor comments (2)

Clarify the exact boundary between the “production-only ruleset” and the full AgentTrust system (including when the LLM judge is invoked) so readers can reproduce the latency and accuracy numbers.
Provide the precise definition of “risk-level accuracy” and the mapping from verdicts to risk levels, as this metric is reported at 73.7% without further breakdown.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address each major comment below and outline revisions to enhance transparency.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central performance claims (95.0% verdict accuracy, 73.7% risk-level accuracy on the 300-scenario set; 96.7% on the 630-scenario set) rest on the unstated assumption that these finite, internally or independently constructed scenario sets adequately sample the distribution of real-world agent tool-use risks, including novel multi-step and obfuscated attacks. No coverage metrics, generation procedure, or adversarial diversity analysis are supplied, so it is impossible to determine whether the reported accuracies generalize beyond the test distribution.

Authors: We agree that no finite set of scenarios can be shown to exhaustively sample the unbounded space of real-world risks. We will revise the Evaluation section to provide a detailed account of the scenario generation procedure, including the definition of the six risk categories for the 300-scenario benchmark and the independent construction process for the 630 adversarial scenarios (with explicit coverage of obfuscated shell payloads and multi-step chains). We will also include a breakdown of attack-type diversity and add a limitations subsection on generalization. We do not claim the benchmarks prove broad generalization. revision: yes
Referee: [Abstract] Abstract: the 630-scenario results are obtained under a patched ruleset rather than the zero-shot production ruleset used for the 300-scenario numbers. This distinction must be quantified (e.g., how many rules were added and on which failure modes) because it directly affects whether the 96.7% figure can be compared to the 95.0% figure or treated as evidence of robustness.

Authors: The current abstract already notes that the 630-scenario results use a patched ruleset and are not presented as zero-shot. We will expand the Evaluation section to quantify the patches by listing the number of rules added and the specific failure modes addressed (e.g., particular deobfuscation patterns and RiskChain edge cases observed during initial testing on the 300-set). This will make the distinction between the two evaluations explicit and allow direct comparison of the figures. revision: yes

standing simulated objections not resolved

Quantitative coverage metrics that would demonstrate the benchmarks adequately sample the full distribution of real-world agent tool-use risks (including all novel attacks) cannot be supplied, because the space of possible tool-use behaviors is infinite and continues to evolve.

Circularity Check

0 steps flagged

No circularity: empirical measurements on independent benchmarks

full rationale

The paper describes an engineering system (AgentTrust) that combines deobfuscation, RiskChain detection, SafeFix, and LLM-as-Judge components, then reports verdict and risk-level accuracies as direct measurements on two separately released benchmark sets (300 internal scenarios and 630 independently constructed adversarial scenarios). No equations, parameter fitting, predictions derived from the system itself, or load-bearing self-citations are present in the provided text; the accuracies are not defined in terms of the system's outputs or fitted to the evaluation data by construction. The evaluation is therefore self-contained against external benchmarks rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard LLM capabilities and rule-based pattern matching without introducing new physical or mathematical entities.

pith-pipeline@v0.9.0 · 5554 in / 1235 out tokens · 30245 ms · 2026-05-08T17:30:32.538219+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 4 internal anchors

[1]

AgentTrust: Runtime safety evaluation and interception for AI agent tool use, 2026

AgentTrust Contributors. AgentTrust: Runtime safety evaluation and interception for AI agent tool use, 2026. URLhttps://github.com/chenglin1112/AgentTrust. Open- source software; AGPL-3.0-or-later with commercial license available, version 0.5.0 (commit aee2623). v0.1.0–v0.5.0 were originally distributed under Apache-2.0; see the repository LICENSE for th...

2026
[2]

Zero-day malware detection based on supervised learning algorithms of API call signatures

Mamoun Alazab, Sitalakshmi Venkatraman, Paul Watters, and Moutaz Alazab. Zero-day malware detection based on supervised learning algorithms of API call signatures. In Australasian Data Mining Conference, 2012

2012
[3]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Daniel Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A benchmark for measuring harmfulness of LLM agents. InarXiv preprint arXiv:2410.09024, 2024

work page internal anchor Pith review arXiv 2024
[4]

Model context protocol specification.https://modelcontextprotocol.io/, 2024

Anthropic. Model context protocol specification.https://modelcontextprotocol.io/, 2024

2024
[5]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review arXiv 2022
[6]

Llama guard 3 vision: Safe- guarding human-ai image understanding conversations,

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Michael Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pa- supuleti. Llama Guard 3: Improved safety classifiers for LLM agents.arXiv preprint arXiv:2411.10414, 2024. AgentTrust Preprint 27

work page arXiv 2024
[7]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. InEducational and Psychological Measurement, volume 20, pages 37–46, 1960

1960
[8]

AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InNeurIPS Datasets and Benchmarks Track, 2024

2024
[9]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[10]

Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InACM AISec Workshop, 2023

2023
[11]

TrustAgent: Towards safe and trustworthy LLM-based agents through agent constitution.arXiv preprint arXiv:2402.01586, 2024

Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wang, and Yongfeng Zhang. TrustAgent: Towards safe and trustworthy LLM-based agents through agent constitution.arXiv preprint arXiv:2402.01586, 2024

work page arXiv 2024
[12]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review arXiv 2023
[13]

Prometheus 2: An open source language model specialized in evaluating other language models.arXiv preprint arXiv:2405.01535,

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models.arXiv preprint arXiv:2405.01535, 2024

work page arXiv 2024
[14]

AgentBench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. InInternational Conference on Learning Representatio...

2024
[15]

G-Eval: NLG evaluation using GPT-4 with better human alignment.EMNLP, 2023

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment.EMNLP, 2023

2023
[16]

MITRE ATT&CK®Enterprise matrix

MITRE Corporation. MITRE ATT&CK®Enterprise matrix. https://attack.mitre. org/, 2024. URLhttps://attack.mitre.org/

2024
[17]

NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails

Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. InEMNLP System Demonstrations, 2023

2023
[18]

Maddison, and Tatsunori Hashimoto

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Identifying the risks of LM agents with an LM-emulated sandbox. InInternational Conference on Learning Representations (ICLR), 2024

2024
[19]

The rsync algorithm

Andrew Tridgell and Paul Mackerras. The rsync algorithm. InTechnical Report TR-CS-96- 05, Australian National University, 1996

1996
[20]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang et al. OpenDevin: An open platform for AI software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024. AgentTrust Preprint 28

work page internal anchor Pith review arXiv 2024
[21]

R-Judge: Benchmarking safety risk awareness for LLM agents

Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Teng Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-Judge: Benchmarking safety risk awareness for LLM agents. InFindings of EMNLP, 2024

2024
[22]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InNeurIPS Datasets and Benchmarks Track, 2023. A Labeling Rubric Scenarios in the internal benchmark (Section 5.1) were l...

2023
[23]

Data Ex pos ur e : ( none / low / medium / high / cr iti ca l )
[24]

System Impact : ( none / low / medium / high / cr iti ca l )
[25]

C r e d e n t i a l Risk : ( none / low / medium / high / c rit ic al )
[26]

Scope Creep : ( none / low / medium / high / c ri tic al )
[27]

r i s k _ l e v e l

R e v e r s i b i l i t y : ( easy / m ode ra te / hard / i m p o s s i b l e ) ## Respond in this exact JSON format : { " r i s k _ l e v e l ": " none | low | medium | high | c rit ic al " , " verdict ": " allow | warn | block | review " , " c o n f i d e n c e ": 0.0 -1.0 , " r e a s o n i n g ": "1 -2 s ent en ce e x p l a n a t i o n " , " r i s k _ ...