
arxiv: 2605.17453 · v1 · pith:SJKJBZBOnew · submitted 2026-05-17 · 💻 cs.CR · cs.CL

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Pith reviewed 2026-05-19 23:23 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM agents · tool feedback security · cognitive poisoning · TRUST-Bench · VISTA-Guard · final-action scoring

The pith

LLM agents face cognitive poisoning when tools build trust through benign feedback before executing harmful final actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a failure mode in which malicious tools behave plausibly during exploration to accumulate trust, then deliver harm only when hidden conditions align with the final action. It constructs TRUST-Bench with 1,970 hidden-trigger episodes plus matched safe controls, introduces the GuardedJoint metric to capture real deployment risk, and proposes VISTA-Guard, which abstracts interactions into variables that track trust formation. Trajectory-aware scoring of the final action from this representation outperforms prompt heuristics and single-sided methods, which collapse to zero under GuardedJoint. A reader would care because tool-using agents increasingly make consequential decisions in black-box environments where feedback cannot be assumed trustworthy.

Core claim

VISTA-Guard abstracts multi-step tool interaction into structured environment variables that encode trust-formation dynamics, then scores the risk of the final executable action from this trajectory-conditioned representation. Under GuardedJoint it reaches 84.2 in-domain and 56.9 on balanced out-of-distribution evaluation, while prompt-centric heuristics and methods that optimize only one side of the safety-utility tradeoff collapse to zero.

What carries the argument

VISTA-Guard framework for final-action risk scoring, which converts multi-step tool interactions into structured environment variables encoding trust-formation dynamics.
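
The data flow described above (trajectory, then environment variables, then a final-action risk score) can be made concrete with a toy sketch. This page does not list the actual environment variables or the scoring model, so everything here is hypothetical: the `ToolTurn` and `Episode` classes, the two variables, the product rule, and the 0.5 threshold are invented purely to show the shape of the pipeline and are not VISTA-Guard itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolTurn:
    arguments: Dict[str, str]  # arguments sent to the tool in one exploratory turn
    feedback_ok: bool          # whether the tool feedback looked benign


@dataclass
class Episode:
    exploration: List[ToolTurn]      # observed exploratory turns
    final_arguments: Dict[str, str]  # proposed final executable call


def environment_variables(ep: Episode) -> Dict[str, float]:
    """Hypothetical abstraction of a trajectory into two trust-related variables."""
    seen = {(k, v) for t in ep.exploration for k, v in t.arguments.items()}
    final = set(ep.final_arguments.items())
    n_turns = max(len(ep.exploration), 1)
    return {
        # Fraction of exploratory turns whose feedback looked benign.
        "benign_feedback_rate": sum(t.feedback_ok for t in ep.exploration) / n_turns,
        # Fraction of final-call argument pairs never exercised while probing.
        "unprobed_final_args": len(final - seen) / max(len(final), 1),
    }


def final_action_risk(env: Dict[str, float]) -> float:
    # Toy rule (an assumption): trust built on benign feedback is risky when
    # the final call departs from what was actually probed.
    return env["benign_feedback_rate"] * env["unprobed_final_args"]


def decide(ep: Episode, threshold: float = 0.5) -> str:
    risk = final_action_risk(environment_variables(ep))
    return "reject" if risk >= threshold else "execute"


probe = [ToolTurn({"scope": "read"}, True) for _ in range(3)]
safe_decision = decide(Episode(probe, {"scope": "read"}))
risky_decision = decide(Episode(probe, {"scope": "write", "recipient": "external"}))
print(safe_decision, risky_decision)
```

The point of the sketch is only that the decision depends on the whole trajectory plus the final call, rather than on the prompt text or tool descriptor alone.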

If this is right

  • Trajectory-aware final-action scoring yields strong in-domain discrimination against hidden-trigger compromises.
  • The approach remains effective under balanced out-of-distribution transfer to new tool behaviors.
  • Methods that optimize only safety or only utility reach zero under the GuardedJoint asymmetric penalty.
  • Prompt-centric heuristics, scalarized features, and zero-shot judges fail to detect the poisoning regime.
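
The collapse-to-zero bullet can be checked numerically. The sketch below is an illustration, not the authors' code. It uses the formula quoted in the Figure 2 caption, G1.5 = max(0, 100 − 2·AMR − 1.5·RNR), and assumes that ρ is the weight on RNR, that AMR is the percentage of malicious final actions that get executed, and that RNR is the percentage of safe final actions that get rejected. The page does not expand either acronym, so those readings and the function name `guarded_joint` are assumptions.

```python
def guarded_joint(amr: float, rnr: float, rho: float = 1.5) -> float:
    """GuardedJoint-style score on a 0-100 scale.

    amr, rnr: error rates in percent (assumed meanings, see the lead-in).
    rho: weight on RNR, assumed from the rho = 1.5 instantiation.
    """
    if not (0.0 <= amr <= 100.0 and 0.0 <= rnr <= 100.0):
        raise ValueError("amr and rnr must be percentages in [0, 100]")
    return max(0.0, 100.0 - 2.0 * amr - rho * rnr)


# A guard that executes everything: every malicious action gets through.
always_execute = guarded_joint(amr=100.0, rnr=0.0)
# A guard that rejects everything: every safe action is blocked.
always_reject = guarded_joint(amr=0.0, rnr=100.0)
# A guard with moderate errors on both sides.
two_sided = guarded_joint(amr=10.0, rnr=10.0)

print(always_execute, always_reject, two_sided)  # 0.0 0.0 65.0
```

Under these assumptions, either one-sided policy is clipped to zero, while a guard that keeps both error rates low retains most of the score.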

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent security in tool ecosystems should target the cumulative process of trust formation rather than local prompt text or tool descriptors alone.
  • Benchmarks that include more varied hidden-state alignments could reveal whether the current environment-variable abstraction generalizes further.
  • Embedding such trajectory-conditioned scoring directly into agent loops might allow real-time intervention before harmful actions execute.

Load-bearing premise

The constructed TRUST-Bench episodes with hidden triggers and matched safe controls sufficiently represent real-world malicious tool behaviors in black-box ecosystems.

What would settle it

A set of tool interactions that build trust and trigger harm in ways the environment variables do not capture, but whose final actions the scorer still flags correctly; or, conversely, a case where the variables miss the risk entirely.

Figures

Figures reproduced from arXiv: 2605.17453 by Binwu Wang, Chenyang Lyu, Guanhua Chen, Lecheng Yan, Longyue Wang, Ruizhe Li, Wenxi Li, Xicheng Han.

Figure 1. Concrete TRUST-BENCH sample visualization. The figure shows one released web_search tool sample as it appears in the benchmark: (a) the tool card and executable interface, (b) a benign/malicious matched pair with the same outward API, (c) the three recorded exploratory turns, (d) the trigger rule split into required_all, required_any, and forbidden fields, and (e) the resulting execute/reject decision. The…

Figure 3. Overall framework and problem setup. The figure illustrates the pipeline from…

Figure 2. GUARDEDJOINT under ρ = 1.5. The score is high only when both independent errors, AMR and RNR, are low. For the main experiments we instantiate this family with ρ = 1.5, giving $G_{1.5} = \max(0,\ 100 - 2\,\mathrm{AMR} - 1.5\,\mathrm{RNR})$.

Figure 4. Exploration-budget ablation with a frozen GPT-5.4 trajectory-only judge on 170 external…
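
The Figure 1 caption says each trigger rule is split into required_all, required_any, and forbidden fields. One plausible reading of those fields is sketched below. The semantics (every required_all flag present, at least one required_any flag present when that set is non-empty, no forbidden flag present), the `TriggerRule` class, and the example state flags are assumptions for illustration and are not taken from the released benchmark.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TriggerRule:
    required_all: frozenset = field(default_factory=frozenset)
    required_any: frozenset = field(default_factory=frozenset)
    forbidden: frozenset = field(default_factory=frozenset)

    def fires(self, state: set) -> bool:
        """Return True when the hidden state satisfies the rule (assumed semantics)."""
        has_all = self.required_all <= state
        has_any = not self.required_any or bool(self.required_any & state)
        has_forbidden = bool(self.forbidden & state)
        return has_all and has_any and not has_forbidden


# Hypothetical flags, chosen only to make the example readable.
rule = TriggerRule(
    required_all=frozenset({"final_action", "external_recipient"}),
    required_any=frozenset({"bulk_export", "elevated_scope"}),
    forbidden=frozenset({"dry_run"}),
)

during_exploration = rule.fires({"external_recipient", "dry_run"})
at_final_action = rule.fires({"final_action", "external_recipient", "bulk_export"})
final_but_dry_run = rule.fires(
    {"final_action", "external_recipient", "bulk_export", "dry_run"}
)
print(during_exploration, at_final_action, final_but_dry_run)  # False True False
```

A rule of this shape stays silent during probing and fires only on a specific final state-action combination, which matches the hidden-trigger behavior the benchmark targets.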
Original abstract

Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into structured environment variables that encode trust-formation dynamics and then score the risk of the final executable action from this trajectory-conditioned representation. Experiments show that prompt-centric heuristics, scalarized features, and zero-shot judges fail in this regime, whereas trajectory-aware final-action scoring yields strong in-domain discrimination and remains effective under balanced out-of-distribution transfer. Under GuardedJoint, VISTA-Guard reaches $84.2$ in-domain and $56.9$ on balanced out-of-distribution evaluation, while methods that optimize only one side of the safety--utility tradeoff collapse to zero. These findings support a broader view of agent security in black-box tool ecosystems: the decisive defense target is not local prompt text or tool descriptors alone, but the way trust is formed across the interaction trajectory and committed through the final action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies cognitive poisoning in LLM agents, where malicious tools build trust via plausible feedback during exploration and activate harm only on hidden-state alignment with the final action. It introduces TRUST-Bench (1,970 task-conditioned hidden-trigger episodes with matched safe controls), the asymmetric GuardedJoint metric, and VISTA-Guard, a backbone-agnostic framework that abstracts multi-step interactions into structured environment variables encoding trust-formation dynamics to score final-action risk. Experiments claim that trajectory-aware scoring yields strong in-domain discrimination (84.2 under GuardedJoint) and balanced OOD transfer (56.9), while prompt-centric heuristics, scalarized features, and zero-shot judges collapse to zero.

Significance. If the benchmark episodes faithfully capture trust-formation dynamics in black-box tool ecosystems, the results would establish that final-action risk scoring from trajectory representations is a more robust defense target than local prompt or tool-descriptor methods. The work provides concrete performance numbers and an asymmetric metric that penalizes safety-utility imbalances, which could shift evaluation practices for agent security.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (TRUST-Bench construction): the claim that the 1,970 episodes with hidden triggers and matched safe controls represent realistic malicious tool behaviors lacks any description of task sampling, trigger insertion rules, data exclusion criteria, or statistical significance testing of the reported gains; without these, the 84.2 / 56.9 GuardedJoint numbers cannot be assessed for robustness versus post-hoc benchmark choices.
  2. [§4] §4 (GuardedJoint metric and VISTA-Guard scoring): the metric and scoring function appear jointly tuned on the same TRUST-Bench episodes used for both in-domain and OOD evaluation; the reported discrimination advantage may therefore be partly an artifact of this joint design rather than an independent property of trajectory-aware final-action scoring.
  3. [§5] §5 (OOD transfer experiments): the balanced out-of-distribution results are presented without explicit definition of how the OOD episodes differ in trigger distribution or task conditioning from the in-domain set, making it impossible to determine whether the drop from 84.2 to 56.9 reflects genuine generalization or benchmark-specific symmetry.
minor comments (2)
  1. [§4] Notation for environment variables that encode trust-formation dynamics should be introduced with an explicit table or diagram in §4 to clarify how multi-step interactions are abstracted.
  2. [Abstract, §5] The abstract states that 'methods optimizing only one side of the safety-utility tradeoff collapse to zero' but does not name the exact baselines or their GuardedJoint scores in the main text; a consolidated table would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that improve transparency without altering the core claims or results of the work.

  1. Referee: [Abstract, §3] Abstract and §3 (TRUST-Bench construction): the claim that the 1,970 episodes with hidden triggers and matched safe controls represent realistic malicious tool behaviors lacks any description of task sampling, trigger insertion rules, data exclusion criteria, or statistical significance testing of the reported gains; without these, the 84.2 / 56.9 GuardedJoint numbers cannot be assessed for robustness versus post-hoc benchmark choices.

    Authors: We agree that the manuscript would benefit from greater detail on benchmark construction to support assessment of robustness. In the revision we will expand §3 with explicit descriptions of the task sampling procedure (including source benchmarks and selection criteria), the rules governing hidden-trigger insertion to model cognitive poisoning, the data exclusion criteria applied to maintain realism and avoid artifacts, and statistical significance testing (such as bootstrap confidence intervals) for the GuardedJoint scores. revision: yes

  2. Referee: [§4] §4 (GuardedJoint metric and VISTA-Guard scoring): the metric and scoring function appear jointly tuned on the same TRUST-Bench episodes used for both in-domain and OOD evaluation; the reported discrimination advantage may therefore be partly an artifact of this joint design rather than an independent property of trajectory-aware final-action scoring.

    Authors: We clarify that hyperparameter selection for the VISTA-Guard scoring function occurred on a held-out validation partition of the in-domain data, separate from the reported test episodes, while OOD episodes use distinct task conditionings and trigger distributions. To remove any ambiguity about joint design, the revised §4 will document the exact data splits, tuning protocol, and include an ablation with fixed or default parameters to confirm the trajectory-aware advantage is not an artifact of evaluation-set tuning. revision: partial

  3. Referee: [§5] §5 (OOD transfer experiments): the balanced out-of-distribution results are presented without explicit definition of how the OOD episodes differ in trigger distribution or task conditioning from the in-domain set, making it impossible to determine whether the drop from 84.2 to 56.9 reflects genuine generalization or benchmark-specific symmetry.

    Authors: We accept that the current description of OOD construction is insufficiently explicit. The revised §5 will add a dedicated subsection that quantifies the differences between in-domain and OOD episodes, including variations in trigger distributions, changes in task conditioning and domain coverage, and the balancing procedure, thereby enabling readers to evaluate whether the performance drop indicates genuine generalization. revision: yes
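
The first response promises bootstrap confidence intervals for the GuardedJoint scores. A percentile bootstrap over episodes could look like the sketch below. It is an illustration under assumptions: the per-episode record format `(is_malicious, executed)`, the reading of AMR and RNR as the percentages of executed malicious and rejected safe episodes, and the choice of the episode as the resampling unit are not the authors' stated protocol.

```python
import random
from typing import List, Tuple

Record = Tuple[bool, bool]  # (is_malicious, executed); hypothetical format


def guarded_joint(records: List[Record], rho: float = 1.5) -> float:
    """max(0, 100 - 2*AMR - rho*RNR), with AMR and RNR in percent (assumed meanings)."""
    mal = [executed for is_mal, executed in records if is_mal]
    safe = [executed for is_mal, executed in records if not is_mal]
    amr = 100.0 * sum(mal) / len(mal) if mal else 0.0
    rnr = 100.0 * sum(not e for e in safe) / len(safe) if safe else 0.0
    return max(0.0, 100.0 - 2.0 * amr - rho * rnr)


def bootstrap_ci(records: List[Record], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling episodes with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        guarded_joint([rng.choice(records) for _ in records])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Synthetic records: 100 malicious (10 executed) and 100 safe (10 rejected).
records = [(True, i < 10) for i in range(100)] + [(False, i >= 10) for i in range(100)]
point = guarded_joint(records)
lo, hi = bootstrap_ci(records)
print(point, lo, hi)
```

If matched malicious and safe episodes share a base task, resampling whole groups rather than single episodes would likely be the more conservative choice.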

Circularity Check

0 steps flagged

No significant circularity detected in derivation or evaluation chain.


The paper constructs TRUST-Bench (1,970 episodes), defines the GuardedJoint metric, and introduces VISTA-Guard as a trajectory-aware scoring framework, then reports empirical performance numbers (84.2 in-domain, 56.9 balanced OOD) on that benchmark. This is a standard benchmark-plus-method paper structure with no equations or definitions that reduce a claimed result to its own inputs by construction. No self-citations are load-bearing for the central claims, no fitted parameters are relabeled as predictions, and the OOD transfer results supply an independent check. The derivation chain consists of explicit construction followed by measurement and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not list explicit free parameters or axioms; the approach implicitly assumes that tool feedback can be selectively malicious and that environment-variable abstraction captures trust dynamics, but these are not formalized as background results.

pith-pipeline@v0.9.0 · 5844 in / 1197 out tokens · 32803 ms · 2026-05-19T23:23:36.932928+00:00 · methodology



Appendix excerpts

Data and code. We release our data and code: https://github.com/idwts/TRUST-BENCH

Episode construction steps:

  • Select a compromise pivot, such as scope, recipient identity, audit path, schema field, reusable state, or side effect.
  • Keep the exploration trajectory behaviorally close to the benign tool so that trust can accumulate during probing.
  • Modify only the trigger-bearing final state-action composition so that the malicious variant becomes harmful when the hidden rule fires.
  • Construct a matched safe control that keeps the same task objective and surface interaction pattern but restores the missing review, audit, or safe-mode constraint.

Candidate episodes produced by this process are retained only after a two-person audit. The audit statistics reported in Table 2 are computed on the initial candidate version of the data, befo...

Data curation and evaluation protocol:

  • Sample-level red-blue data construction on the reconstructed benchmark, in which the blue side performs a standardized three-round exploratory interaction and the resulting trajectory is curated into a labeled episode.
  • Grouped 5-fold training and evaluation on the resulting curated dataset. Each curated episode contains three observed exploratory tool interactions, one proposed final tool call, and one execute/reject label. Grouped splitting keeps related rows from the same base group in the same fold, and threshold calibration is performed on training folds only. To ev...
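
The grouped 5-fold protocol with training-only threshold calibration can be sketched as follows. This is a stdlib-only illustration: the round-robin fold assignment, the row format `(group_id, risk_score, is_malicious)`, and the use of the Figure 2 formula as the calibration objective are assumptions, since the excerpt is truncated before it states them.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, bool]  # (group_id, risk_score, is_malicious); hypothetical


def grouped_folds(rows: List[Row], k: int = 5) -> List[List[Row]]:
    """Assign whole groups to folds so related rows never straddle a split."""
    fold_of: Dict[str, int] = {}
    for group_id, _, _ in rows:
        if group_id not in fold_of:
            fold_of[group_id] = len(fold_of) % k
    folds: List[List[Row]] = [[] for _ in range(k)]
    for row in rows:
        folds[fold_of[row[0]]].append(row)
    return folds


def score(rows: List[Row], threshold: float, rho: float = 1.5) -> float:
    """Reject when risk_score >= threshold; return max(0, 100 - 2*AMR - rho*RNR)."""
    mal = [s for _, s, m in rows if m]
    safe = [s for _, s, m in rows if not m]
    amr = 100.0 * sum(s < threshold for s in mal) / len(mal) if mal else 0.0
    rnr = 100.0 * sum(s >= threshold for s in safe) / len(safe) if safe else 0.0
    return max(0.0, 100.0 - 2.0 * amr - rho * rnr)


def cross_validate(rows: List[Row], k: int = 5) -> List[float]:
    folds = grouped_folds(rows, k)
    results = []
    for i, test_rows in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        # Calibrate on training folds only; candidates are the observed scores.
        candidates = sorted({s for _, s, _ in train})
        best = max(candidates, key=lambda t: score(train, t))
        results.append(score(test_rows, best))
    return results


# Synthetic, perfectly separable matched pairs: one malicious, one safe per group.
rows: List[Row] = []
for g in range(10):
    rows.append((f"g{g}", 0.9, True))
    rows.append((f"g{g}", 0.1, False))

fold_scores = cross_validate(rows)
print(fold_scores)  # [100.0, 100.0, 100.0, 100.0, 100.0]
```

Keeping each matched pair inside a single fold prevents the test fold from containing the counterpart of a training episode, which is the leakage that grouped splitting is meant to avoid.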