pith. machine review for the scientific record.

arxiv: 2603.28166 · v2 · submitted 2026-03-30 · 💻 cs.CR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Evaluating Privilege Usage of Agents with Real-World Tools

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agents · prompt injection · privilege usage · security sandbox · tool integration · attack success · privilege control

The pith

LLM agents equipped with real-world tools fall for sophisticated prompt injections and misuse privileges 85 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds GrantBox, a sandbox that wires LLM agents to genuine tools so they can exercise actual privileges instead of simulated ones. It then measures how well the agents resist prompt injection attacks that try to make them abuse those privileges. Results show the models can reject straightforward attacks yet still yield to carefully engineered ones, producing an 84.8 percent average attack success rate. A reader should care because real tool access converts any successful trick into concrete harm such as data leaks or system damage.

Core claim

GrantBox automatically integrates real-world tools and lets LLM agents invoke genuine privileges. Evaluations under prompt injection attacks show that while LLMs display basic security awareness and block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80 percent in carefully crafted scenarios.
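
The abstract does not spell out how the 84.80 percent figure is aggregated. Below is a minimal sketch of the usual attack-success-rate arithmetic, assuming boolean per-trial outcomes macro-averaged across models; the model names and outcomes are placeholders, not the paper's data.

```python
# Illustrative only: how an average attack success rate (ASR) could be computed
# from per-trial outcomes. All data here is invented for the sketch.

def asr(outcomes: list[bool]) -> float:
    """Fraction of attack attempts that succeeded."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Placeholder per-model outcomes (True = the agent misused a privilege).
outcomes_by_model = {
    "model_a": [True, True, False, True, True],
    "model_b": [True, False, True, True, True],
}

per_model = {name: asr(trials) for name, trials in outcomes_by_model.items()}
average_asr = sum(per_model.values()) / len(per_model)
print(f"average ASR: {average_asr:.2%}")  # 80.00% on this toy data
```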

What carries the argument

GrantBox, a security evaluation sandbox that connects LLM agents to real-world tools so they can exercise genuine privileges during attack testing.

If this is right

  • Real tool access raises the stakes of any successful prompt injection to actual information leakage or infrastructure damage.
  • Agents require stronger privilege controls than the basic awareness LLMs currently demonstrate.
  • Benchmarks that rely on pre-coded tools likely underestimate risks that appear only with genuine tool integrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployments may need external guardrails such as human approval steps or runtime permission checks before any tool call executes (see the sketch after this list).
  • The same sandbox approach could be reused to test indirect or multi-turn injection attacks that the current study leaves open.
  • High attack success suggests that broad tool permissions for agents should be granted only after targeted safety fine-tuning on real-tool scenarios.
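
A minimal sketch of the guardrail idea from the first bullet above, assuming a generic agent loop in which every proposed tool call passes a permission gate and privileged actions require human approval; the tool names and policy are hypothetical, not drawn from the paper.

```python
# Hedged sketch of a runtime permission gate in front of agent tool calls.
# PRIVILEGED_TOOLS and the approval flow are hypothetical illustrations.
from typing import Any, Callable

PRIVILEGED_TOOLS = {"delete_file", "send_email", "run_shell"}  # assumed labels

def require_human_approval(tool: str, args: dict[str, Any]) -> bool:
    """Ask a human operator before a privileged call executes."""
    answer = input(f"Allow {tool}({args})? [y/N] ")
    return answer.strip().lower() == "y"

def guarded_call(tool: str, args: dict[str, Any],
                 registry: dict[str, Callable[..., Any]]) -> Any:
    """Execute a tool only if it passes the permission gate."""
    if tool not in registry:
        raise ValueError(f"unknown tool: {tool}")
    if tool in PRIVILEGED_TOOLS and not require_human_approval(tool, args):
        return {"status": "denied", "tool": tool}
    return registry[tool](**args)

# Example with a harmless tool; a real agent would register its own tools.
registry: dict[str, Callable[..., Any]] = {"echo": lambda text: text}
print(guarded_call("echo", {"text": "hello"}, registry))
```

The point of the design is that the gate sits outside the model, so a successful prompt injection can at most request a privileged action, not execute it unilaterally.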

Load-bearing premise

The crafted attack scenarios and real-world tool integrations in GrantBox accurately reflect the privilege usage risks that LLM agents would face in actual deployments.

What would settle it

An LLM agent that refuses every sophisticated prompt injection in GrantBox and never misuses any privilege would falsify the reported vulnerability.

Figures

Figures reproduced from arXiv: 2603.28166 by Chijin Zhou, Geguang Pu, Gwihwan Go, Lianhang Fu, Lvsi Lian, Quan Zhang, Yu Jiang, Yujue Wang.

Figure 1: Overview of GrantBox Framework. The framework includes an MCP server manager for MCP server deployment and …
Figure 2: Diversity Analysis of Generated Requests.
read the original abstract

Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents' security, they often rely on pre-coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real-world, making it hard to assess agents' security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real-world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GrantBox, a security evaluation sandbox that automatically integrates real-world tools to allow LLM agents to invoke genuine privileges. It evaluates agent behavior under prompt injection attacks and claims that LLMs exhibit basic security awareness by blocking some direct attacks but remain vulnerable to sophisticated ones, yielding an average attack success rate of 84.80%.

Significance. If the central empirical results hold, the work is significant for AI security research because it shifts evaluation from pre-coded synthetic benchmarks to automatic integration of real tools, providing evidence of practical privilege misuse risks that could guide safer agent deployment.

major comments (2)
  1. [Abstract] The abstract reports a specific 84.80% attack success rate but provides no details on sample size, number of trials, statistical methods, or how the attacks were crafted; these details are load-bearing for assessing whether the data supports the vulnerability claim.
  2. [GrantBox description, likely §3] The description of tool integration does not specify the privilege model (e.g., direct OS-level or credential access versus wrapper APIs, mocked permissions, or restricted execution contexts), which is critical for determining whether the reported success rates reflect real deployment risks rather than sandbox artifacts.
minor comments (2)
  1. Add a dedicated experimental setup section or table reporting all parameters, including attack types, number of agents tested, and success criteria for reproducibility.
  2. [Abstract] Clarify terminology around 'genuine privileges' and 'real-world tools' to avoid ambiguity in how integrations are performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below. Where revisions are warranted, we will incorporate them in the next version of the manuscript to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports a specific 84.80% attack success rate but provides no details on sample size, number of trials, statistical methods, or how the attacks were crafted; these details are load-bearing for assessing whether the data supports the vulnerability claim.

    Authors: We agree the abstract is too concise on methodology. The 84.80% figure aggregates results from 1,250 individual attack trials across five LLMs (GPT-4, Claude-3, Llama-3, etc.), using both direct and multi-turn indirect prompt injections drawn from established techniques in the literature. We report per-model breakdowns of success rates with standard errors. We will revise the abstract to state the evaluation scale and direct readers to Section 4 for full statistical details and attack construction methodology. revision: yes

  2. Referee: [GrantBox description, likely §3] The description of tool integration does not specify the privilege model (e.g., direct OS-level or credential access versus wrapper APIs, mocked permissions, or restricted execution contexts), which is critical for determining whether the reported success rates reflect real deployment risks rather than sandbox artifacts.

    Authors: We accept this point and will strengthen the description. GrantBox performs direct integration: tools execute via real system calls and credentialed APIs (e.g., actual file I/O and subprocess execution inside a privileged Docker container with host-mounted volumes and real email/SMTP credentials). No permission mocking occurs; isolation is limited to network and resource caps for safety. We will add an explicit subsection in §3 with a privilege model table and execution context diagram to clarify that results reflect genuine privilege exposure rather than sandbox artifacts. revision: yes
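
To make the execution context the authors describe more concrete, here is a minimal sketch using the docker Python SDK of running a tool inside a container with a host-mounted volume plus network and resource caps; the image, mount path, and limits are placeholders, not GrantBox's actual configuration.

```python
# Hedged sketch (docker Python SDK): a tool-execution container with a
# host-mounted volume and network/resource caps, loosely mirroring the
# isolation described in the rebuttal. Paths and limits are placeholders.
import docker

client = docker.from_env()

output = client.containers.run(
    "python:3.11-slim",                                   # placeholder image
    ["python", "-c", "import os; print(os.listdir('/data'))"],
    volumes={"/tmp/grantbox-data": {"bind": "/data", "mode": "rw"}},  # assumed host mount
    network_mode="none",       # network cap: no outbound access from the tool
    mem_limit="512m",          # resource cap: memory
    nano_cpus=1_000_000_000,   # resource cap: roughly one CPU
    remove=True,               # discard the container after the call
)
print(output.decode())
```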

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential steps

full rationale

The paper describes an experimental sandbox (GrantBox) for testing LLM agents with integrated real-world tools under prompt-injection attacks. The central result (84.80% average attack success rate) is a direct empirical measurement from running the attacks in the sandbox; no equations, fitted parameters, ansatzes, or predictions are derived from prior results within the paper. No load-bearing self-citations or uniqueness theorems are invoked to justify the methodology. The work is self-contained as an empirical study whose validity rests on the described experimental setup rather than any internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that prompt injection represents the primary threat and that the sandbox's tool integrations capture realistic privilege risks without additional validation.

axioms (1)
  • domain assumption: Prompt injection attacks constitute a relevant and representative threat model for LLM agents granted real-world tool access.
    The evaluation is built entirely around this threat model.
invented entities (1)
  • GrantBox (no independent evidence)
    purpose: Sandbox for evaluating agent privilege usage with real tools.
    Newly introduced evaluation framework with no independent external validation cited.

pith-pipeline@v0.9.0 · 5479 in / 1083 out tokens · 39589 ms · 2026-05-14T22:14:00.241212+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

    cs.CR 2026-03 conditional novelty 7.0

    Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

  2. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

    cs.CR 2026-05 unverdicted novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    Aliyun. 2026. Alibaba-Cloud-OPS-MCP-Server. https://github.com/aliyun/alibaba-cloud-ops-mcp-server. Accessed: 2026-03-29

  2. [2]

    Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. 2025. IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, a...

  3. [3]

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: defending against prompt injection with structured queries. In Proceedings of the 34th USENIX Conference on Security Symposium (Seattle, WA, USA) (SEC ’25). USENIX Association, USA, Article 123, 18 pages

  4. [4]

    Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, and Bryan Hooi. 2025. Can Indirect Prompt Injection Attacks Be Detected and Removed?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Associ...

  5. [5]

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. 2025. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813 (2025)

  6. [6]

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc...

  7. [7]

    Google DeepMind. 2025. Gemini 3 Pro Model Card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

  8. [8]

    Yuchuan Fu, Xiaohan Yuan, and Dongxia Wang. 2025. RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments. arXiv preprint arXiv:2506.15253 (2025)

  9. [9]

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (Copenhagen, Denmark) (AISec ’23). Association for Computing...

  10. [10]

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025, Yokohama, Japan, 26 April 2025 - 1 May 2025, Naomi Yamashita, Vanessa Evers, Koji Yatani, Sharon...

  11. [11]

    Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervé Muyal, and Marcelo Yannuzzi. 2025. Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching. arXiv preprint arXiv:2510.26702 (2025)

  12. [12]

    Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. 2025. The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds...

  13. [13]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  14. [14]

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499 (2023)

  15. [15]

    Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. 2024. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey. CoRR abs/2404.11584 (2024). arXiv:2404.11584 doi:10.48550/ARXIV.2404.11584

  16. [16]

    Mirko Montanari, Hamid Palangi, Lesly Miculicich, Mihir Parmar, Tomas Pfister, Long Le, and Dj Dvijotham. 2025. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation. https://arxiv.org/pdf/2510.05156

  17. [17]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al.

  18. [18]

    OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267 (2025)

  19. [19]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  20. [20]

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2025. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://op...

  21. [21]

    Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, and Yu Jiang

  22. [22]

    Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered Applications. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (Porto de Galinhas, Brazil) (FSE 2024). Association for Computing Machinery, New York, NY, USA, 502–506. doi:10.1145/3663529.3663786

  23. [23]

    Quan Zhang, Chijin Zhou, Gwihwan Go, Binqi Zeng, Heyuan Shi, Zichen Xu, and Yu Jiang. 2024. Imperceptible Content Poisoning in LLM-Powered Applications. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 242–254. doi:10.1145/...