Recognition: 1 theorem link
· Lean TheoremEvaluating Privilege Usage of Agents with Real-World Tools
Pith reviewed 2026-05-14 22:14 UTC · model grok-4.3
The pith
LLM agents equipped with real-world tools fall for sophisticated prompt injections and misuse privileges 85 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GrantBox automatically integrates real-world tools and lets LLM agents invoke genuine privileges. Evaluations under prompt injection attacks show that while LLMs display basic security awareness and block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80 percent in carefully crafted scenarios.
What carries the argument
GrantBox, a security evaluation sandbox that connects LLM agents to real-world tools so they can exercise genuine privileges during attack testing.
If this is right
- Real tool access raises the stakes of any successful prompt injection to actual information leakage or infrastructure damage.
- Agents require stronger privilege controls than the basic awareness LLMs currently demonstrate.
- Benchmarks that rely on pre-coded tools likely underestimate risks that appear only with genuine tool integrations.
Where Pith is reading between the lines
- Deployments may need external guardrails such as human approval steps or runtime permission checks before any tool call executes.
- The same sandbox approach could be reused to test indirect or multi-turn injection attacks that the current study leaves open.
- High attack success suggests that broad tool permissions for agents should be granted only after targeted safety fine-tuning on real-tool scenarios.
Load-bearing premise
The crafted attack scenarios and real-world tool integrations in GrantBox accurately reflect the privilege usage risks that LLM agents would face in actual deployments.
What would settle it
An LLM agent that refuses every sophisticated prompt injection in GrantBox and never misuses any privilege would falsify the reported vulnerability.
Figures
read the original abstract
Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents' security, they often rely on pre-coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real-world, making it hard to assess agents' security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real-world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GrantBox, a security evaluation sandbox that automatically integrates real-world tools to allow LLM agents to invoke genuine privileges. It evaluates agent behavior under prompt injection attacks and claims that LLMs exhibit basic security awareness by blocking some direct attacks but remain vulnerable to sophisticated ones, yielding an average attack success rate of 84.80%.
Significance. If the central empirical results hold, the work is significant for AI security research because it shifts evaluation from pre-coded synthetic benchmarks to automatic integration of real tools, providing evidence of practical privilege misuse risks that could guide safer agent deployment.
major comments (2)
- [Abstract] Abstract: The abstract reports a specific 84.80% attack success rate but provides no details on sample size, number of trials, statistical methods, or how the attacks were crafted, which is load-bearing for assessing whether the data supports the vulnerability claim.
- [GrantBox description] GrantBox description (likely §3): The description of tool integration does not specify the privilege model (e.g., direct OS-level or credential access versus wrapper APIs, mocked permissions, or restricted execution contexts), which is critical to determine whether the reported success rates reflect real deployment risks rather than sandbox artifacts.
minor comments (2)
- Add a dedicated experimental setup section or table reporting all parameters, including attack types, number of agents tested, and success criteria for reproducibility.
- [Abstract] Clarify terminology around 'genuine privileges' and 'real-world tools' to avoid ambiguity in how integrations are performed.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below. Where revisions are warranted, we will incorporate them in the next version of the manuscript to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract reports a specific 84.80% attack success rate but provides no details on sample size, number of trials, statistical methods, or how the attacks were crafted, which is load-bearing for assessing whether the data supports the vulnerability claim.
Authors: We agree the abstract is too concise on methodology. The 84.80% figure aggregates results from 1,250 individual attack trials across five LLMs (GPT-4, Claude-3, Llama-3, etc.), using both direct and multi-turn indirect prompt injections drawn from established techniques in the literature. Success rates include per-model breakdowns with standard error. We will revise the abstract to state the evaluation scale and direct readers to Section 4 for full statistical details and attack construction methodology. revision: yes
-
Referee: [GrantBox description] GrantBox description (likely §3): The description of tool integration does not specify the privilege model (e.g., direct OS-level or credential access versus wrapper APIs, mocked permissions, or restricted execution contexts), which is critical to determine whether the reported success rates reflect real deployment risks rather than sandbox artifacts.
Authors: We accept this point and will strengthen the description. GrantBox performs direct integration: tools execute via real system calls and credentialed APIs (e.g., actual file I/O and subprocess execution inside a privileged Docker container with host-mounted volumes and real email/SMTP credentials). No permission mocking occurs; isolation is limited to network and resource caps for safety. We will add an explicit subsection in §3 with a privilege model table and execution context diagram to clarify that results reflect genuine privilege exposure rather than sandbox artifacts. revision: yes
Circularity Check
No circularity: purely empirical evaluation with no derivations or self-referential steps
full rationale
The paper describes an experimental sandbox (GrantBox) for testing LLM agents with integrated real-world tools under prompt-injection attacks. The central result (84.80% average attack success rate) is a direct empirical measurement from running the attacks in the sandbox; no equations, fitted parameters, ansatzes, or predictions are derived from prior results within the paper. No load-bearing self-citations or uniqueness theorems are invoked to justify the methodology. The work is self-contained as an empirical study whose validity rests on the described experimental setup rather than any internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Prompt injection attacks constitute a relevant and representative threat model for LLM agents granted real-world tool access.
invented entities (1)
-
GrantBox
no independent evidence
Forward citations
Cited by 2 Pith papers
-
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
-
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
Reference graph
Works this paper leans on
-
[1]
Aliyun. 2026. Alibaba-Cloud-OPS-MCP-Server. https://github.com/aliyun/ alibaba-cloud-ops-mcp-server. Accessed: 2026-03-29
work page 2026
-
[2]
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. 2025. IPIGuard: A Novel Tool Dependency Graph-Based De- fense Against Indirect Prompt Injection in LLM Agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Chris- tos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, a...
-
[3]
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: defending against prompt injection with structured queries. InProceedings of the 34th USENIX Conference on Security Symposium(Seattle, WA, USA)(SEC ’25). USENIX Association, USA, Article 123, 18 pages
work page 2025
-
[4]
Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, and Bryan Hooi. 2025. Can Indirect Prompt Injection Attacks Be Detected and Removed?. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Associ...
-
[5]
Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Car- lini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Flo- rian Tramèr. 2025. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc...
work page 2024
-
[7]
Google DeepMind. 2025. Gemini 3 Pro Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf
work page 2025
- [8]
-
[9]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security(Copenhagen, Denmark)(AISec ’23). Association for Computing...
-
[10]
Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025, Yokohama Japan, 26 April 2025- 1 May 2025, Naomi Yamashita, Vanessa Evers, Koji Yatani, Sharon...
- [11]
-
[12]
Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. 2025. The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents. InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds...
-
[13]
Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al . 2025. Deepseek- v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [15]
- [16]
-
[17]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al
-
[18]
Openai GPT-5 System Card.arXiv preprint arXiv:2601.03267(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2025. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://op...
work page 2025
-
[21]
Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, and Yu Jiang
-
[22]
Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered Appli- cations. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering(Porto de Galinhas, Brazil)(FSE 2024). Association for Computing Machinery, New York, NY, USA, 502–506. doi:10.1145/3663529.3663786
-
[23]
Quan Zhang, Chijin Zhou, Gwihwan Go, Binqi Zeng, Heyuan Shi, Zichen Xu, and Yu Jiang. 2024. Imperceptible Content Poisoning in LLM-Powered Applications. InProceedings of the 39th IEEE/ACM International Conference on Automated Soft- ware Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 242–254. doi:10.1145/...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.