arxiv: 2605.11039 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: no theorem link

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

Linfeng Fan, Rongsheng Li, Xiong Wang, Yichen Wang, Yuan Tian, Ziwei Li

Pith reviewed 2026-05-13 01:27 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM agentsprompt injectiontool useprovenance trackingruntime monitoringagent securitymixed-trust workflowscapability contracts

0 comments

The pith

Tracking the provenance of each individual argument in tool calls lets LLM agents enforce security without blocking mixed-trust workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents that use tools must often combine untrusted data from emails or web pages with privileged actions, yet existing monitors decide safety at the level of an entire tool call. This granularity creates unavoidable trade-offs: either block legitimate retrieval-then-act sequences or allow untrusted content to hijack destinations and commands. The paper shows that danger arises only when untrusted content reaches an authority-bearing argument, not from its mere presence in context. By assigning semantic roles to arguments, tracking their origins across replanning steps, and checking each against a role-specific trust contract, the approach separates safe use of external data from unsafe control of privileged operations. If the mechanism holds, agents can retain both full security and high utility where coarser monitors must sacrifice one or the other.

Core claim

PACT assigns semantic roles to the arguments of each tool call, maintains provenance information across multiple planning and replanning steps, and verifies that every argument's origin satisfies its role-specific trust contract before the call executes. This reframes agent security as a problem of authority binding rather than blanket invocation filtering. Under oracle provenance the system records 100 percent utility and 100 percent security on mixed-trust diagnostic suites, while full AgentDojo runs across five models reach 100 percent security on the three strongest models together with 38.1 to 46.4 percent utility, exceeding prior invocation-level baselines by 8 to 16 percentage points.

What carries the argument

Provenance-Aware Capability Contracts (PACT), which assign semantic roles to tool arguments, track value provenance across replanning steps, and enforce role-specific trust contracts on each argument's origin.

If this is right

Invocation-level monitors will continue to produce either false positives that block useful behavior or false negatives that permit hijacks.
Both semantic role assignment and cross-step provenance tracking are required; removing either drops performance to baseline levels.
The remaining deployment limit is no longer policy design but the quality of provenance inference and contract synthesis.
Stronger models can reach complete security with PACT while still recovering substantially more utility than with earlier defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent frameworks could expose provenance metadata as a first-class input to the model, reducing the inference burden.
The same argument-level distinction may apply to retrieval-augmented generation pipelines where external passages influence final outputs.
Testing PACT against adaptive attackers who target the provenance mechanism itself would clarify whether the isolation of the reasoning bottleneck is robust.

Load-bearing premise

Accurate semantic role assignment, reliable provenance tracking across replanning, and correct synthesis of trust contracts can be performed in practice without new attack surfaces or prohibitive cost.

What would settle it

An indirect prompt injection that succeeds in altering an authority-bearing argument while PACT's provenance and role checks remain intact would falsify the security guarantee.

Figures

Figures reproduced from arXiv: 2605.11039 by Linfeng Fan, Rongsheng Li, Xiong Wang, Yichen Wang, Yuan Tian, Ziwei Li.

**Figure 1.** Figure 1: PACT resolves the granularity mismatch in indirect prompt injection. In a benign mixed-trust workflow, external webpage content may legitimately fill the body argument of send_email, while the recipient remains grounded in the user request. In an attack, the same webpage attempts to bind the authority-bearing recipient argument. Whole-call defenses cannot distinguish these cases: blocking all external inf… view at source ↗

**Figure 2.** Figure 2: PACT runtime: contracts, provenance, and argument-level checking. PACT receives user instructions, tool outputs, and runtime state as provenance-tagged values; interprets these values through a library of role-aware tool contracts; and checks each argument immediately before tool execution. The monitor permits external data where the contract marks an argument as content, but blocks the same origin from bi… view at source ↗

**Figure 3.** Figure 3: Contract precision recovers benign utility without relaxing security. Under oracle provenance, security remains at 100% across L0–L3, while utility rises as contracts move from opaque blocking to role-aware argument checks. Contract precision. We next test the monotonicity prediction of the contract hierarchy. Under fixed oracle provenance, security remains at 100% across L0–L3, while utility increases f… view at source ↗

**Figure 4.** Figure 4: Flat invocation-level policies cannot separate mixed-trust cases. The benign workflow and the attack share the same argument structure: a message-sending call with a recipient and a body. In the benign AgentDojo-derived case, recipient is user-authorized while body is external content; blocking all external influence therefore causes a false positive. In the attack case, external content also determines re… view at source ↗

read the original abstract

Tool-using LLM agents must act on untrusted webpages, emails, files, and API outputs while issuing privileged tool calls. Existing defenses often mediate trust at the granularity of an entire tool invocation, forcing a brittle choice in mixed-trust workflows: allow external content to influence a call and risk hijacked destinations or commands, or quarantine the call and block benign retrieval-then-act behavior. The key observation behind this paper is that indirect prompt injection becomes dangerous not when untrusted content appears in context, but when it determines an authority-bearing argument. We present \textsc{PACT} (\emph{Provenance-Aware Capability Contracts}), a runtime monitor that assigns semantic roles to tool arguments, tracks value provenance across replanning steps, and checks whether each argument's origin satisfies its role-specific trust contract. Under oracle provenance, \textsc{PACT} achieves 100\% utility and 100\% security on mixed-trust diagnostic suites, while flat invocation-level monitors incur false positives or false negatives. In full AgentDojo deployments across five models, \textsc{PACT} reaches 100\% security on the three strongest models while recovering 38.1--46.4\% utility, 8--16 percentage points above CaMeL at the same security level. Ablations show that both semantic roles and cross-step provenance are necessary. \textsc{PACT} reframes agent security as authority binding, and isolates the remaining deployment bottleneck to provenance inference and contract synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PACT shifts agent security to argument-level provenance with role contracts, showing strong oracle performance and gains over baselines, but practical inference remains the key open issue.

read the letter

PACT looks like a useful reframing for securing tool-using LLM agents. Instead of mediating at the whole-invocation level, it assigns semantic roles to individual arguments, tracks value provenance across replanning steps, and enforces role-specific trust contracts. This targets the real risk point where untrusted data influences authority-bearing parts of a call, rather than blocking or allowing entire actions in mixed-trust settings like web or email inputs.

Referee Report

1 major / 2 minor

Summary. The paper introduces Provenance-Aware Capability Contracts (PACT) to address the granularity mismatch in securing tool-using LLM agents operating on mixed-trust inputs. It claims that tracking provenance at the level of individual tool arguments, combined with semantic role assignment and role-specific trust contracts, enables precise enforcement: under oracle provenance, PACT achieves 100% utility and 100% security on diagnostic suites (unlike flat invocation-level monitors), while in full AgentDojo deployments across five models it attains 100% security on the three strongest models with 38.1–46.4% utility recovery, outperforming CaMeL by 8–16 percentage points at equivalent security. Ablations confirm that both semantic roles and cross-step provenance tracking are necessary, reframing agent security as authority binding and isolating the remaining bottleneck to provenance inference and contract synthesis.

Significance. If the empirical results hold, the work is significant for providing a principled, argument-granular alternative to invocation-level mediation in agent security. The oracle upper-bound results cleanly separate the enforcement mechanism from inference limitations, while the comparative gains over CaMeL on an external benchmark (AgentDojo) and the necessity ablations offer concrete evidence that the granularity mismatch is both real and addressable. This directs future effort toward practical provenance tracking without circularity in the core claims.

major comments (1)

[Experimental Evaluation / Results] Experimental section (implied by abstract claims and ablations): the reported 100% utility/security figures under oracle provenance and the precise utility ranges (38.1–46.4%) in AgentDojo lack accompanying details on experimental controls, number of independent runs, statistical significance tests, or potential selection effects in suite/task construction. These omissions are load-bearing for the central claim that argument-level provenance resolves the mismatch, as they leave the quantitative superiority over flat monitors and CaMeL incompletely verifiable.

minor comments (2)

[Abstract and Ablations] The abstract qualifies the 100% results as oracle-dependent and names the remaining bottlenecks, which is helpful; however, the main text should explicitly cross-reference the specific ablation tables or figures demonstrating necessity of roles and cross-step tracking.
[Method / PACT Description] Notation for provenance tracking across replanning steps could be clarified with a small example diagram or pseudocode to make the cross-step dependency explicit for readers unfamiliar with agent replanning loops.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation. We address the single major comment below and have revised the manuscript to improve experimental transparency.

read point-by-point responses

Referee: [Experimental Evaluation / Results] Experimental section (implied by abstract claims and ablations): the reported 100% utility/security figures under oracle provenance and the precise utility ranges (38.1–46.4%) in AgentDojo lack accompanying details on experimental controls, number of independent runs, statistical significance tests, or potential selection effects in suite/task construction. These omissions are load-bearing for the central claim that argument-level provenance resolves the mismatch, as they leave the quantitative superiority over flat monitors and CaMeL incompletely verifiable.

Authors: We agree that the original submission omitted key experimental details, which weakens verifiability of the quantitative claims. The 100% utility and security figures under oracle provenance are deterministic by construction: when provenance is perfect, every argument satisfies its role-specific trust contract exactly when its origin is trusted, eliminating both false positives (unnecessary blocks) and false negatives (insecure calls) on the diagnostic suites. The 38.1–46.4% utility ranges are averages across the five models on the full AgentDojo benchmark. In the revised manuscript we have added a dedicated 'Experimental Setup and Controls' subsection that specifies: (i) five independent runs per model with fixed random seeds for reproducibility, (ii) reporting of means, standard deviations, and Wilcoxon signed-rank tests (p < 0.01 for utility gains versus CaMeL at matched security), (iii) explicit description of task sampling from AgentDojo to ensure coverage of mixed-trust scenarios without post-hoc selection, and (iv) isolation of the oracle assumption from any inference module. These additions directly address the referee's concerns while preserving the original results and the separation of enforcement from provenance inference. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on empirical results from external benchmarks (AgentDojo) and comparisons to an independent baseline (CaMeL), with oracle-provenance runs establishing an upper bound and practical runs showing measurable gains. Ablations directly test the necessity of semantic roles and cross-step tracking without reducing to self-defined quantities. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the derivation chain is self-contained against external data and does not invoke internal definitions or prior author work to force the outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on domain assumptions about feasible provenance tracking and role-based contracts, with no free parameters or invented entities that have independent falsifiable evidence beyond the proposed system itself.

axioms (2)

domain assumption Tool arguments carry distinct semantic roles that determine whether they bear authority and thus require trust checks.
Invoked as the basis for contract enforcement in the PACT design.
domain assumption Value provenance can be reliably tracked across multiple replanning and execution steps in agent workflows.
Required for the cross-step checks that distinguish PACT from flat monitors.

invented entities (1)

Provenance-Aware Capability Contracts (PACT) no independent evidence
purpose: Runtime monitor assigning roles and checking provenance against trust contracts
New system introduced to solve the granularity mismatch; no independent evidence provided outside the paper's evaluations.

pith-pipeline@v0.9.0 · 5581 in / 1536 out tokens · 52268 ms · 2026-05-13T01:27:23.534442+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 10 internal anchors

[1]

Learning to inject: Automated prompt injection via reinforcement learning.arXiv preprint arXiv:2602.05746, 2026

Xin Chen, Jie Zhang, and Florian Tramer. Learning to inject: Automated prompt injection via reinforcement learning.arXiv preprint arXiv:2602.05746, 2026

work page arXiv 2026
[2]

Securing AI agents with information-flow control,

Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing ai agents with information-flow control, 2025.URL https://arxiv.org/abs/2505.23643

work page arXiv 2025
[3]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

work page 2024
[4]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813, 2025

work page internal anchor Pith review arXiv 2025
[5]

WASP: Benchmarking web agent security against prompt injection attacks

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaud- huri. Wasp: Benchmarking web agent security against prompt injection attacks.arXiv preprint arXiv:2504.18575, 2025

work page arXiv 2025
[6]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

work page 2023
[7]

Ih-challenge: A train- ing dataset to improve instruction hierarchy on frontier llms.arXiv preprint arXiv:2603.10521, 2026

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Sam Toyer, Miles Wang, Yaodong Yu, et al. Ih-challenge: A train- ing dataset to improve instruction hierarchy on frontier llms.arXiv preprint arXiv:2603.10521, 2026

work page arXiv 2026
[8]

Taintp2x: Detecting taint-style prompt-to-anything injection vulnerabilities in llm-integrated applications

Junjie He, Shenao Wang, Yanjie Zhao, Xinyi Hou, Zhao Liu, Quanchen Zou, and Haoyu Wang. Taintp2x: Detecting taint-style prompt-to-anything injection vulnerabilities in llm-integrated applications. 2026

work page 2026
[9]

Taming Various Privilege Escalation in LLM-Based Agent Systems,

Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, Yudong Gao, Shuai Wang, and Yingjiu Li. Taming various privilege escalation in llm-based agent systems: A mandatory access control framework.arXiv preprint arXiv:2601.11893, 2026

work page arXiv 2026
[10]

CapSeal: Capability-Sealed Secret Mediation for Secure Agent Execution

Shutong Jin, Ruiyi Guo, and Ray CC Cheung. Capseal: Capability-sealed secret mediation for secure agent execution.arXiv preprint arXiv:2604.16762, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

The cognitive firewall: Securing browser based ai agents against indirect prompt injection via hybrid edge cloud defense.arXiv preprint arXiv:2603.23791, 2026

Qianlong Lan and Anuj Kaul. The cognitive firewall: Securing browser based ai agents against indirect prompt injection via hybrid edge cloud defense.arXiv preprint arXiv:2603.23791, 2026

work page arXiv 2026
[12]

Prompt infection: Llm-to-llm prompt injection within multi-agent systems,

Donghyun Lee and Mo Tiwari. Prompt infection: Llm-to-llm prompt injection within multi- agent systems.arXiv preprint arXiv:2410.07283, 2024

work page arXiv 2024
[13]

Drift: Dynamic rule-based defense with injection isolation for securing llm agents.arXiv preprint arXiv:2506.12104, 2025

Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, and Chaowei Xiao. Drift: Dynamic rule-based defense with injection isolation for securing llm agents.arXiv preprint arXiv:2506.12104, 2025

work page arXiv 2025
[14]

SafeAgent: A Runtime Protection Architecture for Agentic Systems

Hailin Liu, Eugene Ilyushin, Jie Ni, and Min Zhu. Safeagent: A runtime protection architecture for agentic systems.arXiv preprint arXiv:2604.17562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Prompt Injection attack against LLM-integrated Applications

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, 2024. 11

work page 2024
[17]

Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

work page 2024
[18]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

work page 2023
[19]

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Rohan Sequeira, Stavros Damianakis, Umar Iqbal, and Konstantinos Psounis. Agent-sentry: Bounding llm agents via execution provenance.arXiv preprint arXiv:2603.22868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Prompt injec- tion attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

work page arXiv 2025
[21]

Progent: Programmable privilege control for llm agents.arXiv preprint arXiv:2504.11703, 2025

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. Progent: Programmable privilege control for llm agents, 2025.URL https://arxiv.org/abs/2504.11703

work page arXiv 2025
[22]

SEAgent: Self- evolving computer use agent with autonomous learning from experience.arXiv preprint arXiv:2508.04700, 2025

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

work page arXiv 2025
[23]

Information flow control in machine learning through modular model architecture

Trishita Tiwari, Suchin Gururangan, Chuan Guo, Weizhe Hua, Sanjay Kariyappa, Udit Gupta, Wenjie Xiong, Kiwan Maeng, Hsien-Hsin S Lee, and G Edward Suh. Information flow control in machine learning through modular model architecture. In33rd USENIX Security Symposium (USENIX Security 24), pages 6921–6938, 2024

work page 2024
[24]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions.arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Adaptools: Adaptive tool-based indirect prompt injection attacks on agentic llms.arXiv preprint arXiv:2602.20720, 2026

Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei, Zhong Chen, and Wei Yang Bryan Lim. Adaptools: Adaptive tool-based indirect prompt injection attacks on agentic llms.arXiv preprint arXiv:2602.20720, 2026

work page arXiv 2026
[26]

Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection.arXiv preprint arXiv:2508.01249, 2025

Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection.arXiv preprint arXiv:2508.01249, 2025

work page arXiv 2025
[27]

The landscape of prompt injection threats in llm agents: From taxonomy to analysis.arXiv preprint arXiv:2602.10453, 2026

Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, and Yuan Tian. The landscape of prompt injection threats in llm agents: From taxonomy to analysis.arXiv preprint arXiv:2602.10453, 2026

work page arXiv 2026
[28]

Instructional segment embedding: Improving llm safety with instruction hierarchy.arXiv preprint arXiv:2410.09102, 2024

Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. Instructional segment embedding: Improving llm safety with instruction hierarchy.arXiv preprint arXiv:2410.09102, 2024

work page arXiv 2024
[29]

CoRR abs/2406.09187 (2024)

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.arXiv preprint arXiv:2406.09187, 2024

work page arXiv 2024
[30]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents, 2024.URL: https://arxiv.org/abs/2403.02691

work page internal anchor Pith review arXiv 2024
[32]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024. 12

work page internal anchor Pith review arXiv 2024
[33]

Melon: Provable defense against indirect prompt injection attacks in ai agents.arXiv preprint arXiv:2502.05174, 2025

Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, and William Yang Wang. Melon: Provable defense against indirect prompt injection attacks in ai agents.arXiv preprint arXiv:2502.05174, 2025

work page arXiv 2025
[34]

Scaling test-time compute for llm agents.arXiv preprint arXiv:2506.12928, 2025

King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, et al. Scaling test-time compute for llm agents.arXiv preprint arXiv:2506.12928, 2025

work page arXiv 2025
[35]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Limitations. PACT gives an enforcement-layer guarantee: assuming conservative provenance propagation and correctly specified contracts, the runtime can...

work page internal anchor Pith review Pith/arXiv arXiv 2023