Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.
SafeAgent: A Runtime Protection Architecture for Agentic Systems
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.
citation-role summary
citation-polarity summary
fields
cs.CR 4years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
Proposes a trust schema including verification levels and a biconditional correctness criterion to verify skills in human-in-the-loop agent runtimes, reducing the need for constant oversight.
citing papers explorer
-
Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity
Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.
-
AIRGuard: Guarding Agent Actions with Runtime Authority Control
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
-
Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
Proposes a trust schema including verification levels and a biconditional correctness criterion to verify skills in human-in-the-loop agent runtimes, reducing the need for constant oversight.