arxiv: 2604.17562 · v1 · submitted 2026-04-19 · 💻 cs.AI · cs.MA

Recognition: unknown

SafeAgent: A Runtime Protection Architecture for Agentic Systems

Hailin Liu , Eugene Ilyushin , Jie Ni , Min Zhu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:03 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords LLM agentsprompt injectionruntime securityagent safetycontext-aware reasoningmulti-step workflowsexecution governance

0 comments

The pith

SafeAgent protects LLM agents from propagating prompt-injection attacks by separating runtime execution governance from context-aware risk reasoning over persistent session state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that input-output filtering alone cannot stop prompt-injection attacks that spread through multi-step workflows, tool calls, and persistent context in LLM agents. It proposes modeling agent safety as a stateful decision problem instead. SafeAgent uses a runtime controller to mediate every action in the agent loop together with a decision core that reasons over the full interaction history. The core applies operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench and InjecAgent show higher robustness than baselines and text guardrails while keeping competitive performance on benign tasks.

Core claim

SafeAgent treats agent safety as a stateful decision problem over evolving interaction trajectories. It separates execution governance from semantic risk reasoning by coordinating a runtime controller that mediates actions around the agent loop with a context-aware decision core instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. This design enables detection and mitigation of attacks that propagate across tool interactions and persistent context.

What carries the argument

The runtime controller that mediates actions around the agent loop together with the context-aware decision core that operates over persistent session state using operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization.

Load-bearing premise

The separation of execution governance from semantic risk reasoning via a runtime controller and context-aware decision core will reliably detect and mitigate propagating prompt-injection attacks in multi-step workflows without excessive false positives or performance loss.

What would settle it

A specific multi-step task on Agent Security Bench in which a prompt-injection attack propagates and causes the agent to execute a harmful tool call despite mediation by the runtime controller and decisions from the context-aware core.

Figures

Figures reproduced from arXiv: 2604.17562 by Eugene Ilyushin, Hailin Liu, Jie Ni, Min Zhu.

**Figure 2.** Figure 2: Overview of the SafeAgent architecture. the agent’s intermediate representations and decision points. Unlike sample-level filtering, taint tracking focuses on information flow across a workflow: external observations (e.g., retrieval snippets, tool outputs, logs) can influence planning, tool arguments, and memory updates, turning a localized injection into persistent control over future behavior. Recent s… view at source ↗

**Figure 3.** Figure 3: Flow control sequence in the agent lifecycle. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Execution flow of the SafeAgent Core. Costs are represented along multiple dimensions reflecting system overhead and performance tradeoffs, enabling consistent comparison across heterogeneous intervention strategies. This unified evaluation framework allows the system to select actions under a coherent safety–utility criterion. c) World Model Simulation: The world model operator performs latent-space simu… view at source ↗

read the original abstract

Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SafeAgent frames agent safety as a stateful runtime decision problem and separates governance from risk reasoning, but the abstract gives no numbers or methods to judge whether the gains are real.

read the letter

Hi, the main point is that SafeAgent treats prompt injection in LLM agents as a stateful problem over full trajectories instead of relying on input-output filters. It splits the work between a runtime controller that sits in the agent loop and a context-aware decision core that tracks persistent state, then implements the core with five operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. That separation is the clearest new piece, and it directly targets how attacks can spread through tool calls and session memory in multi-step workflows. The experiments on Agent Security Bench and InjecAgent are reported to beat both plain baselines and text-level guardrails while keeping benign-task performance competitive, and the ablations on recovery confidence and policy weighting show tunable safety-utility points. Those are useful observations if they hold up. The soft spots are straightforward: the abstract supplies no metrics, error bars, ablation tables, or methodology, so there is no way to tell how large the robustness lift actually is or whether false-positive rates stay acceptable in longer runs. The formalization of the decision core is mentioned but not shown, which leaves open whether it rests on unstated assumptions or fitted parameters. This paper is aimed at people building or deploying agentic systems who need runtime defenses beyond simple guardrails. A reader working on agent security would find the architecture sketch and benchmark framing worth seeing, even if the evidence is still thin. I would send it to peer review because the problem is timely and the proposed split of concerns is a reasonable direction that deserves a full look at the methods and data.

Referee Report

0 major / 3 minor

Summary. The paper introduces SafeAgent, a runtime security architecture for LLM-based agentic systems vulnerable to propagating prompt-injection attacks in multi-step workflows and tool interactions. It separates execution governance from semantic risk reasoning via a runtime controller that mediates the agent loop and a context-aware decision core operating over persistent session state. The decision core is formalized as a context-aware advanced machine intelligence instantiated through five operators (risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization). Experiments on Agent Security Bench (ASB) and InjecAgent benchmarks report consistent robustness gains over baselines and text-level guardrails while preserving competitive benign-task performance; ablation studies examine the effects of recovery confidence and policy weighting on safety-utility operating points.

Significance. If the experimental claims are supported by detailed results in the full manuscript, the work addresses a timely and important problem in AI safety by shifting from static filtering to stateful, trajectory-aware runtime protection. The explicit separation of governance and risk reasoning, combined with formalization of the decision core and ablation analysis, provides a concrete design pattern that could be adopted or extended in practical agent deployments. The use of two established benchmarks strengthens the evaluation, though the magnitude of gains and statistical robustness would determine broader impact.

minor comments (3)

[Abstract] Abstract: The claim of 'consistent' robustness improvements and 'competitive' benign-task performance is stated without any quantitative metrics, confidence intervals, or table references, which reduces immediate clarity even if the full experiments section supplies them.
[Section 3] Section 3 (formalization): The phrase 'context-aware advanced machine intelligence' for the decision core is non-standard and lacks a precise definition or reference to related formalisms (e.g., POMDPs or online decision processes); clarifying its relation to the five listed operators would improve precision.
[Experiments] Experiments section: While ablation studies on recovery confidence and policy weighting are mentioned, the manuscript would benefit from explicit reporting of false-positive rates on benign tasks and any statistical tests for the robustness gains to allow readers to evaluate the safety-utility trade-off quantitatively.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of SafeAgent, the recognition of its timely contribution to runtime protection for LLM agents, and the recommendation for minor revision. The emphasis on stateful trajectory-aware protection and the formalization of the decision core aligns with our goals. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in architectural proposal or validation

full rationale

The paper describes SafeAgent as a runtime architecture with a controller and context-aware decision core instantiated via five operators (risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, state synchronization). No mathematical derivations, equations, first-principles predictions, or fitted parameters presented as independent results appear in the abstract or description. Validation consists of empirical experiments on ASB and InjecAgent benchmarks demonstrating robustness gains with competitive benign performance; this is external evidence rather than a reduction to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The design is self-contained as an engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work appears to rest on standard AI safety assumptions and benchmark evaluations rather than new theoretical primitives.

pith-pipeline@v0.9.0 · 5461 in / 990 out tokens · 56851 ms · 2026-05-10T06:03:11.808684+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

ReAct: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” inInternational Conference on Learning Representations (ICLR), 2023

2023
[2]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models,”arXiv preprint arXiv:2302.12173, 2023

work page internal anchor Pith review arXiv 2023
[3]

From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows,

M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, A. Lakas, and M. Debbah, “From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows,”ICT Express, 2025, available online 13 December 2025

2025
[4]

Instruction backdoor attacks against customized LLMs,

R. Zhang, H. Li, R. Wen, W. Jiang, Y . Zhang, M. Backes, Y . Shen, and Y . Zhang, “Instruction backdoor attacks against customized LLMs,” in 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 2024

2024
[5]

A path towards autonomous machine intelligence,

Y . LeCun, “A path towards autonomous machine intelligence,” 2022, manuscript, version 0.9.2. Accessed: 2026-02-14. [Online]. Available: https://openreview.net/forum?id=BZ5a1r-kVsf

2022
[6]

JATMO: Prompt injection defense by task-specific finetuning,

J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “JATMO: Prompt injection defense by task-specific finetuning,”arXiv preprint arXiv:2312.17673, 2024

work page arXiv 2024
[7]

RedVisor: Reasoning-aware prompt injection defense via zero-copy KV cache reuse,

M. Liu, S. Zhang, C. Long, and K.-Y . Lam, “RedVisor: Reasoning-aware prompt injection defense via zero-copy KV cache reuse,”arXiv preprint arXiv:2602.01795, 2026

work page arXiv 2026
[8]

PISanitizer: Preventing prompt injection to long-context LLMs via prompt sanitiza- tion,

R. Geng, Y . Wang, C. Yin, M. Cheng, Y . Chen, and J. Jia, “PISanitizer: Preventing prompt injection to long-context LLMs via prompt sanitiza- tion,”arXiv preprint arXiv:2511.10720, 2025

work page arXiv 2025
[9]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa, “Llama Guard: LLM-based input-output safeguard for human-AI conversations,”arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review arXiv 2023
[10]

LLM Guard: The security toolkit for LLM interactions,

Protect AI, “LLM Guard: The security toolkit for LLM interactions,” 2024, gitHub repository. Accessed: 2026-02-15. [Online]. Available: https://github.com/protectai/llm-guard

2024
[11]

GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis

Y . Xie, M. Fang, R. Piet al., “GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis,”arXiv preprint arXiv:2402.13494, 2024

work page arXiv 2024
[12]

doi:10.48550/arXiv.2501.18492 , abstract =

Y . Liu, H. Gao, S. Zhaiet al., “GuardReasoner: Towards reasoning-based LLM safeguards,”arXiv preprint arXiv:2501.18492, 2025

work page arXiv 2025
[13]

arXiv preprint arXiv:2410.22770 , year=

H. Li and X. Liu, “InjecGuard: Benchmarking and mitigating over-defense in prompt injection guardrail models,”arXiv preprint arXiv:2410.22770, 2024

work page arXiv 2024
[14]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

E. Wallaceet al., “The instruction hierarchy: Training language models to prioritize privileged instructions,”arXiv preprint arXiv:2404.13208, 2024

work page internal anchor Pith review arXiv 2024
[15]

AgentSys: Secure and dynamic LLM agents through explicit hierarchical memory management,

R. Wen, H. Li, C. Xiao, and N. Zhang, “AgentSys: Secure and dynamic LLM agents through explicit hierarchical memory management,”arXiv preprint arXiv:2602.07398, 2026

work page arXiv 2026
[16]

IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents.arXiv preprint arXiv:2508.15310, 2025

H. An, J. Zhang, T. Du, C. Zhou, Q. Li, T. Lin, and S. Ji, “IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents,”arXiv preprint arXiv:2508.15310, 2025

work page arXiv 2025
[17]

Malik Sallam

T. Rebedeaet al., “NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails,”arXiv preprint arXiv:2310.10501, 2023

work page arXiv 2023
[18]

Introducing gpt-oss-safeguard,

OpenAI, “Introducing gpt-oss-safeguard,” 2025, product release, October 29, 2025. [Online]. Available: https://openai.com/index/ introducing-gpt-oss-safeguard/

2025
[19]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, and Y . Zhang, “Agent security bench (ASB): Formalizing and bench- marking attacks and defenses in LLM-based agents,”arXiv preprint arXiv:2410.02644, 2025

work page internal anchor Pith review arXiv 2025
[20]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmark- ing indirect prompt injections in tool-integrated large language model agents,”arXiv preprint arXiv:2403.02691, 2024

work page internal anchor Pith review arXiv 2024