Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Aaron Chan; Junyi Du; Michael Duan; Qin Lin; Xiang Ren; Xisen Jin; Zhenglun Chen

arXiv: 2603.05786 · v2 · pith:T6NCHRKNnew · submitted 2026-03-06 · 💻 cs.CR · cs.AI · cs.CL

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren

classification 💻 cs.CR · cs.AI · cs.CL

keywords guardrail · proof-of-guardrail · agent · agents · developer · execution · safety · code

Abstract

As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
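
The verification flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`measure`, `make_attestation`, `verify_attestation`) is hypothetical, and a shared-key HMAC stands in for the TEE signature purely so the example runs with the Python standard library. A real TEE attestation is signed with a hardware-backed asymmetric key and checked against the vendor's certificate chain. The sketch also assumes the attestation binds a hash of the response to the guardrail code measurement, which the abstract does not spell out.

```python
import hashlib
import hmac


def measure(code: bytes) -> str:
    """Hash the guardrail source, standing in for a TEE code measurement."""
    return hashlib.sha256(code).hexdigest()


def _sign(key: bytes, measurement: str, response: str) -> str:
    # ASSUMPTION: the attestation covers both the code measurement and a
    # hash of the response. HMAC is a stand-in for the TEE's signature.
    response_hash = hashlib.sha256(response.encode()).hexdigest()
    payload = f"{measurement}|{response_hash}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_attestation(key: bytes, guardrail_source: bytes, response: str) -> dict:
    """Developer side (hypothetical): produced inside the TEE with the response."""
    measurement = measure(guardrail_source)
    return {
        "measurement": measurement,
        "signature": _sign(key, measurement, response),
    }


def verify_attestation(attestation: dict, guardrail_source: bytes,
                       response: str, key: bytes) -> bool:
    """User side (hypothetical): offline check against the open-source guardrail."""
    expected = measure(guardrail_source)
    # Step 1: the attested measurement must match the published guardrail code.
    if not hmac.compare_digest(attestation["measurement"], expected):
        return False
    # Step 2: the signature must cover this measurement and this response.
    return hmac.compare_digest(attestation["signature"],
                               _sign(key, expected, response))
```

As the abstract cautions, a passing check of this kind shows only that the named guardrail code ran; it does not show that the guardrail was effective, for example if the developer actively jailbreaks it.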

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BraveGuard: From Open-World Threats to Safer Computer-Use Agents
cs.CR 2026-05 unverdicted novelty 5.0

BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.

Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
cs.SE 2026-04 unverdicted novelty 5.0

Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.

From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
cs.AI 2026-04 unverdicted novelty 4.0

The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...