pith. sign in

arxiv: 2603.05786 · v2 · pith:T6NCHRKNnew · submitted 2026-03-06 · 💻 cs.CR · cs.AI· cs.CL

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

classification 💻 cs.CR cs.AIcs.CL
keywords guardrailproof-of-guardrailagentagentsdeveloperexecutionsafetycode
0
0 comments X
read the original abstract

As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BraveGuard: From Open-World Threats to Safer Computer-Use Agents

    cs.CR 2026-05 unverdicted novelty 5.0

    BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.

  2. Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI

    cs.SE 2026-04 unverdicted novelty 5.0

    Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.

  3. From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

    cs.AI 2026-04 unverdicted novelty 4.0

    The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...