pith. machine review for the scientific record.

arxiv: 2605.09045 · v1 · submitted 2026-05-09 · 💻 cs.AI · cs.CR · cs.SE

Recognition: 2 theorem links


Containment Verification: AI Safety Guarantees Independent of Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:17 UTC · model grok-4.3

classification 💻 cs.AI · cs.CR · cs.SE
keywords: containment verification · AI safety · formal verification · agentic frameworks · Dafny · havoc oracle · boundary enforcement · LLM agents

The pith

Safety guarantees for AI agents can be located and proven in the containing software framework rather than in the AI model itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces containment verification to place safety assurances directly inside agentic frameworks that control how AI agents act in the world. It models any AI as an unconstrained havoc oracle that can output any action in a typed space and proves that the containment layer still enforces the chosen boundary policies for every possible output. The proof proceeds by forward-simulation refinement and is mechanized in Dafny. A reader would care because this approach makes safety independent of unverifiable properties of the learned model, such as alignment or capability limits. The method is demonstrated on PocketFlow, a small LLM agent framework, using an automated pipeline to generate the model and proof.
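
A minimal Dafny sketch of the containment idea, written for this review rather than taken from the paper; the Action type, the Allowed policy, and the Contain method are illustrative stand-ins. Because the AI is a havoc oracle, the verifier knows nothing about the action except its type, so the postcondition must hold for every value of that type.

  // Illustrative only: a toy action space and boundary policy.
  datatype Action = ReadFile(path: string) | WriteFile(path: string, data: string) | RunShell(cmd: string)

  predicate Allowed(a: Action) {
    match a
    case ReadFile(_)     => true
    case WriteFile(p, _) => p != "/etc/shadow"   // example boundary rule
    case RunShell(_)     => false                // no shell execution
  }

  // Containment layer: `a` is an arbitrary oracle output (havoc), yet the
  // postcondition guarantees a disallowed action is never executed.
  method Contain(a: Action) returns (executed: bool)
    ensures executed ==> Allowed(a)
  {
    executed := Allowed(a);
  }

The guarantee is quantified over the whole Action type rather than over any distribution of model outputs, which is what makes it independent of the model's behavior.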

Core claim

Containment verification locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications.

What carries the argument

Havoc oracle semantics for the AI agent combined with forward-simulation refinement, where the oracle can produce any output in the typed action space and the containment layer is shown to enforce the policy in every case.
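
A schematic Dafny rendering of what that combination obligates, with every type and relation left abstract; this is the shape of the argument, not the paper's actual development, and all names below are invented for illustration. R is the simulation relation, StepI and StepM the concrete and abstract transitions, and Safe the boundary policy over events and model state.

  type IState   // concrete (implementation) state
  type MState   // abstract (model) state
  type Action
  type Event

  ghost predicate R(s: IState, m: MState)                            // simulation relation
  ghost predicate StepI(s: IState, a: Action, e: Event, s': IState)  // concrete transition
  ghost predicate StepM(m: MState, a: Action, e: Event, m': MState)  // abstract transition
  ghost predicate Safe(e: Event, m: MState)                          // boundary policy

  // Forward-simulation obligation (discharged per framework in the paper's setting).
  lemma {:axiom} ForwardSim(s: IState, m: MState, a: Action, e: Event, s': IState)
    requires R(s, m) && StepI(s, a, e, s')
    ensures exists m' :: StepM(m, a, e, m') && R(s', m')

  // Abstract havoc safety: every abstract step satisfies the policy.
  lemma {:axiom} HavocSafe(m: MState, a: Action, e: Event, m': MState)
    requires StepM(m, a, e, m')
    ensures Safe(e, m')

  // Soundness shape: a concrete step from a related state is safe at the boundary.
  lemma Soundness(s: IState, m: MState, a: Action, e: Event, s': IState)
    requires R(s, m) && StepI(s, a, e, s')
    ensures exists m' :: StepM(m, a, e, m') && R(s', m') && Safe(e, m')
  {
    ForwardSim(s, m, a, e, s');
    var m' :| StepM(m, a, e, m') && R(s', m');
    HavocSafe(m, a, e, m');
  }

The universal quantification over oracle outputs lives in the abstract havoc-safety side; the refinement transports it down to the implementation.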

If this is right

  • The containment layer guarantees the boundary policy holds for every possible output from the AI oracle within the typed action space.
  • The safety property remains enforced regardless of the AI model's internal capabilities or alignment.
  • Machine-checked proofs in Dafny supply concrete, auditable evidence for the guarantee.
  • The approach has been applied to a working minimalist LLM agent framework called PocketFlow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could let developers wrap powerful future models with verified boundaries before full alignment is solved.
  • It points toward designing agentic systems from the start with explicit, verifiable action boundaries rather than relying only on training.
  • The same containment pattern might extend to multi-agent or tool-using systems where actions cross different software layers.

Load-bearing premise

That safety-relevant properties can be expressed as boundary-enforceable properties over the modeled events, arguments, and state, and that the havoc oracle semantics plus the typed action boundary together capture all relevant behaviors of the real AI agent.
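
To make "boundary-enforceable" concrete, here is one way such properties might look in Dafny; the Event and State types and both predicates are invented for this review, not taken from the paper. The point is that each property mentions only modeled boundary events, action arguments, and modeled state, never anything about the AI's internals.

  datatype Event = FileWrite(path: string) | NetSend(host: string, bytes: nat) | Noop
  datatype State = State(writesRevoked: bool, bytesSentTotal: nat)

  // A qualitative boundary property: no file writes once writes are revoked.
  ghost predicate NoWritesAfterRevoke(e: Event, s: State) {
    s.writesRevoked ==> !e.FileWrite?
  }

  // A quantitative one: total egress stays under a cap, checked per NetSend event.
  ghost predicate EgressCap(e: Event, s: State, cap: nat) {
    e.NetSend? ==> s.bytesSentTotal + e.bytes <= cap
  }

A property that needs information outside this vocabulary, such as the intent behind an action, would not be boundary-enforceable in this sense, which is exactly the premise the referee probes below.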

What would settle it

Either an experiment in which a real LLM agent running inside the verified PocketFlow framework produces an output that violates the boundary policy despite the formal containment guarantee, or the identification of a safety-relevant property that cannot be expressed using only the modeled boundary events, action arguments, and state.

Figures

Figures reproduced from arXiv: 2605.09045 by Lav R. Varshney, Royce Moon.

Figure 1
Figure 1: Deployment stack of an AI agent. Action is the typed interface through which the AI's outputs reach the containment layer. The containment layer is formally verified, whereas the sandbox, host kernel, and hardware beneath it are operationally audited. The recently disclosed Copy Fail vulnerability, a Linux kernel privilege escalation latent for nearly a decade, empirically illustrates th… view at source ↗
Figure 2
Figure 2: Boundary-event refinement. Each concrete step labeled by action a and event e_I is matched by an abstract step with the same action, a related event e_M, and a related post-state. The captured snippet continues into Theorem 3.2 (Containment Verification Soundness), which takes total labeled transition systems T_M and T_I sharing action space A and assumes (A1) abstract havoc safety for every havoc trace (s^M_i, a_i, e^M_i), i ≥ 0. view at source ↗
Original abstract

Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces containment verification for agentic AI frameworks, modeling the AI as an unconstrained havoc oracle over a typed action space. It proves via forward-simulation refinement that a verified containment layer enforces boundary-enforceable properties (expressed over boundary events, action arguments, and state) for every possible oracle output, mechanizes the proof in Dafny, and instantiates the approach on PocketFlow using an agentic synthesis pipeline with an information barrier to avoid tautological specifications. The central claim is that this yields safety guarantees independent of model alignment or capability, at least over the modeled boundary.

Significance. If the boundary model is complete, the result would provide a mechanized, alignment-independent safety guarantee for properties expressible at the typed action interface, which is a notable advance over alignment-dependent methods. The machine-checked Dafny proof and information-barrier synthesis pipeline are clear strengths that support reproducibility and reduce circularity risks. However, the significance hinges on whether the havoc oracle and typed boundary capture all safety-relevant behaviors, an assumption left unverified in the manuscript.

major comments (2)
  1. [Abstract, PocketFlow instantiation] The universal guarantee is established only inside the model (havoc oracle ranging over the typed action space). No argument, completeness proof, or empirical check is given that this boundary captures all safety-relevant real-world behaviors of LLM agents (e.g., untyped tool invocations, persistent state not represented in the model, or indirect side-effects). This assumption is load-bearing for the headline claim of alignment-independent guarantees.
  2. [Formal model / havoc oracle semantics] The proof relies on the typed action boundary and havoc oracle together being exhaustive. Without a justification or additional modeling of potential bypasses (interpretation-dependent actions, unmodeled APIs), the forward-simulation refinement does not transfer to real agents, undermining the claim that the guarantee is 'invariant to model capability'.
minor comments (1)
  1. [Abstract] Clarify in the abstract and introduction the precise scope: the guarantee holds inside the model but requires separate justification for transfer to deployed systems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the importance of clearly scoping the formal guarantees. We address each major comment below, agreeing to revisions that better emphasize the modeled boundary while preserving the core contribution of mechanized containment verification.

Point-by-point responses
  1. Referee: [Abstract, PocketFlow instantiation] The universal guarantee is established only inside the model (havoc oracle ranging over the typed action space). No argument, completeness proof, or empirical check is given that this boundary captures all safety-relevant real-world behaviors of LLM agents (e.g., untyped tool invocations, persistent state not represented in the model, or indirect side-effects). This assumption is load-bearing for the headline claim of alignment-independent guarantees.

    Authors: We agree that the universal guarantee holds strictly inside the formal model, where the AI is represented as a havoc oracle over the typed action space, and that no completeness proof or empirical validation is provided for coverage of all real-world behaviors. The manuscript already qualifies the result as applying 'over the modeled typed action boundary' for boundary-enforceable properties. To address the concern about potential misreading of the headline claim, we will revise the abstract and introduction to state explicitly that the alignment-independent guarantee applies only to properties expressible at the typed interface within the model. We do not claim the boundary is complete for arbitrary LLM agents; users of the approach must define a faithful model for their specific safety properties and interface. revision: yes

  2. Referee: [Formal model / havoc oracle semantics] The proof relies on the typed action boundary and havoc oracle together being exhaustive. Without a justification or additional modeling of potential bypasses (interpretation-dependent actions, unmodeled APIs), the forward-simulation refinement does not transfer to real agents, undermining the claim that the guarantee is 'invariant to model capability'.

    Authors: The forward-simulation refinement proves that the containment layer enforces the specified boundary policy for every possible output of the havoc oracle, thereby making the guarantee independent of the underlying AI model's alignment or capability within the modeled space. We acknowledge that the proof assumes the typed action boundary is the complete interface and does not address potential bypasses such as untyped actions or unmodeled state changes. We will add an explicit limitations paragraph in the formal model section clarifying this assumption and noting that transfer to deployed agents requires the model to faithfully represent the actual action interface. No additional modeling of bypasses is included because the technique is parametric over any chosen typed boundary; the invariance claim is scoped accordingly and is not undermined within that scope. revision: partial
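
The rebuttal's "parametric over any chosen typed boundary" can be read as saying the containment check is generic in both the action type and the policy, so the same guarantee is re-established for whatever boundary a deployment actually models. A minimal illustrative Dafny rendering, not the paper's code:

  // Generic containment: A is any typed action space, policy any boundary predicate.
  // The postcondition holds for every possible oracle output `a` of type A.
  method ContainWith<A>(policy: A -> bool, a: A) returns (executed: bool)
    ensures executed ==> policy(a)
  {
    executed := policy(a);
  }

What the genericity does not supply, as the referee notes, is any argument that the chosen action type covers everything the deployed agent can actually do.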

Circularity Check

0 steps flagged

No circularity: mechanized refinement proof is deductive within explicit model

Full rationale

The paper's central derivation is a standard forward-simulation refinement proof, mechanized in Dafny, establishing that any containment layer enforces boundary-enforceable properties for every output of the havoc oracle over the typed action space. This holds by construction of the model and the proof obligations; the universal guarantee is not fitted to data, not renamed from a prior result, and does not rely on self-citations for its load-bearing steps. The modeling choice of an unconstrained havoc oracle makes the guarantee independent of alignment by definition of the semantics, rather than deriving that independence from the proof itself. The separate assumption that the typed boundary captures all real-world behaviors is stated explicitly and is not part of the verified chain, so the derivation does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The claim rests on the ability to define boundary-enforceable properties and on the soundness of forward-simulation refinement for universal quantification over oracle outputs; no numerical free parameters are introduced.

axioms (1)
  • [standard math] Forward-simulation refinement implies that every possible oracle output is contained by the verified layer.
    Invoked to obtain the universal guarantee from the refinement relation.
invented entities (2)
  • Havoc oracle semantics (no independent evidence)
    purpose: Model the AI as ranging over the entire typed action space without behavioral constraints.
    New modeling device introduced to make the safety guarantee independent of model capability.
  • Containment layer (no independent evidence)
    purpose: The software component that enforces the boundary policy for every possible AI output.
    Central new construct whose verification constitutes the safety guarantee.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
