Recognition: 2 Lean theorem links
Containment Verification: AI Safety Guarantees Independent of Alignment
Pith reviewed 2026-05-12 02:17 UTC · model grok-4.3
The pith
Safety guarantees for AI agents can be located and proven in the containing software framework rather than the AI model itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Containment verification locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications.
What carries the argument
Havoc oracle semantics for the AI agent combined with forward-simulation refinement, where the oracle can produce any output in the typed action space and the containment layer is shown to enforce the policy in every case.
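The shape of this argument can be sketched in a toy Python model (this is not the paper's Dafny development; the action kinds, targets, and policy below are hypothetical): the havoc oracle is modeled by quantifying over every action in a small typed action space, and the containment layer must uphold the boundary policy in each case, with no assumption about which action the AI actually emits.

```python
# Toy sketch of havoc oracle semantics plus a containment layer.
# All names and the example policy are illustrative assumptions.
from dataclasses import dataclass
from itertools import product
from typing import Optional

@dataclass(frozen=True)
class Action:
    kind: str      # e.g. "read", "write", "exec"
    target: str    # argument visible at the typed boundary

# Hypothetical finite typed action space.
KINDS = ["read", "write", "exec"]
TARGETS = ["/sandbox/a", "/sandbox/b", "/etc/passwd"]
ACTION_SPACE = [Action(k, t) for k, t in product(KINDS, TARGETS)]

def policy(action: Action) -> bool:
    """Boundary policy: only reads, or writes inside the sandbox."""
    if action.kind == "read":
        return True
    return action.kind == "write" and action.target.startswith("/sandbox/")

def contain(action: Action) -> Optional[Action]:
    """Containment layer: block any action the policy rejects."""
    return action if policy(action) else None

# Havoc semantics: check the guarantee for EVERY possible oracle output.
assert all(contain(a) is None or policy(contain(a)) for a in ACTION_SPACE)
```

The paper's mechanized proof plays the role of this exhaustive check, except that it holds for an unbounded typed action space by deduction rather than enumeration.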
If this is right
- The containment layer guarantees the boundary policy holds for every possible output from the AI oracle within the typed action space.
- The safety property remains enforced regardless of the AI model's internal capabilities or alignment.
- Machine-checked proofs in Dafny supply concrete, auditable evidence for the guarantee.
- The approach has been applied to a working minimalist LLM agent framework called PocketFlow.
Where Pith is reading between the lines
- This method could let developers wrap powerful future models with verified boundaries before full alignment is solved.
- It points toward designing agentic systems from the start with explicit, verifiable action boundaries rather than relying only on training.
- The same containment pattern might extend to multi-agent or tool-using systems where actions cross different software layers.
Load-bearing premise
That safety-relevant properties can be expressed as boundary-enforceable properties over the modeled events, arguments, and state, and that the havoc oracle semantics plus the typed action boundary together capture all relevant behaviors of the real AI agent.
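What "boundary-enforceable" means can be illustrated with a toy predicate (hypothetical names; the event vocabulary and state fields are our assumptions, not the paper's): such a property mentions only modeled boundary events, their arguments, and state, so it can be checked without inspecting the AI's internals.

```python
# Toy boundary-enforceable property over (event, args, state).
def no_writes_outside_sandbox(event: str, args: dict, state: dict) -> bool:
    """Expressible: refers only to the boundary event, its arguments, and state."""
    if event != "file_write":
        return True
    return args["path"].startswith(state["sandbox_root"])

state = {"sandbox_root": "/sandbox/"}
assert no_writes_outside_sandbox("file_write", {"path": "/sandbox/x"}, state)
assert not no_writes_outside_sandbox("file_write", {"path": "/etc/passwd"}, state)

# By contrast, "the model is honest" or "the model intends no harm" cannot
# be written in this form: they refer to internals, not to modeled events.
```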
What would settle it
An experiment in which a real LLM agent inside the verified PocketFlow framework produces an output that violates the boundary policy despite the formal containment guarantee, or a safety property that cannot be expressed using only the modeled boundary events and state.
Original abstract
Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces containment verification for agentic AI frameworks, modeling the AI as an unconstrained havoc oracle over a typed action space. It proves via forward-simulation refinement that a verified containment layer enforces boundary-enforceable properties (expressed over boundary events, action arguments, and state) for every possible oracle output, mechanizes the proof in Dafny, and instantiates the approach on PocketFlow using an agentic synthesis pipeline with an information barrier to avoid tautological specifications. The central claim is that this yields safety guarantees independent of model alignment or capability, at least over the modeled boundary.
Significance. If the boundary model is complete, the result would provide a mechanized, alignment-independent safety guarantee for properties expressible at the typed action interface, which is a notable advance over alignment-dependent methods. The machine-checked Dafny proof and information-barrier synthesis pipeline are clear strengths that support reproducibility and reduce circularity risks. However, the significance hinges on whether the havoc oracle and typed boundary capture all safety-relevant behaviors, an assumption left unverified in the manuscript.
major comments (2)
- [Abstract, PocketFlow instantiation] The universal guarantee is established only inside the model (havoc oracle ranging over the typed action space). No argument, completeness proof, or empirical check is given that this boundary captures all safety-relevant real-world behaviors of LLM agents (e.g., untyped tool invocations, persistent state not represented in the model, or indirect side effects). This assumption is load-bearing for the headline claim of alignment-independent guarantees.
- [Formal model / havoc oracle semantics] The proof relies on the typed action boundary and havoc oracle together being exhaustive. Without a justification or additional modeling of potential bypasses (interpretation-dependent actions, unmodeled APIs), the forward-simulation refinement does not transfer to real agents, undermining the claim that the guarantee is 'invariant to model capability'.
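The gap the referee describes can be made concrete with a toy dispatcher (all names hypothetical): a proof that quantifies only over modeled actions says nothing about an execution path the model does not cover.

```python
# Toy sketch of the referee's worry: the guarantee quantifies over a
# *modeled* action space; an unmodeled API call escapes it without
# contradicting the proof. Names here are illustrative assumptions.
MODELED_ACTIONS = {"read", "write"}

def contained_dispatch(action: str, guard) -> str:
    if action in MODELED_ACTIONS:
        return guard(action)          # path covered by the refinement proof
    return f"executed {action}"       # unmodeled path: no guarantee applies

guard = lambda a: f"checked {a}"
assert contained_dispatch("write", guard) == "checked write"
# An untyped tool invocation slips past the verified layer:
assert contained_dispatch("shell_exec", guard) == "executed shell_exec"
```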
minor comments (1)
- [Abstract] Clarify in the abstract and introduction the precise scope: the guarantee holds inside the model but requires separate justification for transfer to deployed systems.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the importance of clearly scoping the formal guarantees. We address each major comment below, agreeing to revisions that better emphasize the modeled boundary while preserving the core contribution of mechanized containment verification.
Point-by-point responses
- Referee: [Abstract, PocketFlow instantiation] The universal guarantee is established only inside the model (havoc oracle ranging over the typed action space). No argument, completeness proof, or empirical check is given that this boundary captures all safety-relevant real-world behaviors of LLM agents (e.g., untyped tool invocations, persistent state not represented in the model, or indirect side effects). This assumption is load-bearing for the headline claim of alignment-independent guarantees.
Authors: We agree that the universal guarantee holds strictly inside the formal model, where the AI is represented as a havoc oracle over the typed action space, and that no completeness proof or empirical validation is provided for coverage of all real-world behaviors. The manuscript already qualifies the result as applying 'over the modeled typed action boundary' for boundary-enforceable properties. To address the concern about potential misreading of the headline claim, we will revise the abstract and introduction to state explicitly that the alignment-independent guarantee applies only to properties expressible at the typed interface within the model. We do not claim the boundary is complete for arbitrary LLM agents; users of the approach must define a faithful model for their specific safety properties and interface. Revision: yes.
- Referee: [Formal model / havoc oracle semantics] The proof relies on the typed action boundary and havoc oracle together being exhaustive. Without a justification or additional modeling of potential bypasses (interpretation-dependent actions, unmodeled APIs), the forward-simulation refinement does not transfer to real agents, undermining the claim that the guarantee is 'invariant to model capability'.
Authors: The forward-simulation refinement proves that the containment layer enforces the specified boundary policy for every possible output of the havoc oracle, thereby making the guarantee independent of the underlying AI model's alignment or capability within the modeled space. We acknowledge that the proof assumes the typed action boundary is the complete interface and does not address potential bypasses such as untyped actions or unmodeled state changes. We will add an explicit limitations paragraph in the formal model section clarifying this assumption and noting that transfer to deployed agents requires the model to faithfully represent the actual action interface. No additional modeling of bypasses is included because the technique is parametric over any chosen typed boundary; the invariance claim is scoped accordingly and is not undermined within that scope. Revision: partial.
Circularity Check
No circularity: mechanized refinement proof is deductive within explicit model
Full rationale
The paper's central derivation is a standard forward-simulation refinement proof, mechanized in Dafny, establishing that any containment layer enforces boundary-enforceable properties for every output of the havoc oracle over the typed action space. This holds by construction of the model and the proof obligations; the universal guarantee is not fitted to data, not renamed from a prior result, and does not rely on self-citations for its load-bearing steps. The modeling choice of an unconstrained havoc oracle makes the guarantee independent of alignment by definition of the semantics, rather than deriving that independence from the proof itself. The separate assumption that the typed boundary captures all real-world behaviors is stated explicitly and is not part of the verified chain, so the derivation does not reduce to its inputs.
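Schematically, the verified obligation described above can be rendered as a single universally quantified implication (our notation, not the paper's; the symbols are assumptions for illustration):

```latex
\[
\forall s \in S,\ \forall a \in \mathcal{A}_{\mathrm{typed}}:\quad
\mathrm{Inv}(s) \;\Longrightarrow\;
\mathrm{Inv}\big(\mathrm{step}(s,\ \mathrm{contain}(s, a))\big)
\;\wedge\; P\big(\mathrm{event}(s, a)\big)
\]
```

Here the unconstrained quantifier over $a$ is the havoc oracle, $\mathrm{contain}$ is the verified containment layer, and $P$ ranges over boundary-enforceable properties; alignment-independence is built into the quantifier rather than derived, which is exactly why the derivation is deductive and non-circular within the model.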
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Forward-simulation refinement implies that every possible oracle output is contained by the verified layer.
invented entities (2)
- Havoc oracle semantics: no independent evidence
- Containment layer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "We prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny... under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Constitutional AI: Harmlessness from AI Feedback (2022)
- [2] Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (2023)
- [3] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024)
- [4] Measuring Faithfulness in Chain-of-Thought Reasoning (2023)
- [5] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (2023)
- [6] AI Control: Improving Safety Despite Intentional Subversion (2024)
- [7] Ctrl-Z: Controlling AI Agents via Resampling (2025)
- [8]
- [9] Improving Alignment and Robustness with Circuit Breakers (2024)
- [10]
- [11] The Alignment Problem from a Deep Learning Perspective (2025)
- [12] VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation (2025)
- [13] AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents (2025)
- [14]
- [15] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (2024)
- [16]
- [17] Chase, Harrison
- [18] (2025)
- [19] Steinberger, Peter
- [20] (2026)
- [21] Huang, Zachary
- [22] Boogie: A Modular Reusable Verifier for Object-Oriented Programs. Formal Methods for Components and Objects (FMCO), 2005
- [23]
- [24] Dafny as Verification-Aware Intermediate Language for Code Generation (2025)
- [25] Klein, Gerwin; Elphinstone, Kevin; Heiser, Gernot; Andronick, June; Cock, David; Derrin, Philip; Elkaduwe, Dhammika; Engelhardt, Kai; Kolanski, Rafal; Norrish, Michael; Sewell, Thomas; Tuch, Harvey; Winwood, Simon
- [26] Hawblitzel, Chris; Howell, Jon; Kapritsos, Manos; Lorch, Jacob R.; Parno, Bryan; Roberts, Michael L.; Setty, Srinath; Zill, Brian
- [27] Formal Verification of a Realistic Compiler. Communications of the ACM
- [28] Counterexample-Guided Abstraction Refinement. Computer Aided Verification (CAV), 2000
- [29] Efficient Detection of Vacuity in Temporal Model Checking. Formal Methods in System Design
- [30] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread
- [31] BEAVER: An Efficient Deterministic LLM Verifier (2025)
- [32] How Catastrophic is Your LLM? Certifying Risk in Conversation (2026)
- [33] Lumos: Let there be Language Model System Certification (2026)
- [34] Goldweber, Eli; Yu, Weixin; Vakil Ghahani, Seyed Armin; Kapritsos, Manos. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)
- [35] VeriAct: Beyond Verifiability -- Agentic Synthesis of Correct and Complete Formal Specifications (2026)
- [36] ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis (2026)
- [37] Copy Fail: 732 Bytes to Root on Every Major Linux Distribution
discussion (0)