pith. machine review for the scientific record. sign in

arxiv: 2604.20496 · v1 · submitted 2026-04-22 · 💻 cs.CR · cs.AI

Recognition: unknown

Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:01 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords formal verificationZ3 SMT solverarithmetic vulnerabilitiesCWE-190AI sandbox containmentpre-deployment analysisMythos escapeinteger overflow
0
0 comments X

The pith

A Z3-based verifier called COBALT detects integer overflow patterns in C/C++ sandbox code before deployment and aligns with the reported Mythos escape class under explicit assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COBALT as an SMT solver engine that encodes CWE-190/191/195 arithmetic patterns from C and C++ code into Z3 queries to produce either concrete witnesses of vulnerabilities or UNSAT safety guarantees under stated bounds. It validates the approach on four production codebases including NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime, where the tool returns reproducible SAT or UNSAT outcomes. The authors further map the publicly described Mythos sandbox escape to a Z3-expressible CWE-190 formulation and claim that pre-deployment analysis would have surfaced it. They embed COBALT inside a proposed four-layer containment stack that also includes pre-execution constraints, output controls, and runtime monitoring. The core infrastructural argument is that frontier-model safety cannot rest on behavioral safeguards alone when the underlying containment code itself carries formally characterizable arithmetic defects.

Core claim

COBALT is a Z3 SMT-based formal verification engine that identifies CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. It returns SAT verdicts accompanied by concrete witnesses or UNSAT guarantees when safety bounds are supplied. The engine was exercised on four production case studies (NASA cFE, wolfSSL, Eclipse Mosquitto, NASA F Prime) with reproducible encodings and verified solver output. Under explicit assumptions the Mythos escape class is shown to be consistent with a Z3-expressible CWE-190 formulation, implying that the same pre-deployment analysis would have been capable of surfacing the relevant pattern. The broader claim is that frontier

What carries the argument

COBALT, the Z3 SMT-based engine that translates arithmetic vulnerability patterns from C/C++ source into SAT queries and returns concrete witnesses or UNSAT safety results under explicit bounds.

If this is right

  • COBALT returns SAT verdicts with concrete witnesses for arithmetic flaws in the validated production codebases.
  • The four-layer framework (COBALT, VERDICT, DIRECTIVE-4, SENTINEL) maps pre-deployment verification to the failure modes exposed by the Mythos incident.
  • Pre-deployment formal analysis is claimed to be capable of surfacing the arithmetic pattern consistent with the Mythos escape.
  • Frontier-model safety requires formal verification of the containment stack itself rather than reliance on behavioral safeguards alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would require sandbox developers to supply explicit numeric bounds before each release so that UNSAT results can be treated as deployment-ready.
  • The same encoding approach could be applied to other CWE classes once the arithmetic core is shown to generalize beyond the four studied codebases.
  • Integration of COBALT into continuous-integration pipelines would turn each build into a formal safety gate for arithmetic properties.

Load-bearing premise

The Mythos escape vector must match a CWE-190 integer-overflow pattern that Z3 encodings can reliably capture, and the four production case studies must generalize to frontier-model sandbox infrastructure without additional unstated constraints on code structure or bounds.

What would settle it

A concrete instance of sandbox networking code containing an integer-overflow escape that produces UNSAT in COBALT yet still permits the reported escape behavior when executed.

Figures

Figures reproduced from arXiv: 2604.20496 by Dominik Blain.

Figure 2
Figure 2. Figure 2: Illustrative counterfactual under A-CWE: if the sand [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
read the original abstract

The April 2026 Claude Mythos sandbox escape exposed a critical weakness in frontier AI containment: the infrastructure surrounding advanced models remains susceptible to formally characterizable arithmetic vulnerabilities. Anthropic has not publicly characterized the escape vector; some secondary accounts hypothesize a CWE-190 arithmetic vulnerability in sandbox networking code. We treat this as unverified and analyze the vulnerability class rather than the specific escape. This paper presents COBALT, a Z3 SMT-based formal verification engine for identifying CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. We distinguish two classes of contribution. Validated: COBALT detects arithmetic vulnerability patterns in production codebases, producing SAT verdicts with concrete witnesses and UNSAT guarantees under explicit safety bounds. We demonstrate this on four production case studies: NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime, with reproducible encodings, verified solver output, and acknowledged security outcomes. Proposed: a four-layer containment framework consisting of COBALT, VERDICT, DIRECTIVE-4, and SENTINEL, mapping pre-deployment verification, pre-execution constraints, output control, and runtime monitoring to the failure modes exposed by the Mythos incident. Under explicit assumptions, we further argue that the publicly reported Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation and that pre-deployment formal analysis would have been capable of surfacing the relevant pattern. The broader claim is infrastructural: frontier-model safety cannot depend on behavioral safeguards alone; the containment stack itself must be subjected to formal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces COBALT, a Z3 SMT-based formal verification engine for detecting CWE-190/191/195 arithmetic vulnerability patterns in C/C++ code. It reports validated detections producing SAT verdicts with concrete witnesses and UNSAT guarantees on four production codebases (NASA cFE, wolfSSL, Eclipse Mosquitto, NASA F Prime) under explicit safety bounds, with claimed reproducible encodings and solver output. It proposes a four-layer containment framework (COBALT, VERDICT, DIRECTIVE-4, SENTINEL) and argues under explicit assumptions that the Mythos sandbox escape is consistent with a Z3-expressible CWE-190 formulation that pre-deployment analysis could have surfaced, concluding that frontier-model safety requires formal verification of the containment stack itself.

Significance. If the validated detections hold with reproducible encodings and the generalization to sandbox infrastructure is justified without unstated constraints on code structure, this could strengthen the case for integrating SMT-based pre-deployment verification into AI containment pipelines. The concrete-witness approach and explicit bounds in the four case studies represent a practical strength for formal methods in security-critical infrastructure.

major comments (2)
  1. [Abstract] Abstract and proposed contributions: The claim that the Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation (and thus that pre-deployment analysis would have surfaced the pattern) rests on treating secondary accounts as unverified hypotheses without independent grounding or characterization from Anthropic; this makes the infrastructural conclusion circular, as the pattern match is assumed to support the broader argument rather than demonstrated.
  2. [Validated case studies] Validated case studies section: Although the manuscript states that COBALT produces SAT verdicts with concrete witnesses, UNSAT guarantees, reproducible encodings, and verified solver output on NASA cFE, wolfSSL, Mosquitto, and F Prime, no actual Z3 encodings, integer bounds, overflow sites, or solver logs are provided; this prevents verification of the claims and undermines the assertion that these results generalize to frontier-model sandbox networking code without additional unstated constraints on code structure or mitigations.
minor comments (2)
  1. [Proposed framework] The four-layer framework introduces the terms VERDICT, DIRECTIVE-4, and SENTINEL without sufficient definitions, mappings to the failure modes, or comparison to existing standards, reducing clarity of the proposed contribution.
  2. [Abstract] The abstract references 'acknowledged security outcomes' for the four case studies but provides no citations, descriptions of the original vulnerabilities, or links to how the SAT results align with known issues in those codebases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. We address each major comment below, providing clarifications and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract and proposed contributions: The claim that the Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation (and thus that pre-deployment analysis would have surfaced the pattern) rests on treating secondary accounts as unverified hypotheses without independent grounding or characterization from Anthropic; this makes the infrastructural conclusion circular, as the pattern match is assumed to support the broader argument rather than demonstrated.

    Authors: We explicitly state in the manuscript that we treat the Mythos escape as unverified and analyze the vulnerability class rather than the specific incident. The consistency argument is presented under explicit assumptions as an illustrative example of how a Z3-expressible pattern could apply, not as a demonstrated fact about the incident itself. The infrastructural conclusion—that frontier-model safety requires formal verification of the containment stack—is advanced as a general claim independent of the Mythos details. To prevent any appearance of circularity, we will revise the abstract and introduction to more prominently separate the hypothetical example, restate the assumptions, and clarify that the broader argument does not rely on the specific incident being proven. revision: partial

  2. Referee: [Validated case studies] Validated case studies section: Although the manuscript states that COBALT produces SAT verdicts with concrete witnesses, UNSAT guarantees, reproducible encodings, and verified solver output on NASA cFE, wolfSSL, Mosquitto, and F Prime, no actual Z3 encodings, integer bounds, overflow sites, or solver logs are provided; this prevents verification of the claims and undermines the assertion that these results generalize to frontier-model sandbox networking code without additional unstated constraints on code structure or mitigations.

    Authors: We agree that the absence of the concrete Z3 encodings, integer bounds, overflow sites, and solver logs in the manuscript prevents independent verification of the validated case studies. Although the text claims reproducibility, these artifacts are not included. We will add the specific encodings, bounds, sites, and logs as a dedicated appendix in the revised manuscript. This addition will also allow us to explicitly note any code-structure assumptions relevant to generalization toward sandbox networking code. revision: yes

Circularity Check

0 steps flagged

No significant circularity; Mythos argument is explicitly conditional on unverified assumptions

full rationale

The paper states it treats the Mythos escape as unverified and analyzes the vulnerability class rather than the specific incident. Validated contributions consist of COBALT detections on four independent production codebases (NASA cFE, wolfSSL, Mosquitto, F Prime) with reproducible SAT/UNSAT outputs and acknowledged security outcomes. The Mythos consistency claim is introduced only 'under explicit assumptions' and does not serve as a load-bearing premise for the validated results or the infrastructural recommendation. No equations, self-citations, fitted parameters, or ansatzes reduce any derivation to its own inputs by construction. The central claim remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 3 invented entities

The central claims rest on standard SMT modeling assumptions for C arithmetic, the unverified hypothesis that the Mythos incident was arithmetic, and the introduction of three new named framework components without independent evidence or falsifiable predictions.

free parameters (1)
  • explicit safety bounds
    Used to obtain UNSAT guarantees; their selection directly affects whether a pattern is reported as safe.
axioms (1)
  • domain assumption Z3 correctly encodes C/C++ integer semantics for the targeted CWE patterns under the chosen bounds
    Invoked when mapping source code to SMT formulas for the four case studies.
invented entities (3)
  • VERDICT no independent evidence
    purpose: pre-execution constraints layer in the four-layer framework
    Newly named component with no independent evidence or implementation details provided.
  • DIRECTIVE-4 no independent evidence
    purpose: output control layer in the four-layer framework
    Newly named component with no independent evidence or implementation details provided.
  • SENTINEL no independent evidence
    purpose: runtime monitoring layer in the four-layer framework
    Newly named component with no independent evidence or implementation details provided.

pith-pipeline@v0.9.0 · 5589 in / 1654 out tokens · 35177 ms · 2026-05-10T00:01:19.208504+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    April 2026

    Anthropic.Project Glasswing: Securing Critical Soft- ware for the AI Era. April 2026. https://www. anthropic.com/glasswing

  2. [2]

    April 2026

    SC Media.Claude Mythos Preview identifies 27- year-old bug, finds thousands of zero-days in weeks. April 2026. https://www.scworld.com/news/ anthropic-claude-mythos-preview-finds-thousands-of-vulnerabilities-in-weeks

  3. [3]

    It Cost Less Than $50 to Find It.Predict / Medium, April

    Ellis, M.An AI Found a 27-Year-Old Bug Hiding in OpenBSD. It Cost Less Than $50 to Find It.Predict / Medium, April

  4. [4]

    https://medium.com/predict/ an-ai-found-a-27-year-old-bug-hiding-in-openbsd-it-cost-less-than-50-to-find-it-489064e9178c

  5. [5]

    GenAI / Medium, April

    Vedi, S.Claude Mythos: The AI That Hacked Every OS and Escaped Its Own Cage. GenAI / Medium, April

  6. [6]

    https://medium.com/@shubhamnv2/ claude-mythos-the-ai-that-hacked-every-os-and-escaped-its-own-cage-2eabae94b898

  7. [7]

    wolfSSL Pull Request #10096: Fix signed left-shift UB in ML-DSA w1 coefficient encoding

    Adelsbach, P. wolfSSL Pull Request #10096: Fix signed left-shift UB in ML-DSA w1 coefficient encoding. March 2026. https://github.com/wolfSSL/ wolfssl/pull/10096

  8. [8]

    Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

    Blain, D.Broken by Default: A Z3 Formal Verification Study of AI-Generated C/C++ Code. arXiv:2604.05292, April 2026

  9. [9]

    AAAI Workshop on AI and Ethics, 2015

    Soares, N., Fallenstein, B., Yudkowsky, E., and Arm- strong, S.Corrigibility. AAAI Workshop on AI and Ethics, 2015

  10. [10]

    et al.Cooperative Inverse Reinforce- ment Learning

    Hadfield-Menell, D. et al.Cooperative Inverse Reinforce- ment Learning. NeurIPS, 2016

  11. [11]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Ruan, Y . et al.Identifying the Risks of LM Agents with an LM-Emulated Sandbox. arXiv:2309.15817, 2023

  12. [12]

    et al.seL4: Formal Verification of an OS Kernel

    Klein, G. et al.seL4: Formal Verification of an OS Kernel. ACM SOSP, 2009

  13. [13]

    et al.CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels

    Gu, R. et al.CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. USENIX OSDI, 2016

  14. [14]

    USENIX OSDI, 2008

    Cadar, C., Dunbar, D., and Engler, D.KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. USENIX OSDI, 2008. 11

  15. [15]

    et al.SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis

    Shoshitaishvili, Y . et al.SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. IEEE S&P, 2016

  16. [16]

    Viking, 2019

    Russell, S.Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019

  17. [17]

    When RV Meets CEP

    Hall´e, S. “When RV Meets CEP.”Proc. 16th Int’l Conf. on Runtime Verification (RV’16), Lecture Notes in Computer Science, vol. 10012, Springer, 2016, pp. 232–238. 12