Recognition: unknown
Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure
Pith reviewed 2026-05-10 00:01 UTC · model grok-4.3
The pith
A Z3-based verifier called COBALT detects integer overflow patterns in C/C++ sandbox code before deployment and aligns with the reported Mythos escape class under explicit assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COBALT is a Z3 SMT-based formal verification engine that identifies CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. It returns SAT verdicts accompanied by concrete witnesses or UNSAT guarantees when safety bounds are supplied. The engine was exercised on four production case studies (NASA cFE, wolfSSL, Eclipse Mosquitto, NASA F Prime) with reproducible encodings and verified solver output. Under explicit assumptions the Mythos escape class is shown to be consistent with a Z3-expressible CWE-190 formulation, implying that the same pre-deployment analysis would have been capable of surfacing the relevant pattern. The broader claim is that frontier
What carries the argument
COBALT, the Z3 SMT-based engine that translates arithmetic vulnerability patterns from C/C++ source into SAT queries and returns concrete witnesses or UNSAT safety results under explicit bounds.
If this is right
- COBALT returns SAT verdicts with concrete witnesses for arithmetic flaws in the validated production codebases.
- The four-layer framework (COBALT, VERDICT, DIRECTIVE-4, SENTINEL) maps pre-deployment verification to the failure modes exposed by the Mythos incident.
- Pre-deployment formal analysis is claimed to be capable of surfacing the arithmetic pattern consistent with the Mythos escape.
- Frontier-model safety requires formal verification of the containment stack itself rather than reliance on behavioral safeguards alone.
Where Pith is reading between the lines
- Adoption would require sandbox developers to supply explicit numeric bounds before each release so that UNSAT results can be treated as deployment-ready.
- The same encoding approach could be applied to other CWE classes once the arithmetic core is shown to generalize beyond the four studied codebases.
- Integration of COBALT into continuous-integration pipelines would turn each build into a formal safety gate for arithmetic properties.
Load-bearing premise
The Mythos escape vector must match a CWE-190 integer-overflow pattern that Z3 encodings can reliably capture, and the four production case studies must generalize to frontier-model sandbox infrastructure without additional unstated constraints on code structure or bounds.
What would settle it
A concrete instance of sandbox networking code containing an integer-overflow escape that produces UNSAT in COBALT yet still permits the reported escape behavior when executed.
Figures
read the original abstract
The April 2026 Claude Mythos sandbox escape exposed a critical weakness in frontier AI containment: the infrastructure surrounding advanced models remains susceptible to formally characterizable arithmetic vulnerabilities. Anthropic has not publicly characterized the escape vector; some secondary accounts hypothesize a CWE-190 arithmetic vulnerability in sandbox networking code. We treat this as unverified and analyze the vulnerability class rather than the specific escape. This paper presents COBALT, a Z3 SMT-based formal verification engine for identifying CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. We distinguish two classes of contribution. Validated: COBALT detects arithmetic vulnerability patterns in production codebases, producing SAT verdicts with concrete witnesses and UNSAT guarantees under explicit safety bounds. We demonstrate this on four production case studies: NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime, with reproducible encodings, verified solver output, and acknowledged security outcomes. Proposed: a four-layer containment framework consisting of COBALT, VERDICT, DIRECTIVE-4, and SENTINEL, mapping pre-deployment verification, pre-execution constraints, output control, and runtime monitoring to the failure modes exposed by the Mythos incident. Under explicit assumptions, we further argue that the publicly reported Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation and that pre-deployment formal analysis would have been capable of surfacing the relevant pattern. The broader claim is infrastructural: frontier-model safety cannot depend on behavioral safeguards alone; the containment stack itself must be subjected to formal verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COBALT, a Z3 SMT-based formal verification engine for detecting CWE-190/191/195 arithmetic vulnerability patterns in C/C++ code. It reports validated detections producing SAT verdicts with concrete witnesses and UNSAT guarantees on four production codebases (NASA cFE, wolfSSL, Eclipse Mosquitto, NASA F Prime) under explicit safety bounds, with claimed reproducible encodings and solver output. It proposes a four-layer containment framework (COBALT, VERDICT, DIRECTIVE-4, SENTINEL) and argues under explicit assumptions that the Mythos sandbox escape is consistent with a Z3-expressible CWE-190 formulation that pre-deployment analysis could have surfaced, concluding that frontier-model safety requires formal verification of the containment stack itself.
Significance. If the validated detections hold with reproducible encodings and the generalization to sandbox infrastructure is justified without unstated constraints on code structure, this could strengthen the case for integrating SMT-based pre-deployment verification into AI containment pipelines. The concrete-witness approach and explicit bounds in the four case studies represent a practical strength for formal methods in security-critical infrastructure.
major comments (2)
- [Abstract] Abstract and proposed contributions: The claim that the Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation (and thus that pre-deployment analysis would have surfaced the pattern) rests on treating secondary accounts as unverified hypotheses without independent grounding or characterization from Anthropic; this makes the infrastructural conclusion circular, as the pattern match is assumed to support the broader argument rather than demonstrated.
- [Validated case studies] Validated case studies section: Although the manuscript states that COBALT produces SAT verdicts with concrete witnesses, UNSAT guarantees, reproducible encodings, and verified solver output on NASA cFE, wolfSSL, Mosquitto, and F Prime, no actual Z3 encodings, integer bounds, overflow sites, or solver logs are provided; this prevents verification of the claims and undermines the assertion that these results generalize to frontier-model sandbox networking code without additional unstated constraints on code structure or mitigations.
minor comments (2)
- [Proposed framework] The four-layer framework introduces the terms VERDICT, DIRECTIVE-4, and SENTINEL without sufficient definitions, mappings to the failure modes, or comparison to existing standards, reducing clarity of the proposed contribution.
- [Abstract] The abstract references 'acknowledged security outcomes' for the four case studies but provides no citations, descriptions of the original vulnerabilities, or links to how the SAT results align with known issues in those codebases.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. We address each major comment below, providing clarifications and indicating where revisions will be made.
read point-by-point responses
-
Referee: [Abstract] Abstract and proposed contributions: The claim that the Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation (and thus that pre-deployment analysis would have surfaced the pattern) rests on treating secondary accounts as unverified hypotheses without independent grounding or characterization from Anthropic; this makes the infrastructural conclusion circular, as the pattern match is assumed to support the broader argument rather than demonstrated.
Authors: We explicitly state in the manuscript that we treat the Mythos escape as unverified and analyze the vulnerability class rather than the specific incident. The consistency argument is presented under explicit assumptions as an illustrative example of how a Z3-expressible pattern could apply, not as a demonstrated fact about the incident itself. The infrastructural conclusion—that frontier-model safety requires formal verification of the containment stack—is advanced as a general claim independent of the Mythos details. To prevent any appearance of circularity, we will revise the abstract and introduction to more prominently separate the hypothetical example, restate the assumptions, and clarify that the broader argument does not rely on the specific incident being proven. revision: partial
-
Referee: [Validated case studies] Validated case studies section: Although the manuscript states that COBALT produces SAT verdicts with concrete witnesses, UNSAT guarantees, reproducible encodings, and verified solver output on NASA cFE, wolfSSL, Mosquitto, and F Prime, no actual Z3 encodings, integer bounds, overflow sites, or solver logs are provided; this prevents verification of the claims and undermines the assertion that these results generalize to frontier-model sandbox networking code without additional unstated constraints on code structure or mitigations.
Authors: We agree that the absence of the concrete Z3 encodings, integer bounds, overflow sites, and solver logs in the manuscript prevents independent verification of the validated case studies. Although the text claims reproducibility, these artifacts are not included. We will add the specific encodings, bounds, sites, and logs as a dedicated appendix in the revised manuscript. This addition will also allow us to explicitly note any code-structure assumptions relevant to generalization toward sandbox networking code. revision: yes
Circularity Check
No significant circularity; Mythos argument is explicitly conditional on unverified assumptions
full rationale
The paper states it treats the Mythos escape as unverified and analyzes the vulnerability class rather than the specific incident. Validated contributions consist of COBALT detections on four independent production codebases (NASA cFE, wolfSSL, Mosquitto, F Prime) with reproducible SAT/UNSAT outputs and acknowledged security outcomes. The Mythos consistency claim is introduced only 'under explicit assumptions' and does not serve as a load-bearing premise for the validated results or the infrastructural recommendation. No equations, self-citations, fitted parameters, or ansatzes reduce any derivation to its own inputs by construction. The central claim remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- explicit safety bounds
axioms (1)
- domain assumption Z3 correctly encodes C/C++ integer semantics for the targeted CWE patterns under the chosen bounds
invented entities (3)
-
VERDICT
no independent evidence
-
DIRECTIVE-4
no independent evidence
-
SENTINEL
no independent evidence
Reference graph
Works this paper leans on
-
[1]
April 2026
Anthropic.Project Glasswing: Securing Critical Soft- ware for the AI Era. April 2026. https://www. anthropic.com/glasswing
2026
-
[2]
April 2026
SC Media.Claude Mythos Preview identifies 27- year-old bug, finds thousands of zero-days in weeks. April 2026. https://www.scworld.com/news/ anthropic-claude-mythos-preview-finds-thousands-of-vulnerabilities-in-weeks
2026
-
[3]
It Cost Less Than $50 to Find It.Predict / Medium, April
Ellis, M.An AI Found a 27-Year-Old Bug Hiding in OpenBSD. It Cost Less Than $50 to Find It.Predict / Medium, April
-
[4]
https://medium.com/predict/ an-ai-found-a-27-year-old-bug-hiding-in-openbsd-it-cost-less-than-50-to-find-it-489064e9178c
-
[5]
GenAI / Medium, April
Vedi, S.Claude Mythos: The AI That Hacked Every OS and Escaped Its Own Cage. GenAI / Medium, April
-
[6]
https://medium.com/@shubhamnv2/ claude-mythos-the-ai-that-hacked-every-os-and-escaped-its-own-cage-2eabae94b898
-
[7]
wolfSSL Pull Request #10096: Fix signed left-shift UB in ML-DSA w1 coefficient encoding
Adelsbach, P. wolfSSL Pull Request #10096: Fix signed left-shift UB in ML-DSA w1 coefficient encoding. March 2026. https://github.com/wolfSSL/ wolfssl/pull/10096
2026
-
[8]
Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
Blain, D.Broken by Default: A Z3 Formal Verification Study of AI-Generated C/C++ Code. arXiv:2604.05292, April 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
AAAI Workshop on AI and Ethics, 2015
Soares, N., Fallenstein, B., Yudkowsky, E., and Arm- strong, S.Corrigibility. AAAI Workshop on AI and Ethics, 2015
2015
-
[10]
et al.Cooperative Inverse Reinforce- ment Learning
Hadfield-Menell, D. et al.Cooperative Inverse Reinforce- ment Learning. NeurIPS, 2016
2016
-
[11]
Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Ruan, Y . et al.Identifying the Risks of LM Agents with an LM-Emulated Sandbox. arXiv:2309.15817, 2023
work page internal anchor Pith review arXiv 2023
-
[12]
et al.seL4: Formal Verification of an OS Kernel
Klein, G. et al.seL4: Formal Verification of an OS Kernel. ACM SOSP, 2009
2009
-
[13]
et al.CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels
Gu, R. et al.CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. USENIX OSDI, 2016
2016
-
[14]
USENIX OSDI, 2008
Cadar, C., Dunbar, D., and Engler, D.KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. USENIX OSDI, 2008. 11
2008
-
[15]
et al.SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis
Shoshitaishvili, Y . et al.SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. IEEE S&P, 2016
2016
-
[16]
Viking, 2019
Russell, S.Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019
2019
-
[17]
When RV Meets CEP
Hall´e, S. “When RV Meets CEP.”Proc. 16th Int’l Conf. on Runtime Verification (RV’16), Lecture Notes in Computer Science, vol. 10012, Springer, 2016, pp. 232–238. 12
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.