pith. machine review for the scientific record. sign in

arxiv: 2605.12731 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Finding a Crab in the C: Assured Translation via Comparative Symbolic Execution

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:51 UTC · model grok-4.3

classification 💻 cs.SE
keywords codetranslationautomatedbinarycozysafesystemsbinaries
0
0 comments X

The pith

cozy performs comparative symbolic execution on original and translated binaries to flag differences for developer review while proving equivalence elsewhere.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces cozy, a tool designed to help developers safely translate legacy code written in unsafe languages like C into memory-safe languages such as Rust or SPARK. Instead of trusting an automated translator completely, cozy compiles both the original and translated versions into binaries. It then runs symbolic execution on both binaries at the same time to explore their possible behaviors and identify any points where they diverge. Each difference is presented to the developer, who decides whether it represents an intentional change or an error introduced during translation. Once the developer has reviewed and approved the differences, the tool provides a formal guarantee that the two binaries match in all other respects. This process reduces the risk associated with automated translation by turning the problem into a manageable review task rather than requiring full trust in the translator. The approach is positioned for use in high-assurance systems where rewriting large amounts of existing C code would be too expensive or risky.

Core claim

Outside of the flagged differences, the binaries are formally verified to be equivalent. Consequently, the review process guarantees equivalence modulo changes approved by the developer.

Load-bearing premise

That symbolic execution on the compiled binaries can exhaustively detect all semantic differences without missing behaviors or generating unmanageable numbers of spurious differences, and that the binaries faithfully represent the source-level semantics.

Figures

Figures reproduced from arXiv: 2605.12731 by Caleb Helbling, Graham Leach-Krouse, Michael Crystal.

Figure 1
Figure 1. Figure 1: The architecture of how cozy uses comparative symbolic execution in its analysis. The resulting diff object can be used to generate a textual report, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: In this example we interactively explore the execution trees of a C [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
read the original abstract

Modern high-assurance software systems development favors memory safe languages such as SPARK (ADA) or Rust. However, developers often encounter non-memory safe code (e.g., C) in legacy systems and libraries which would be prohibitively expensive or risky to re-write. In response, developers have begun turning to machine learning/AI systems and other automated code translators. Automated translation comes with its own risks, however. The original and ported code are not precisely the same, semantically - otherwise there would be no point in performing the translation. To reduce these risks, we have developed cozy, a comparative binary analysis tool that simultaneously analyzes a binary compiled from "unsafe" source code and a binary compiled from a translation of the source code to a memory safe language. cozy walks the developer through differences in the behavior of the two binaries, presenting each difference and asking the user to assess whether the difference is intentional (good) or erroneous. Outside of the flagged differences, the binaries are formally verified to be equivalent. Consequently, the review process guarantees equivalence modulo changes approved by the developer. cozy has applications to automated translation, bug correction, code reviews, operation authorization, and automatic translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces cozy, a comparative symbolic execution tool that simultaneously analyzes binaries compiled from original C source and from its automated translation to a memory-safe language. It presents behavioral differences to the developer for approval or rejection as intentional changes; outside the flagged differences, the binaries are claimed to be formally verified equivalent, thereby guaranteeing equivalence modulo developer-approved modifications. The approach targets applications including automated translation assurance, bug correction, and code review.

Significance. If the central claim holds, the work would offer a practical method for reducing semantic risks in porting legacy unsafe code to safe languages, which is a pressing need in high-assurance systems. It extends standard symbolic execution to a comparative setting and could support reproducible review processes. However, the absence of any completeness argument, coverage metrics, or handling for path explosion substantially weakens the assessed significance, as the equivalence guarantee rests on unproven exhaustiveness assumptions.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (comparative symbolic execution description): the claim that binaries are 'formally verified to be equivalent' outside flagged differences lacks any algorithm outline, termination condition, or completeness argument for handling loops, recursion, pointer aliasing, and input-dependent control flow; standard symbolic execution on binaries is known to require bounds that can miss behaviors, directly undermining the verification guarantee.
  2. [§5] §5 (evaluation): no coverage metrics, path counts, or data on spurious differences are reported, leaving the weakest assumption—that symbolic execution exhaustively detects all semantic divergences without unmanageable false positives—unsupported for realistic programs.
  3. [§2] §2 (background and assumptions): the load-bearing premise that compiled binaries faithfully preserve source-level semantics is not addressed with respect to compiler optimizations or undefined behavior, which can introduce discrepancies invisible at the source level and outside the scope of the comparative analysis.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'cozy' without an initial definition or expansion; add a parenthetical on first use.
  2. [§3] Notation for symbolic states and difference flags is introduced without a summary table; a small notation table would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major point below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (comparative symbolic execution description): the claim that binaries are 'formally verified to be equivalent' outside flagged differences lacks any algorithm outline, termination condition, or completeness argument for handling loops, recursion, pointer aliasing, and input-dependent control flow; standard symbolic execution on binaries is known to require bounds that can miss behaviors, directly undermining the verification guarantee.

    Authors: We agree that §3 would benefit from greater detail. In the revision we will expand the algorithm description to explicitly outline the steps of comparative symbolic execution, state the configurable bounds on loop iterations and recursion depth used to guarantee termination, describe the alias analysis employed for pointers, and clarify that equivalence is formally established only over the explored paths. The 'formal verification' claim will be qualified to apply within these practical bounds rather than claiming unbounded exhaustiveness. revision: yes

  2. Referee: [§5] §5 (evaluation): no coverage metrics, path counts, or data on spurious differences are reported, leaving the weakest assumption—that symbolic execution exhaustively detects all semantic divergences without unmanageable false positives—unsupported for realistic programs.

    Authors: We accept that the evaluation section is currently light on quantitative data. We will revise §5 to include path counts, statement and branch coverage figures for each benchmark, and statistics on the number of differences surfaced together with how many were classified as spurious by the human reviewers. These additions will directly support the practicality claims. revision: yes

  3. Referee: [§2] §2 (background and assumptions): the load-bearing premise that compiled binaries faithfully preserve source-level semantics is not addressed with respect to compiler optimizations or undefined behavior, which can introduce discrepancies invisible at the source level and outside the scope of the comparative analysis.

    Authors: The tool operates directly on the compiled binaries; therefore any optimization or undefined-behavior effects present in the binaries are part of the semantics being compared and will be detected as differences if they diverge between the two binaries. We will add a short clarifying paragraph in §2 making this scope explicit and noting that the approach can surface compilation-induced discrepancies. revision: partial

standing simulated objections not resolved
  • A general, unbounded completeness argument for symbolic execution; the implementation relies on user-configurable bounds for termination and scalability, and no such argument can be supplied without additional program restrictions.

Circularity Check

0 steps flagged

No circularity: equivalence claim rests on standard symbolic execution without self-referential definitions or fitted predictions

full rationale

The paper introduces the cozy tool for comparative symbolic execution on binaries from original C and translated safe-language code. It identifies differences for developer review and claims formal equivalence outside flagged cases. This rests on applying established symbolic execution to detect behavioral divergences rather than deriving results from self-defined parameters, renaming known patterns, or load-bearing self-citations. No equations reduce the verification result to its own inputs by construction, and the method does not invoke uniqueness theorems or ansatzes from prior author work. The derivation chain is self-contained against external symbolic execution techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim depends on the effectiveness of symbolic execution for binary comparison and the assumption that binaries preserve source semantics; no free parameters or invented entities beyond the tool itself are introduced.

axioms (2)
  • domain assumption Compiled binaries accurately reflect the semantics of their source code.
    Required for differences detected at binary level to correspond to source-level translation issues.
  • domain assumption Symbolic execution can be applied to the binaries to explore and compare their behaviors.
    Core mechanism for identifying differences and proving equivalence.
invented entities (1)
  • cozy no independent evidence
    purpose: Tool for comparative symbolic execution on translated code binaries
    The main new artifact introduced to perform the assured translation process.

pith-pipeline@v0.9.0 · 5508 in / 1308 out tokens · 61369 ms · 2026-05-14T19:51:06.929993+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Software memory safety,

    “Software memory safety,” National Security Agency, Tech. Rep., Apr. 2023. [Online]. Available: https : / / media . defense . gov / 2022 / Nov / 10 / 2003112742/ - 1/ - 1/0/CSI SOFTW ARE MEMORY SAFETY .PDF

  2. [2]

    Memory safe languages: Reducing vulnerabilities in modern software development,

    “Memory safe languages: Reducing vulnerabilities in modern software development,” National Security Agency, Tech. Rep., Jun. 2025. [Online]. Available: https://media.defense.gov/2025/Jun/23/2003742198/- 1/ - 1 / 0 / CSI MEMORY SAFE LANGUAGES REDUCING VULNERABILITIES IN MODERN SOFTW ARE DEVELOPMENT.PDF

  3. [3]

    The case for memory safe roadmaps: Why both c-suite executives and technical experts need to take memory safe coding seriously,

    “The case for memory safe roadmaps: Why both c-suite executives and technical experts need to take memory safe coding seriously,” Cybersecurity and Infrastructure Security Agency, Tech. Rep., Dec. 2023. [Online]. Available: https://www.cisa.gov/sites/default/files/2023- 12/The-Case-for-Memory-Safe-Roadmaps-508c.pdf

  4. [4]

    Back to building blocks: A path towards secure and measurable software,

    “Back to building blocks: A path towards secure and measurable software,” Office of the National Cyber Director, Tech. Rep., Feb. 2024. [Online]. Available: https://web.archive.org/web/20240227002908/https: //www.whitehouse.gov/wp- content/uploads/2024/02/ Final-ONCD-Technical-Report.pdf

  5. [5]

    [Online]

    Galois and Immunant,C2rust demonstration. [Online]. Available: https://c2rust.com/

  6. [6]

    SafeTrans: LLM-assisted transpilation from C to rust,

    M. Farrukh, S. Shah, B. Coskun, and M. Polychronakis, “SafeTrans: LLM-assisted transpilation from C to rust,” arXiv [cs.CR], May 2025

  7. [7]

    Feedback loops and code perturbations in LLM-based software engineering: A case study on a C-to-rust translation system,

    M. Weiss, J. Hecking-Harbusch, J. Quante, and M. Woehrle, “Feedback loops and code perturbations in LLM-based software engineering: A case study on a C-to-rust translation system,”arXiv [cs.SE], Dec. 2025

  8. [8]

    Integrating rules and semantics for LLM- based C-to-rust translation,

    F. Luo et al., “Integrating rules and semantics for LLM- based C-to-rust translation,” in2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, Sep. 2025, pp. 685–696

  9. [9]

    SACTOR: LLM-driven correct and idiomatic C to rust translation with static analysis and FFI-based verification,

    T. Zhou et al., “SACTOR: LLM-driven correct and idiomatic C to rust translation with static analysis and FFI-based verification,”arXiv [cs.SE], Dec. 2025

  10. [10]

    Cozy: Comparative symbolic execution for binary programs,

    C. Helbling, G. Leach-Krouse, S. Lasser, and G. Sulli- van, “Cozy: Comparative symbolic execution for binary programs,” inProceedings of the 2025 Workshop on Binary Analysis Research, Reston, V A: Internet Soci- ety, 2025.DOI: 10 . 14722 / bar. 2025 . 23004 [Online]. Available: https://doi.org/10.14722/bar.2025.23004

  11. [11]

    SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis,

    Y . Shoshitaishvili et al., “SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis,” inIEEE Symposium on Security and Privacy, 2016.DOI: 10 . 1109/SP.2016.17 [Online]. Available: https://doi.org/10. 1109/SP.2016.17

  12. [12]

    Differential Symbolic Execution,

    S. Person, M. B. Dwyer, S. Elbaum, and C. S. P˘as˘areanu, “Differential Symbolic Execution,” inFoun- dations of Software Engineering (FSE), 2008,ISBN: 9781595939951.DOI: 10 . 1145 / 1453101 . 1453131 [Online]. Available: https://doi.org/10.1145/1453101. 1453131

  13. [13]

    Relational Symbolic Execution,

    G. P. Farina, S. Chong, and M. Gaboardi, “Relational Symbolic Execution,” inPrinciples and Practice of Programming Languages (PPDP), 2019.DOI: 10.1145/ 3354166.3354175 [Online]. Available: https://doi.org/ 10.1145/3354166.3354175

  14. [14]

    Division by in- variant integers using multiplication,

    T. Granlund and P. L. Montgomery, “Division by in- variant integers using multiplication,” inProceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, ser. PLDI ’94, Orlando, Florida, USA: Association for Computing Machinery, 1994, 61–72,ISBN: 089791662X.DOI: 10. 1145/178243.178249 [Online]. Available: https://doi. ...

  15. [15]

    Verifying dynamic trait objects in rust,

    A. VanHattum, D. Schwartz-Narbonne, N. Chong, and A. Sampson, “Verifying dynamic trait objects in rust,” inProceedings of the 44th International Conference on Software Engineering: Software Engineering in Prac- tice, ser. ICSE-SEIP ’22, Pittsburgh, Pennsylvania: As- sociation for Computing Machinery, 2022, 321–330, ISBN: 9781450392266.DOI: 10.1145/3510457...

  16. [16]

    Vert: Verified equivalent rust transpilation with large language models as few-shot learners,

    A. Z. Yang, Y . Takashima, B. Paulsen, J. Dodds, and D. Kroening, “Vert: Verified equivalent rust transpilation with large language models as few-shot learners,”arXiv preprint arXiv:2404.18852, 2024

  17. [17]

    RustAssure: Differential sym- bolic testing for LLM-transpiled C-to-rust code,

    Y . Bai and T. Palit, “RustAssure: Differential sym- bolic testing for LLM-transpiled C-to-rust code,”arXiv [cs.SE], Oct. 2025