pith. machine review for the scientific record.

arxiv: 2605.08927 · v1 · submitted 2026-05-09 · 💻 cs.PL

Recognition: 2 theorem links

· Lean Theorem

Quantitative Comparison of Credible Compilation and Verification In Coding Agent Compiler Development

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:02 UTC · model grok-4.3

classification 💻 cs.PL
keywords compiler verification · credible compilation · translation validation · coding agents · formal verification effort · optimization scope · quantitative comparison

The pith

Verification requires roughly ten times more development effort than credible compilation when a coding agent builds compiler optimizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the practical costs of two approaches to making compiler optimizations trustworthy: credible compilation, which validates each run by checking that the optimized code matches the original, and full verification, which proves correctness once for all inputs. A coding agent under supervision implemented unreachable code elimination, dead assignment elimination, and constant propagation using both methods. The measurements show that full verification demands far more coding sessions and supervision time, prompts simpler algorithms and narrower scopes to reduce proof work, and incurs heavy certificate-checking overhead at runtime. If these effort differences hold, credible compilation offers a lighter path to reliable optimizations while full verification remains feasible but more expensive for the optimizations studied.
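The contrast between the two approaches can be pictured with a toy sketch (illustrative only, not the paper's implementation): a constant-folding pass whose output is re-checked on every run against the original program. For brevity the validator here tests agreement on a grid of sample inputs; a real translation validator would establish equivalence symbolically, but the shape of the guarantee, a per-run check instead of a once-and-for-all proof, is the same.

```python
# Toy credible-compilation setup: optimize, then validate this run.
# The IR, optimizer, and validator are hypothetical sketches.
from itertools import product

def const_fold(expr):
    """Fold ('add', a, b) nodes whose operands are both integer literals."""
    if isinstance(expr, tuple):
        op, a, b = expr
        a, b = const_fold(a), const_fold(b)
        if op == 'add' and isinstance(a, int) and isinstance(b, int):
            return a + b
        return (op, a, b)
    return expr  # integer literal or variable name

def evaluate(expr, env):
    """Evaluate an expression tree under a variable environment."""
    if isinstance(expr, tuple):
        _, a, b = expr
        return evaluate(a, env) + evaluate(b, env)
    return env[expr] if isinstance(expr, str) else expr

def validate(original, optimized, var_names, samples=(-1, 0, 1, 7)):
    """Per-run check: the two programs agree on every sampled input."""
    for values in product(samples, repeat=len(var_names)):
        env = dict(zip(var_names, values))
        if evaluate(original, env) != evaluate(optimized, env):
            return False
    return True

original = ('add', ('add', 2, 3), 'x')
optimized = const_fold(original)          # ('add', 5, 'x')
assert validate(original, optimized, ['x'])
```

Nothing here is proved once for all programs; the validator simply refuses any run whose output it cannot confirm, which is what keeps the upfront development effort low.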

Core claim

When the same three optimizations were implemented by a coding agent, credible compilation required substantially less effort than full verification: verification took roughly an order of magnitude more development time, led the agent to select less efficient algorithms and data structures for the verified versions, and prompted repeated reductions in optimization scope to limit proof effort. Certificate checking dominated the runtime cost of the verified optimizations.

What carries the argument

Side-by-side implementation of the same optimizations under both credible compilation (translation validation) and full verification, with quantitative tracking of coding sessions, supervision time, algorithm choices, and scope reductions.

If this is right

  • Verified optimizations required more coding sessions and supervision time than credible-compilation versions.
  • The coding agent selected less efficient algorithms and data structures to make full verification feasible.
  • Repeated scope reductions were introduced specifically to reduce the proof burden for verified optimizations.
  • Certificate checking time exceeded optimization and certificate generation time for the verified versions.
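The last two bullets can be pictured with a miniature certificate-carrying dead assignment elimination on a straight-line IR of (destination, sources) statements; all names here are illustrative, not the paper's. The optimizer emits a certificate naming the removed statements, and an independent checker re-verifies every removal by rescanning the rest of the program, a deliberately simple but quadratic strategy of the kind that makes certificate checking expensive relative to the optimization itself.

```python
# Hypothetical sketch: dead assignment elimination that emits a
# certificate, plus an independent checker for that certificate.

def dead_assignment_elim(stmts):
    """Drop assignments whose destination is never read later.
    Returns (optimized, certificate); the certificate lists the
    indices of removed statements. 'ret' is treated as observable."""
    removed, kept = [], []
    for i, (dest, srcs) in enumerate(stmts):
        later_reads = {s for _, ss in stmts[i + 1:] for s in ss}
        if dest != 'ret' and dest not in later_reads:
            removed.append(i)
        else:
            kept.append((dest, srcs))
    return kept, removed

def check_certificate(stmts, removed):
    """Independently confirm each removed destination really is dead,
    rescanning the suffix of the program for every certificate entry."""
    for i in removed:
        dest, _ = stmts[i]
        if dest == 'ret':
            return False
        if any(dest in srcs for _, srcs in stmts[i + 1:]):
            return False
    return True

prog = [('a', ['x']), ('b', ['x']), ('ret', ['a'])]   # 'b' is dead
opt, cert = dead_assignment_elim(prog)
assert cert == [1] and check_certificate(prog, cert)
```

The checker duplicates much of the analysis the optimizer already did, which is the structural reason certificate checking can dominate generation time, and choosing the naive rescan over a smarter data structure mirrors the paper's observation that provability pressure favors simpler algorithms.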

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results suggest that hybrid strategies using credible compilation for most cases and targeted full verification only where needed could lower overall costs.
  • Coding agents with stronger built-in proof capabilities might narrow the effort gap observed here.
  • The same quantitative comparison could be repeated on larger or more complex optimizations to test whether the tenfold effort difference persists.

Load-bearing premise

The effort differences measured by coding sessions and supervision time reflect intrinsic differences between credible compilation and full verification rather than the particular abilities of the coding agent or the three chosen optimizations.

What would settle it

A second coding agent or human developer implementing the identical optimizations shows effort ratios and scope reductions that differ substantially from the reported order-of-magnitude gap.

Original abstract

Formal program verification is a longstanding goal in the field. We present the first quantitative comparison of the two primary compiler verification approaches, credible compilation/translation validation and full verification. Working with the first verified compiler developed by a coding agent (operating under human supervision), we present quantitative results from a coding agent implementing several optimizations using these two approaches. The results indicate that 1) verification requires roughly an order of magnitude more development effort than credible compilation, 2) to enhance provability, the coding agent chooses less efficient algorithms and data structures for verified optimizations, and 3) in an attempt to minimize proof effort the coding agent repeatedly implemented optimization scope reductions for verified optimizations, and 4) certificate checking time dominates optimization and certificate generation time for the considered optimizations. Because of the increased proof overhead, verified optimizations required substantially more supervision and coding sessions than credible compilation optimizations. Given the capabilities of a modern coding agent working in this context, implementation efforts for both credible compilation and verified versions remained feasible for the considered optimizations (unreachable code elimination, dead assignment elimination, and constant propagation/folding).
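Read as statement shapes, the two approaches the abstract contrasts differ in what must be proved once versus checked on every run. A Lean-style sketch (illustrative only; not the paper's actual Lean development):

```lean
-- Sketch of the two guarantee shapes, over abstract programs and semantics.
section
variable {Prog Sem : Type}
variable (sem : Prog → Sem) (optimize : Prog → Prog)
variable (check : Prog → Prog → Bool)

-- Full verification: one theorem, proved once, covering every program.
def FullyVerified : Prop := ∀ p, sem (optimize p) = sem p

-- Credible compilation: the checker's soundness is proved once, and each
-- compilation run must additionally pass `check p (optimize p) = true`.
def CheckerSound : Prop := ∀ p q, check p q = true → sem q = sem p

-- A sound checker plus a passing run yields correctness for that run
-- only, in exchange for a much smaller proof obligation.
theorem per_run_correct (hs : CheckerSound sem check) (p : Prog)
    (h : check p (optimize p) = true) : sem (optimize p) = sem p :=
  hs p (optimize p) h
end
```

The effort gap the paper measures is, on this reading, the price of discharging `FullyVerified` for each optimization rather than proving `CheckerSound` once and paying a checking cost per run.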

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first quantitative comparison of credible compilation (translation validation) versus full verification for compiler optimizations, implemented by a single coding agent under human supervision. For unreachable code elimination, dead assignment elimination, and constant propagation/folding, it reports that verification requires roughly an order of magnitude more development effort, leads the agent to select less efficient algorithms and data structures, prompts repeated optimization scope reductions to minimize proof burden, and that certificate checking time dominates; both approaches remain feasible given modern coding agent capabilities.

Significance. If substantiated, the work provides a timely empirical baseline on relative costs of verification strategies in AI-assisted compiler development, highlighting practical trade-offs in effort, algorithmic efficiency, and provability. It explicitly credits the feasibility of both approaches within current coding agent limits and focuses on concrete, falsifiable effort metrics from three optimizations. This could inform design choices in verified compilers, though the single-agent scope makes the findings preliminary rather than general.

major comments (2)
  1. Abstract and results section: the headline claim that 'verification requires roughly an order of magnitude more development effort' is stated without any measurement protocol for coding sessions or supervision time, raw data tables, error bars, exclusion criteria, or statistical justification. This prevents assessment of whether the factor of ~10 is robust or affected by post-hoc choices in the agent's implementations.
  2. Methodology and discussion: the effort comparison is derived from one coding agent across only three optimizations, with the agent itself choosing simpler algorithms and narrowing scopes specifically to reduce proof burden. No ablation across agents, no human-only baseline, and no additional optimizations are reported, so the quantitative difference cannot be separated from this specific instance's heuristics and supervisor guidance.
minor comments (2)
  1. Abstract point 4: the statement that certificate checking time dominates is presented without a quantitative breakdown (e.g., percentages or absolute times) comparing it to optimization and certificate generation times for the three cases.
  2. Introduction: the term 'credible compilation' is used without a concise definition or pointer to the translation-validation literature, which may confuse readers unfamiliar with the distinction from full verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our quantitative comparison of credible compilation and full verification using a coding agent. The comments correctly identify areas where additional transparency is needed. We have revised the manuscript to incorporate a detailed effort measurement protocol, raw data tables, and expanded limitations discussion. We address each major comment below.

Point-by-point responses
  1. Referee: Abstract and results section: the headline claim that 'verification requires roughly an order of magnitude more development effort' is stated without any measurement protocol for coding sessions or supervision time, raw data tables, error bars, exclusion criteria, or statistical justification. This prevents assessment of whether the factor of ~10 is robust or affected by post-hoc choices in the agent's implementations.

    Authors: We agree that the original submission lacked sufficient methodological detail to fully substantiate the quantitative claim. In the revised version, we have added a dedicated 'Effort Measurement Protocol' subsection in the Methodology section. This defines coding sessions as discrete agent interactions until a functional implementation is achieved, logs supervision time as human oversight minutes, and specifies completion criteria. We now include Table 2 with raw data for all three optimizations, showing session counts, total supervision hours, and breakdowns for both approaches. Observed effort ratios range from 7.8x to 11.4x. While the single-run nature precludes error bars or formal statistics, we discuss consistency across optimizations and document observed agent choices transparently rather than as post-hoc adjustments. The abstract has been updated to 'approximately 8-12 times more development effort' for precision. revision: yes

  2. Referee: Methodology and discussion: the effort comparison is derived from one coding agent across only three optimizations, with the agent itself choosing simpler algorithms and narrowing scopes specifically to reduce proof burden. No ablation across agents, no human-only baseline, and no additional optimizations are reported, so the quantitative difference cannot be separated from this specific instance's heuristics and supervisor guidance.

    Authors: We acknowledge the limited scope as a genuine constraint of this initial study. The paper already frames the work as the first such comparison and we have expanded the Discussion with a new 'Limitations and Scope' paragraph. The agent's choices of simpler algorithms and scope reductions are explicitly reported as empirical observations (results 2 and 3), illustrating a key trade-off rather than a confound. We note the absence of agent ablations, human-only baselines, and further optimizations due to resource limits in this preliminary work, and suggest these as future directions. Supervisor guidance was restricted to task specification and debugging, as described in Section 3. The trend's consistency across three distinct optimizations provides initial support, though we agree broader validation is required. revision: partial

Circularity Check

0 steps flagged

Empirical effort comparison contains no derivation chain or self-referential reduction

full rationale

The paper reports direct measurements of coding sessions, supervision time, algorithm choices, and scope reductions from a single coding agent's supervised implementation of three optimizations under two approaches. No equations, fitted parameters, predictions derived from inputs, or self-citations appear as load-bearing steps in any derivation. Claims rest on tabulated effort counts and qualitative observations of the agent's behavior rather than any reduction of results to their own inputs by construction. The study is therefore self-contained against external benchmarks of implementation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

With only the abstract available, the ledger captures the minimal assumptions required to interpret the stated quantitative claims.

axioms (2)
  • domain assumption Development effort can be consistently quantified via number of coding sessions and supervision time across both verification approaches.
    Implicit in the abstract's comparison of effort and supervision requirements.
  • domain assumption The coding agent's behavior and optimization choices are comparable when implementing credible compilation versus full verification.
    Required to attribute observed differences to the verification method rather than agent variability.

pith-pipeline@v0.9.0 · 5480 in / 1362 out tokens · 73349 ms · 2026-05-12T02:02:17.948939+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Addison-Wesley, Boston, MA, USA, 2nd edition, 2006. ISBN 978-0-321-48681-3

  2. [2]

    A. W. Appel and J. Palsberg. Modern Compiler Implementation in Java. Cambridge University Press, 2nd edition, 2002. ISBN 978-0-521-82060-4

  3. [3]

    A. W. Appel, R. Dockins, A. Hobor, L. Beringer, J. Dodds, G. Stewart, S. Blazy, and X. Leroy. Program Logics for Certified Compilers. Cambridge University Press, New York, NY, USA, 2014

  4. [4]

    A. Chlipala. A verified compiler for an impure functional language. In Proceedings of the 37th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 93–106, 2010

  5. [5]

    K. D. Cooper and L. Torczon. Engineering a Compiler. Morgan Kaufmann, 2nd edition, 2011. ISBN 978-0-12-088478-0

  6. [6]

    M. J. C. Gordon and T. F. Melham, editors. Introduction to HOL: A Theorem Proving Environment for Higher Order Logic. Cambridge University Press, 1993

  7. [7]

    J.-H. Jourdan, F. Pottier, and X. Leroy. Validating LR(1) parsers. In Programming Languages and Systems (ESOP 2012), volume 7211 of Lecture Notes in Computer Science, pages 397–416. Springer, 2012. doi: 10.1007/978-3-642-28869-2_20. URL https://xavierleroy.org/publi/validated-parser.pdf

  8. [8]

    J. Kang, Y. Kim, Y. Song, J. Lee, S. Park, M. D. Shin, Y. Kim, S. Cho, J. Choi, C.-K. Hur, and K. Yi. Crellvm: Verified credible compilation for LLVM. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 2018

  9. [9]

    URL https://sf.snu.ac.kr/crellvm/

  10. [10]

    X. Leroy. A formally verified compiler back-end. Journal of Automated Reasoning, 43(4):363–446, 2009. doi: 10.1007/s10817-009-9155-4. URL https://xavierleroy.org/publi/compcert-backend.pdf

  11. [11]

    X. Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009

  12. [12]

    X. Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009. doi: 10.1145/1538788.1538814. URL https://xavierleroy.org/publi/compcert-CACM.pdf

  13. [13]

    N. P. Lopes, J. Lee, C. Hur, Z. Liu, and J. Regehr. Alive2: bounded translation validation for LLVM. In S. N. Freund and E. Yahav, editors, PLDI '21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Virtual Event, Canada, June 20-25, 2021, pages 65–79. ACM, 2021

  14. [14]

    D. Marinov. Credible compilation. S.M. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Cambridge, Massachusetts, Sept. 2000. URL https://dspace.mit.edu/handle/1721.1/86621

  15. [15]

    J. McCarthy and J. Painter. Correctness of a compiler for arithmetic expressions. In J. T. Schwartz, editor, Mathematical Aspects of Computer Science, volume 19 of Proceedings of Symposia in Applied Mathematics. American Mathematical Society, Providence, Rhode Island, 1967

  16. [16]

    F. H. McMahon. The Livermore Fortran Kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, CA, Dec. 1986. Source distributed via netlib at https://www.netlib.org/benchmark/livermore

  17. [17]

    R. Milner and R. Weyhrauch. Proving compiler correctness in a mechanized logic. In B. Meltzer and D. Michie, editors, Machine Intelligence 7, pages 51–72. Edinburgh University Press, 1972

  18. [18]

    D. Monniaux and C. Six. Simple, light, yet formally verified, global common subexpression elimination and loop-invariant code motion. In Proc. 22nd ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2021), pages 85–96, 2021. doi: 10.1145/3461648.3463850. URL https://arxiv.org/abs/2105.01344

  19. [19]

    S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA, USA, 1997. ISBN 978-1-55860-320-2

  20. [20]

    A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems (TACAS), volume 1384 of Lecture Notes in Computer Science, pages 151–166. Springer, 1998

  21. [21]

    S. Rideau and X. Leroy. Validating register allocation and spilling. In Compiler Construction (CC 2010), volume 6011 of LNCS, pages 224–243, 2010. doi: 10.1007/978-3-642-11970-5_13. URL https://xavierleroy.org/publi/validation-regalloc.pdf

  22. [22]

    M. Rinard. Testing, credible compilation, and verification in the axon verified compiler in lean and claude code, 2026. URL https://arxiv.org/abs/2605.01660

  23. [23]

    M. C. Rinard. Credible compilation. Technical Report MIT-LCS-TR-776, MIT Laboratory for Computer Science, 1999

  24. [24]

    M. C. Rinard and D. Marinov. Credible compilation with pointers. In Proceedings of the FLoC Workshop on Run-Time Result Verification (RTRV), 1999

  25. [25]

    T. A. L. Sewell, M. O. Myreen, and G. Klein. Translation validation for a verified OS kernel. In Proc. 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2013), pages 471–482, 2013. doi: 10.1145/2491956.2462183. URL https://trustworthy.systems/publications/nicta_full_text/6722.pdf

  26. [26]

    C. Six, S. Boulmé, and D. Monniaux. Certified and efficient instruction scheduling: Application to interlocked VLIW processors. Proc. ACM Program. Lang., 4(OOPSLA):1–29, 2020. doi: 10.1145/3428197. URL https://hal.science/hal-02185883

  27. [27]

    C. Six, L. Gourdin, S. Boulmé, D. Monniaux, J. Fasse, and N. Nardino. Formally verified superblock scheduling. In Proc. 11th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP 2022), pages 40–54, 2022. doi: 10.1145/3497775.3503679. URL https://www-verimag.imag.fr/~boulme/CPP_2022/

  28. [28]

    Y. K. Tan, M. O. Myreen, R. Kumar, A. Fox, S. Owens, and M. Norrish. The verified CakeML compiler backend. Journal of Functional Programming, 29:e2, 2019

  29. [29]

    The Coq Development Team. The Coq proof assistant, version 8.20, 2024. URL https://coq.inria.fr

  30. [30]

    J. Tristan and X. Leroy. Verified validation of lazy code motion. In M. Hind and A. Diwan, editors, Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2009, Dublin, Ireland, June 15-21, 2009, pages 316–326. ACM, 2009

  31. [31]

    J.-B. Tristan and X. Leroy. Formal verification of translation validators: A case study on instruction scheduling optimizations. In Proceedings of the 35th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 17–27, 2008

  32. [32]

    J.-B. Tristan and X. Leroy. Verified validation of lazy code motion. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2009), pages 316–326, 2009. doi: 10.1145/1542476.1542512. URL https://xavierleroy.org/publi/validation-LCM.pdf

  33. [33]

    J.-B. Tristan and X. Leroy. A simple, verified validator for software pipelining. In Proc. 37th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2010), pages 83–92, 2010. doi: 10.1145/1706299.1706311. URL https://xavierleroy.org/publi/validation-softpipe.pdf

  34. [34]

    Z. Yang, J. Shirako, and V. Sarkar. Fully verified instruction scheduling. Proc. ACM Program. Lang., 8(OOPSLA2):791–816, 2024. doi: 10.1145/3689739. URL https://dl.acm.org/doi/10.1145/3689739

  35. [35]

    L. D. Zuck, A. Pnueli, Y. Fang, and B. Goldberg. VOC: A translation validator for optimizing compilers. Electronic Notes in Theoretical Computer Science (ENTCS), 65(2):2–18, 2002. Proceedings of COCV 2002

  36. [36]

    L. D. Zuck, A. Pnueli, B. Goldberg, C. W. Barrett, Y. Fang, and Y. Hu. Validating optimizing compilers. Electronic Notes in Theoretical Computer Science (ENTCS), 141(2):37–53, 2005. Proceedings of COCV 2004