pith. machine review for the scientific record.

arxiv: 2604.05150 · v1 · submitted 2026-04-06 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links


Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords compiled AI · deterministic code generation · LLM workflow automation · token amortization · function calling · document intelligence · code safety · enterprise workflows

The pith

Large language models generate code once during compilation for deterministic execution in subsequent workflow runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes shifting from repeated large language model calls during workflow execution to a one-time compilation phase in which models produce executable code artifacts. By constraining generation to narrow business-logic functions embedded in validated templates, the approach trades some runtime flexibility for predictability, auditability, cost control, and reduced security exposure. It targets high-stakes enterprise settings such as healthcare, where reliability is critical. Evaluations on function-calling and document intelligence benchmarks show the method matches or exceeds direct model performance while eliminating runtime tokens and sharply cutting overall consumption at scale.

Core claim

Compiled AI generates executable code artifacts during an initial compilation phase using a constrained four-stage generation-and-validation pipeline, after which workflows execute deterministically with no further model invocations. On function-calling tasks, the system reaches 96% completion with zero execution tokens, breaks even with runtime inference after roughly 17 transactions, and reduces token consumption 57-fold at 1,000 transactions. On document intelligence tasks, a Code Factory variant matches direct LLM accuracy on key field extraction (80.0%) while reaching 80.4% on line item recognition, accompanied by 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.
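The amortization arithmetic behind the core claim is easy to check. In the sketch below, COMPILE and PER_CALL are hypothetical token counts chosen so that the paper's reported figures (break-even near 17 transactions, roughly 57x reduction at 1,000) fall out; the paper's actual per-call and compile costs are not reproduced here.

```python
def break_even(compile_tokens: float, runtime_tokens_per_call: float) -> float:
    """Transactions after which one-time compilation beats per-call inference."""
    return compile_tokens / runtime_tokens_per_call

def token_reduction(n: int, compile_tokens: float, runtime_tokens_per_call: float) -> float:
    """Ratio of cumulative per-call inference tokens to the fixed compile cost
    after n transactions (the '57x at 1,000' style figure)."""
    return (n * runtime_tokens_per_call) / compile_tokens

# Hypothetical placeholder costs, not values from the paper.
COMPILE = 35_000    # one-time compilation cost in tokens
PER_CALL = 2_000    # runtime inference cost per transaction in tokens

print(break_even(COMPILE, PER_CALL))              # → 17.5 transactions
print(token_reduction(1_000, COMPILE, PER_CALL))  # ≈ 57x at 1,000 transactions
```

The break-even point is just the ratio of the two costs, which is why the referee's question about hidden runtime model calls (major comment 3) matters: any fallback invocation shifts both numbers.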

What carries the argument

The four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts embedded in validated templates for narrow business-logic functions.

Load-bearing premise

That constraining generation to narrow business-logic functions inside validated templates produces production-ready code for complex enterprise workflows without unacceptable loss of coverage or adaptability.

What would settle it

A test set of novel enterprise workflows outside the template library where compiled code completion rates fall substantially below direct LLM invocation or where frequent recompilation becomes necessary.

Figures

Figures reproduced from arXiv:2604.05150 by Aaron Karlsberg (1), Anmol Sharma (1), Geert Trooskens (1), Lamara De Brouwer (1), Matthew Young (1), Max Van Puyvelde (2), Walter A. De Brouwer (2), John Thickstun (3), Gil Alterovitz (4). ((1) XY.AI Labs, Palo Alto, CA; (2) Stanford University School of Medicine, Stanford, CA; (3) Cornell University, Ithaca, NY; (4) Brigham and Women's Hospital / Harvard Medical School, Boston, MA.)

Figure 1. Token consumption comparison across baselines (BFCL, n=400). Compiled AI incurs a …
Figure 2. The code foundry architecture. Business intent (YAML) enters; validated Temporal …
Original abstract

We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stakes enterprise workflows, with particular emphasis on healthcare settings where reliability and auditability are critical. By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. We introduce (i) a system architecture for constrained LLM-based code generation, (ii) a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and (iii) an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost. We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). On function-calling, compiled AI achieves 96% task completion with zero execution tokens, breaking even with runtime inference at approximately 17 transactions and reducing token consumption by 57x at 1,000 transactions. On document intelligence, our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) while achieving the highest line item recognition accuracy (LIR: 80.4%). Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces 'Compiled AI' as a paradigm in which LLMs generate executable code artifacts during a one-time compilation phase, after which workflows run deterministically with no further model invocations. It presents a constrained generation architecture, a four-stage validation pipeline, and an evaluation framework, reporting 96% task completion on BFCL (n=400) with zero execution tokens, break-even at ~17 transactions, 57x token reduction at 1,000 transactions, competitive KILE/LIR scores on DocILE (n=5,680), and high accuracy on 135 security test cases.

Significance. If the pipeline reliably converts probabilistic outputs to deterministic artifacts, the work provides concrete evidence of substantial operational gains in cost, predictability, and security for enterprise workflows, especially in regulated domains. Strengths include direct empirical measurements on public benchmarks, absence of fitted parameters, and explicit amortization arithmetic; these elements make the efficiency claims falsifiable and reproducible.

major comments (3)
  1. [Methods / Pipeline Description] The four-stage generation-and-validation pipeline (introduced in the methods and used for all reported results) is described at a high level but provides no explicit rules for error handling, rejection criteria, or how model outputs are transformed into validated code artifacts; this detail is load-bearing for the determinism, 96% completion, and zero-execution-token claims.
  2. [Security Evaluation] Security evaluation reports 96.7% prompt-injection detection and 87.5% static-code-safety accuracy with zero false positives across 135 cases, yet neither the construction of the test suite nor the specific static-analysis tools or thresholds are specified; without this, the robustness of the security advantage cannot be assessed.
  3. [Evaluation / Operational Metrics] The amortization analysis (break-even at ~17 transactions, 57x reduction at 1,000) relies on the assumption that the compiled artifacts require no runtime model calls, but the manuscript does not report failure rates or fallback mechanisms when the generated code encounters out-of-distribution inputs; this affects the long-term reliability claim.
minor comments (2)
  1. [Abstract and Results] The abstract and evaluation sections use 'zero execution tokens' without clarifying whether this excludes any initial validation or logging overhead.
  2. [Results] Table or figure presenting the BFCL and DocILE metrics should include confidence intervals or statistical significance tests to support the cross-method comparisons.
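The second minor comment can be made concrete. A Wilson score interval for the headline 96% completion rate (384 of 400 on BFCL, the paper's figures) follows from the standard formula below; the resulting interval is our illustration of what the requested table annotation would show, not a number from the paper.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_interval(384, 400)   # 96% completion on BFCL, n = 400
print(f"[{lo:.3f}, {hi:.3f}]")       # → [0.936, 0.975]
```

An interval of roughly four percentage points' width is narrow enough to support the headline claim, but it shows why cross-method comparisons separated by a point or two need the requested significance tests.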

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive feedback. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Methods / Pipeline Description] The four-stage generation-and-validation pipeline (introduced in the methods and used for all reported results) is described at a high level but provides no explicit rules for error handling, rejection criteria, or how model outputs are transformed into validated code artifacts; this detail is load-bearing for the determinism, 96% completion, and zero-execution-token claims.

    Authors: We agree that the current description is high-level and that explicit details are needed to support the determinism and related claims. In the revised manuscript, we will expand the Methods section with a full specification of the four stages, including error handling rules (e.g., retry limits on validation failures), rejection criteria (e.g., permanent rejection after maximum retries), and the exact transformation steps from raw model output to validated artifacts (e.g., parsing, template instantiation, and static verification). We will also add pseudocode and a detailed flowchart. revision: yes

  2. Referee: [Security Evaluation] Security evaluation reports 96.7% prompt-injection detection and 87.5% static-code-safety accuracy with zero false positives across 135 cases, yet neither the construction of the test suite nor the specific static-analysis tools or thresholds are specified; without this, the robustness of the security advantage cannot be assessed.

    Authors: We acknowledge the omission of these details. The revised manuscript will include a dedicated subsection describing the construction of the 135 test cases (categories, generation method, and sources), the specific static-analysis tools employed, the rulesets applied, and the exact thresholds used to achieve the reported accuracies with zero false positives. revision: yes

  3. Referee: [Evaluation / Operational Metrics] The amortization analysis (break-even at ~17 transactions, 57x reduction at 1,000) relies on the assumption that the compiled artifacts require no runtime model calls, but the manuscript does not report failure rates or fallback mechanisms when the generated code encounters out-of-distribution inputs; this affects the long-term reliability claim.

    Authors: The reported amortization figures are computed exclusively over successfully compiled artifacts (consistent with the 96% BFCL completion rate). We agree that the manuscript should address OOD behavior. In revision we will add a limitations paragraph discussing observed failure modes on the evaluated benchmarks, the conditions under which recompilation would be triggered, and the resulting impact on long-term token and reliability claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reports direct empirical measurements on external benchmarks (BFCL n=400 for function-calling, DocILE n=5,680 for document intelligence) including 96% task completion, break-even at ~17 transactions, 57x token reduction, and security metrics. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the derivation chain. The architecture, four-stage pipeline, and evaluation framework are presented as independent contributions whose performance claims rest on observable results rather than internal reductions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about LLM code generation and validation rather than new entities or fitted constants.

axioms (2)
  • domain assumption Constraining LLM generation to narrow business-logic functions in validated templates produces correct, executable code for the target workflows
    Invoked in the description of the constrained generation step and the four-stage pipeline.
  • domain assumption The validation pipeline reliably converts probabilistic model output into production-ready deterministic artifacts
    Required for the claim that runtime execution needs zero further model invocations.

pith-pipeline@v0.9.0 · 5704 in / 1319 out tokens · 58404 ms · 2026-05-10T18:51:41.532588+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 19 canonical work pages · 3 internal anchors

  1. Anthropic. Introducing Claude Opus 4.5. Technical report, November 2025. https://www.anthropic.com/news/claude-opus-4-5
  2. Nurullah Atil, Pedro Henrique Luz de Araujo, and Benjamin Roth. Non-Determinism of "Deterministic" LLM Settings. arXiv:2408.04667, August 2024.
  3. Mert Cemri et al. Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657, March 2025.
  4. Craig Councilman et al. Astrogator: Towards Formal Verification of LLM-Generated Code. arXiv:2507.13290, July 2025.
  5. David Dalrymple et al. Towards Guaranteed Safe AI. arXiv:2405.06624, May 2024.
  6. Yixin Dong et al. XGrammar. arXiv:2411.15100, November 2024.
  7. GitClear. AI Copilot Code Quality: 2025 Data. Technical report, February 2025. https://www.gitclear.com/ai_assistant_code_quality_2025_research
  8. Sirui Hong et al. MetaGPT. In ICLR, 2024.
  9. Andrej Karpathy. Spec-driven development. X (Twitter), January 2026. https://x.com/karpathy/status/1883601522500329783
  10. Omar Khattab et al. DSPy. arXiv:2310.03714, October 2023.
  11. METR. Measuring the Impact of Early-2025 AI Models on Developer Productivity. Technical report.
  12. Adarsh Neupane et al. Towards a HIPAA Compliant Agentic AI System. arXiv:2504.17669, April 2025.
  13. OpenAI. API Pricing. Retrieved 2026. https://openai.com/api/pricing
  14. Shuyin Ouyang et al. Non-determinism of ChatGPT in Code Generation. arXiv:2308.02828, August 2023.
  15. Alex Pan et al. Measuring Agents in Production. arXiv:2512.04123, December 2025.
  16. Shubham Rao et al. Beyond Synthetic Benchmarks. arXiv:2510.26130, October 2025.
  17. Baptiste Rozière et al. Code Llama: Open Foundation Models for Code. arXiv:2308.12950, August 2023.
  18. Yizhou Shi et al. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv:2506.12286, June 2025.
  19. Shubham Ugare et al. SynCode: LLM Generation with Grammar Augmentation. arXiv:2403.01632, March 2024.
  20. Veracode. 2025 GenAI Code Security Report. Technical report, September 2025.
  21. Yucheng Xu et al. Benchmark Data Contamination of Large Language Models: A Survey. arXiv:2406.04244, June 2024.
  22. Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is Inevitable. arXiv:2401.11817, January 2024.
  23. Yichen Yang et al. Rethinking Benchmark and Contamination. arXiv:2311.04850, November 2023.

Also recovered during reference extraction, a fragment of the paper's Listing 2 (Compiled artifact isolating probabilistic extraction from deterministic decisions):

    … AND has_step_therapy_failure THEN APPROVE ELSE DENY

    @workflow.defn
    class PriorAuthWorkflow:
        @workflow.run
        async def run(self, input_data: WorkflowInput) -> AuthResult:
            # PHASE 1: BOUNDED AGENTIC INVOCATION
            clinical_data = await workflow.execute_activity( ...