Recognition: no theorem link
PlanCompiler: A Deterministic Compilation Architecture for Structured Multi-Step LLM Pipelines
Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3
The pith
PlanCompiler separates planning from execution in LLM pipelines, using a typed registry and static validation so that only verified plans are compiled into executable code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlanCompiler produces a typed JSON plan over a fixed registry of primitives, validates the plan against explicit structural and type constraints, and compiles only validated plans into executable Python, leading to higher first-pass success rates compared to autoregressive code generation baselines across the evaluated tasks.
What carries the argument
The typed node registry with static graph validation and deterministic compilation, which enforces structural and type constraints on the plan before any execution occurs.
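As a concrete reading of that mechanism, a registry of typed primitives plus a pre-execution type check over plan edges might look like the sketch below. All names, signatures, and the linear-plan shape are our assumptions for illustration, not the paper's actual registry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str
    inputs: tuple   # expected input types, in order
    output: str     # type the primitive produces

# A toy registry; the paper's real registry and signatures are not shown.
REGISTRY = {
    "load_csv":  Primitive("load_csv",  (), "DataFrame"),
    "filter":    Primitive("filter",    ("DataFrame",), "DataFrame"),
    "aggregate": Primitive("aggregate", ("DataFrame",), "Series"),
    "to_sql":    Primitive("to_sql",    ("DataFrame",), "None"),
}

def validate_plan(plan):
    """Statically type-check a linear plan before anything executes.

    Each step is {"op": name, "arg": index of an earlier step or None}.
    Returns a list of violations; an empty list means the plan may compile.
    """
    errors, produced = [], []
    for i, step in enumerate(plan):
        prim = REGISTRY.get(step["op"])
        if prim is None:
            errors.append(f"step {i}: unknown primitive {step['op']!r}")
            produced.append(None)
            continue
        if prim.inputs:
            src = step.get("arg")
            if src is None or src >= i or produced[src] != prim.inputs[0]:
                errors.append(f"step {i}: type mismatch for {prim.name!r}")
        produced.append(prim.output)
    return errors

# A Series flows into to_sql, which expects a DataFrame: caught statically.
plan = [{"op": "load_csv", "arg": None},
        {"op": "aggregate", "arg": 0},
        {"op": "to_sql", "arg": 1}]
print(validate_plan(plan))  # ["step 2: type mismatch for 'to_sql'"]
```

The point of the sketch is the ordering: the type error surfaces before any compilation or execution, which is where the architecture locates its reliability gains.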
Load-bearing premise
The fixed registry of primitives is expressive enough to cover all target workflows without requiring operations outside the registry.
What would settle it
A workflow task that requires a primitive or operation absent from the registry and that cannot be expressed through composition of available primitives, causing plan generation to fail.
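A minimal sketch of this falsification condition, using an invented registry (none of these names come from the paper): a task whose required operation is absent and, by assumption, not expressible by composing the available primitives.

```python
# Hypothetical registry of primitive names; invented for illustration.
REGISTRY = {"load_csv", "filter", "aggregate", "to_sql"}

def uncoverable_ops(required_ops):
    """Operations a task needs that the registry cannot supply directly."""
    return set(required_ops) - REGISTRY

# A reshaping workflow needs a wide-to-long pivot; if no composition of
# registry primitives expresses it, plan generation must fail up front.
print(uncoverable_ops({"load_csv", "pivot", "to_sql"}))  # {'pivot'}
```

A real settlement of the premise would also have to rule out compositional coverage, which this direct-membership check deliberately does not model.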
Original abstract
Large language models (LLMs) remain brittle in multi-step structured workflows, where errors compound across sequential transformations, validation stages, and stateful operations such as SQL persistence. We present PlanCompiler, a compilation architecture for structured LLM pipelines that separates planning from execution through a typed node registry, static graph validation, and deterministic compilation. Instead of relying on autoregressive chaining at runtime, the system first produces a typed JSON plan over a fixed registry of primitives, validates that plan against explicit structural and type constraints, and compiles only validated plans into executable Python. We evaluate the approach on a 300-task benchmark covering increasing workflow depth, SQL roundtrip persistence, and schema-themed stress tests. In this setting, PlanCompiler achieves 100% first-pass success on Sets A and B, 88% on Set C, 96% on Set D, 88% on schema-trap tasks, and 84% on SQL roundtrip tasks, outperforming direct free-form code-generation baselines from GPT-4.1 and Claude Sonnet on five of six benchmark sets and achieving 278/300 successes overall versus 202/300 and 187/300 for the two baselines, respectively. Across the full suite, planning cost is approximately \$0.356, compared with \$2.140 for GPT-4.1 and \$18.391 for Claude, while maintaining competitive end-to-end latency. These results suggest that, for registry-constrained structured data workflows, deterministic compilation can improve first-pass reliability and cost efficiency relative to free-form code generation. Residual failures are concentrated in two narrow classes: late output-contract errors on aggregation tasks and early type mismatches at the SQLite persistence boundary, clarifying both the benefits and the current limits of the approach.
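The abstract's three-stage pipeline (plan, validate, compile) ends in a deterministic lowering step. A minimal sketch of that last stage follows; the templates and step layout are invented for illustration and are not the paper's actual code generator.

```python
# Deterministic lowering of a validated linear plan into Python source.
# TEMPLATES and the step shape are assumptions, not the paper's design.
TEMPLATES = {
    "load_csv":  "v{i} = pd.read_csv({params!r})",
    "filter":    "v{i} = v{src}[v{src}[{params!r}] > 0]",
    "aggregate": "v{i} = v{src}.sum()",
}

def compile_plan(plan):
    """Emit Python source for an already-validated plan.

    No model is consulted here, so the same plan always compiles to the
    same source: all stochasticity is confined to the planning step.
    """
    lines = ["import pandas as pd"]
    for i, step in enumerate(plan):
        lines.append(TEMPLATES[step["op"]].format(
            i=i, src=step.get("arg"), params=step.get("params")))
    return "\n".join(lines)

plan = [{"op": "load_csv", "arg": None, "params": "data.csv"},
        {"op": "filter", "arg": 0, "params": "amount"},
        {"op": "aggregate", "arg": 1}]
print(compile_plan(plan))
```

Because compilation is a pure function of the plan, a validated plan can be cached and recompiled at zero model cost, which is consistent with the large planning-cost gap the abstract reports.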
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PlanCompiler, a deterministic compilation architecture for structured multi-step LLM pipelines. It separates planning from execution using a typed node registry, static graph validation, and compilation to Python. On a 300-task benchmark covering workflow depth, SQL persistence, and schema stress tests, it reports 278/300 first-pass successes (100% on Sets A/B, 88% on C, 96% on D, 88% schema-trap, 84% SQL roundtrip), outperforming GPT-4.1 (202/300) and Claude Sonnet (187/300) baselines at lower planning cost (~$0.356 vs $2.140/$18.391).
Significance. If the benchmark is representative and the registry sufficiently general, the work demonstrates that registry-constrained planning with static validation can deliver substantially higher first-pass reliability and lower cost than free-form code generation for structured data workflows, providing concrete empirical support for deterministic compilation in LLM-based software pipelines.
Major comments (2)
- [Abstract] Abstract and evaluation section: The headline results (278/300 overall, 100% on Sets A and B) rest on the unstated assumption that every operation required by the 300 tasks is present in the fixed primitive registry, and that tasks were not constructed by enumerating registry operations first. Without a description of the task-generation process or of registry completeness, the comparison to unconstrained baselines tests constraint adherence rather than the compilation architecture's robustness on arbitrary workflows.
- [Evaluation] No ablation is reported that isolates the contribution of the static graph validator from that of the typed registry alone, making it impossible to determine whether the performance gains depend on the validation step or simply on constraining generation to a small fixed set of primitives.
Minor comments (1)
- [Abstract] The abstract mentions 'residual failures concentrated in two narrow classes' but provides no quantitative breakdown or example traces for the 22 failures, which would aid verification.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] Abstract and evaluation section: The headline results (278/300 overall, 100% on Sets A and B) rest on the unstated assumption that every operation required by the 300 tasks is present in the fixed primitive registry, and that tasks were not constructed by enumerating registry operations first. Without a description of the task-generation process or of registry completeness, the comparison to unconstrained baselines tests constraint adherence rather than the compilation architecture's robustness on arbitrary workflows.
Authors: We agree that the manuscript should have explicitly described the task construction process and registry scope. The 300 tasks were generated by enumerating combinations of the 25 primitives in the typed registry (covering data ingestion, transformation, aggregation, SQL persistence, and schema operations) while varying depth, statefulness, and stress conditions; no task requires an operation outside the registry. This design intentionally evaluates the compilation architecture within a registry-constrained setting, which we argue is the relevant regime for reliable structured LLM pipelines. In the revised manuscript we will add a new subsection in Evaluation detailing the registry contents, the task-generation procedure, and an explicit statement of scope. We will also revise the abstract and conclusion to frame the results as demonstrating gains for registry-constrained workflows rather than claiming generality to fully arbitrary code generation. revision: yes
- Referee: [Evaluation] No ablation is reported that isolates the contribution of the static graph validator from that of the typed registry alone, making it impossible to determine whether the performance gains depend on the validation step or simply on constraining generation to a small fixed set of primitives.
Authors: This is a fair criticism; the current evaluation does not contain a quantitative ablation that runs the same tasks with the typed registry but without the static validator. The architecture integrates the two components, and we did not execute invalid plans. We can, however, report that 14 of the 22 failures were type or structural violations that the validator is designed to catch before compilation. In revision we will add a dedicated paragraph in the Evaluation section that (a) enumerates the failure modes with examples of validator-rejected plans, (b) provides a qualitative analysis of how validation prevents downstream execution errors, and (c) states the limitation that a controlled ablation was not performed. If space and time permit, we will also run a limited ablation on a 50-task subset to supply quantitative numbers; otherwise the discussion will clearly note the absence of such data. revision: partial
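The class of failure at issue, a type mismatch at the SQLite persistence boundary as named in the abstract's failure analysis, can be reproduced in miniature. The snippet below is our illustrative reconstruction, not a trace from the paper.

```python
# sqlite3 binds only a few Python types (None, int, float, str, bytes), so a
# plan step that forgets to serialize a structured value fails at execution
# time. This is the kind of error a static validator would move up front.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (payload TEXT)")

try:
    # An unserialized dict reaches the SQL boundary: rejected by the driver.
    conn.execute("INSERT INTO results VALUES (?)", ({"total": 42},))
except sqlite3.InterfaceError as exc:
    print("boundary rejection:", exc)

# A typed plan boundary can force serialization before compilation instead:
conn.execute("INSERT INTO results VALUES (?)", (json.dumps({"total": 42}),))
print(conn.execute("SELECT payload FROM results").fetchone()[0])
```

The runtime error surfaces only when the bad value reaches the driver; a registry that types the persistence boundary as `str` would reject the same plan before any SQL executes, which is the rebuttal's qualitative argument in concrete form.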
Circularity Check
No circularity: empirical results on a benchmark, with no self-referential derivations or fitted predictions.
Full rationale
The paper presents a systems architecture (typed node registry, static validation, deterministic compilation) and reports direct empirical success rates (278/300 overall, 100% on sets A/B, etc.) against external baselines (GPT-4.1, Claude). No equations, mathematical derivations, or 'predictions' appear in the provided text. No self-citations, ansatzes, or uniqueness theorems are invoked. The benchmark results are presented as measured outcomes rather than quantities forced by construction from fitted parameters or registry definitions. The central claims therefore rest on external comparison and do not reduce to the inputs by the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the typed node registry is sufficient to express the workflows in the benchmark and target applications.
- Domain assumption: the 300-task benchmark, with its depth, SQL, and schema stress tests, is representative of practical structured LLM pipelines.
Invented entities (1)
- Typed node registry and static graph validator (no independent evidence)
Reference graph
Works this paper leans on
- [1] LangChain, Inc. LangChain, 2026. URL https://www.langchain.com/. Official project website, accessed 2026-03-19.
- [2] LlamaIndex. LlamaIndex, 2026. URL https://developers.llamaindex.ai/python/framework/. Official documentation, accessed 2026-03-19.
- [3] Hugging Face. smolagents documentation, 2026. URL https://huggingface.co/docs/smolagents/index. Official documentation, accessed 2026-03-19.
- [4] Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM compiler for parallel function calling. arXiv preprint arXiv:2312.04511, 2023. doi: 10.48550/arXiv.2312.04511. URL https://arxiv.org/abs/2312.04511.
- [5] Yifu Lu, Shengjie Liu, and Li Dong. OrchDAG: Complex tool orchestration in multi-turn interactions with plan DAGs. arXiv preprint arXiv:2510.24663, 2025. URL https://arxiv.org/abs/2510.24663.
- [6] Yuhang Ge, Yachuan Liu, Zhangyan Ye, Yuren Mao, and Yunjun Gao. Text-to-Pipeline: Bridging natural language and data preparation pipelines. arXiv preprint arXiv:2505.15874, 2025. doi: 10.48550/arXiv.2505.15874. URL https://arxiv.org/abs/2505.15874.
- [7] Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. AutoPandas: Neural-backed generators for program synthesis. Proceedings of the ACM on Programming Languages, 3(OOPSLA):168:1–168:27, 2019. doi: 10.1145/3360594. URL https://dl.acm.org/doi/10.1145/3360594.
- [8] Wes McKinney. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56–61, 2010.
- [9] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram K. Rajamani, and Rahul Sharma. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022. URL https://arxiv.org/abs/2112.02969.
- [10] Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, and Ramesh Radhakrishnan. Polaris: Typed planning and governed execution for agentic AI in back-office automation. arXiv preprint arXiv:2601.11816, 2026. doi: 10.48550/arXiv.2601.11816. URL https://arxiv.org/abs/2601.11816.