Toward Secure and Reliable PDDL Formalization of Large Language Models with Planner-in-the-Loop Feedback

Daniel Zeng; Feifei Mo; Jiajing Zhang; Jiamei Jiang; Linjing Li

arxiv: 2606.29700 · v1 · pith:BC6D7F6Jnew · submitted 2026-06-29 · 💻 cs.AI

Pith reviewed 2026-06-30 06:39 UTC · model grok-4.3

keywords: PDDL formalization · large language models · planner feedback · NL-PDDL-Bench · plan executability · preference optimization · safety-critical planning

The pith

Integrating planner feedback during training and revision enables LLMs to generate more reliable PDDL specifications from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a benchmark called NL-PDDL-Bench for turning natural language descriptions into PDDL specifications that planners can execute and verify. It proposes a planner-in-the-loop framework that uses error signals from the planner and validator to make targeted fixes to faulty specifications. The approach trains models with supervised fine-tuning and preference optimization drawn from planner results, then applies repairs at inference time without needing live planner access during training. Experiments across model families report higher rates of successful plan generation and greater agreement with reference plans. The work targets safer use of LLMs in systems where planning errors could lead to execution failures or unsafe actions.

Core claim

By combining validator and planner diagnostics for localized revision of non-executable PDDL specifications with a planner-grounded training recipe that uses offline preference pairs, the method produces large language models that achieve substantially higher planner success rates and plan-level consistency, with gains that hold under increasing object counts and across domains.

What carries the argument

The planner-in-the-loop framework, which applies localized edits to PDDL specifications based on diagnostics from the validator and planner.
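
A minimal sketch of such a repair loop follows. It is not the authors' implementation: `validate`, `plan`, and `revise` are hypothetical callables standing in for a PDDL validator, a planner, and an LLM-based editor, and the iteration cap and fallback to the original specification are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Diagnosis:
    ok: bool            # True when the specification passed this check
    message: str = ""   # validator or planner error text, empty on success


def repair_loop(
    spec: str,
    validate: Callable[[str], Diagnosis],   # hypothetical syntax/semantic check
    plan: Callable[[str], Diagnosis],       # hypothetical planner call
    revise: Callable[[str, str], str],      # hypothetical LLM edit: (spec, error) -> spec
    max_rounds: int = 3,                    # assumed cap on revisions
) -> Tuple[str, bool]:
    """Return (specification, executable?) after at most max_rounds revisions."""
    current = spec
    for round_idx in range(max_rounds + 1):
        # Check syntax first; only call the planner on a valid specification.
        diag = validate(current)
        if diag.ok:
            diag = plan(current)
        if diag.ok:
            return current, True
        if round_idx == max_rounds:
            break
        # Feed the diagnostic text back so the edit can stay localized.
        current = revise(current, diag.message)
    # Assumed fallback: give up and return the untouched original.
    return spec, False
```

Passing the checks in as callables keeps the loop independent of any particular validator or planner, so the same control flow can be exercised with stubs.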

If this is right

Planner success rates rise on generated specifications across tested model families.
Gains in performance persist as the number of objects in problems increases.
Cross-domain consistency improves without domain-specific changes to the method.
Training proceeds without any online planner calls, relying only on offline data for preference optimization.
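
The offline pairing behind the last point could be sketched as follows. This assumes planner outcomes were logged once ahead of training; the `Candidate` record and the `prompt`/`chosen`/`rejected` field names are illustrative conventions, not details confirmed by the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    prompt: str    # natural-language task description
    spec: str      # generated PDDL text
    solved: bool   # logged planner outcome (True if a plan was found)


def build_preference_pairs(candidates: List[Candidate]) -> List[Dict[str, str]]:
    """Pair every planner-solved candidate with every unsolved one per prompt."""
    by_prompt = defaultdict(list)
    for cand in candidates:
        by_prompt[cand.prompt].append(cand)

    pairs = []
    for prompt, group in by_prompt.items():
        chosen = [c for c in group if c.solved]
        rejected = [c for c in group if not c.solved]
        # Prompts with only successes or only failures yield no pairs.
        for good, bad in product(chosen, rejected):
            pairs.append(
                {"prompt": prompt, "chosen": good.spec, "rejected": bad.spec}
            )
    return pairs
```

Because the planner verdicts are read from a log, no planner process needs to run while the model is being optimized.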

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same diagnostic-driven revision pattern could apply to generating other executable formal languages beyond PDDL.
Reduced reliance on post-generation human checks might become feasible in automated logistics or robotics pipelines.
The benchmark's controlled difficulty scaling by object count offers a template for testing formalization robustness in other symbolic domains.

Load-bearing premise

That planner and validator diagnostics provide sufficient localized signals to revise non-executable PDDL specifications without introducing new errors or requiring domain-specific repair rules.

What would settle it

An experiment on the same models and benchmark where applying the planner-in-the-loop revision and optimization steps produces no increase, or a decrease, in planner success rates and plan agreement relative to standard fine-tuning baselines.

Figures

Figures reproduced from arXiv: 2606.29700 by Daniel Zeng, Feifei Mo, Jiajing Zhang, Jiamei Jiang, Linjing Li.

**Figure 2.** Performance under difficulty scaling.

**Figure 3.** Cross-domain generalization results. We further study cross-domain generalization to assess whether gains reflect domain-specific memorization or a general improvement in executable alignment. In many domains, syntax validity is already relatively high and changes modestly, whereas planner success and plan-level agreement improve substantially, indicating that the main ben…
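
A difficulty-scaling curve of the kind Figure 2 reports can be computed by bucketing planner outcomes by object count. The sketch below assumes each result is a hypothetical `(n_objects, solved)` pair; the paper's actual data layout is not shown on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_object_count(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each object count to the fraction of problems the planner solved."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for n_objects, ok in results:
        totals[n_objects] += 1
        solved[n_objects] += int(ok)
    # Sort keys so the output reads as a curve from easy to hard.
    return {n: solved[n] / totals[n] for n in sorted(totals)}
```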

Original abstract

Planning often requires symbolic specifications that are both executable and verifiable. For large language models deployed in autonomous or decision-support systems, failures in such formalization may lead to unverifiable decisions, execution failures, or unsafe downstream behavior. We present NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL specification construction with planner-verified executability and controlled difficulty scaling by object count. We further propose a planner-in-the-loop framework that uses validator and planner diagnostics to revise non-executable specifications through localized edits. Building on this infrastructure, we develop a planner-grounded optimization recipe that combines parameter-efficient Low-Rank Adaptation supervised fine-tuning, offline planner-derived preference pairs for Direct Preference Optimization, and inference-time planner-in-the-loop repair, without requiring online planner calls during training. We also provide a unified evaluation suite for parseability, solvability, specification similarity, and outcome-aware plan-level consistency against planner references. Experiments on representative model families show substantial gains in planner success and plan-level agreement, with improved robustness under difficulty scaling and cross-domain variation. These results highlight the value of externally verifiable formalization for reliable deployment of LLMs in safety- or security-sensitive planning systems. Code and data are available at: https://github.com/ibasicplan/NL-PDDL-Bench
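
The four evaluation notions named in the abstract can be illustrated with a small aggregator. This is a sketch under stated assumptions: the `EvalRecord` fields are hypothetical, specification similarity is approximated with a token-level `difflib` ratio, and plan-level consistency is reduced to exact plan equality, which is stricter than an outcome-aware measure.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class EvalRecord:
    generated_spec: str
    reference_spec: str
    parsed: bool                # validator accepted the generated syntax
    plan: Optional[List[str]]   # plan for the generated spec, None if unsolved
    reference_plan: List[str]   # plan for the reference spec


def evaluate(records: List[EvalRecord]) -> Dict[str, float]:
    """Aggregate parse rate, solve rate, spec similarity, and plan agreement."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to evaluate")
    parsed = sum(r.parsed for r in records)
    # A specification only counts as solved if it also parsed.
    solved = sum(r.parsed and r.plan is not None for r in records)
    agreed = sum(r.parsed and r.plan == r.reference_plan for r in records)
    # Stand-in similarity: ratio of matching whitespace-separated tokens.
    similarity = sum(
        SequenceMatcher(None, r.generated_spec.split(), r.reference_spec.split()).ratio()
        for r in records
    )
    return {
        "parse_rate": parsed / n,
        "solve_rate": solved / n,
        "spec_similarity": similarity / n,
        "plan_agreement": agreed / n,
    }
```

Reporting the stages separately makes it visible when syntax validity is high but planner success is not.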

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ships a new NL-to-PDDL benchmark with planner feedback for repairs, but the claimed gains rest on an unshown revision procedure whose reliability is still unclear.

The core contribution is NL-PDDL-Bench, a multi-domain set of natural-language problems turned into PDDL with explicit planner verification and difficulty scaled by object count, plus a training-plus-inference recipe that mixes LoRA fine-tuning, offline DPO from planner preference pairs, and inference-time localized edits drawn from validator and planner diagnostics.
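
For reference, the standard Direct Preference Optimization objective is written out below. Reading $y_w$ as a planner-solvable specification, $y_l$ as a failing one for the same prompt $x$, and $\pi_{\mathrm{ref}}$ as the fine-tuned starting model is an assumption about how the recipe instantiates it; $\beta$ is the usual temperature and $\sigma$ the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The loss depends only on a fixed dataset $\mathcal{D}$ of preference triples, which is consistent with training that needs no online planner calls.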

The work does a few things cleanly. It releases code and data. It defines a unified evaluation that covers parseability, solvability, specification similarity, and plan-level consistency against planner references. It keeps the training loop free of online planner calls, which is a practical engineering choice. These pieces give readers a concrete starting point for testing LLM formalization in planning domains.

The soft spots sit in the experimental claims and the repair step. The abstract states substantial gains in success rate and robustness under scaling and cross-domain shifts, yet supplies no baseline comparisons, no statistical tests, and no data-split details. That makes it impossible to judge effect size or whether the improvements are real. The central mechanism, using diagnostics for localized edits without new errors or domain-specific rules, is described at a high level but not in enough detail to verify that it avoids cascading inconsistencies such as predicate mismatches or goal drift. If that step turns out to need hidden domain knowledge or frequent manual tuning, the reported robustness will not hold.

This paper is for groups already working on LLM-to-symbolic pipelines in safety-sensitive settings who need a benchmark and a repair pattern to build on. A reader who wants to run controlled tests on formalization quality will find usable artifacts here.

It deserves peer review. The benchmark and the combined training-inference approach are concrete enough to merit referee time, even though the current write-up leaves the strength of the empirical results and the edit procedure open to direct questions.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL specification with planner-verified executability and difficulty scaling; proposes a planner-in-the-loop framework that uses validator/planner diagnostics for localized revision of non-executable specifications; develops a planner-grounded training recipe combining LoRA SFT, offline DPO on planner-derived preferences, and inference-time repair; and reports experimental gains in planner success, plan-level agreement, and robustness under scaling and cross-domain variation on representative model families.

Significance. If the empirical claims hold, the work would strengthen the case for externally verifiable formalization in LLM planning systems, with potential value for safety-critical applications. The public release of code and data at the cited GitHub repository is a clear strength for reproducibility and follow-on work.

major comments (2)

[Abstract] The headline claim of 'substantial gains in planner success and plan-level agreement, with improved robustness under difficulty scaling and cross-domain variation' supplies no baselines, metrics, statistical tests, data splits, or quantitative effect sizes, rendering it impossible to assess whether the reported improvements are supported by the experiments.
[Abstract] The planner-in-the-loop revision step is described only as using 'validator and planner diagnostics to revise non-executable specifications through localized edits' with no account of the edit-generation procedure (prompt-driven or otherwise), safeguards against cascading errors, or how domain-specific repair rules are avoided; this mechanism is load-bearing for all robustness and generalization claims.

minor comments (1)

[Abstract] The abstract introduces several evaluation notions (parseability, solvability, specification similarity, outcome-aware plan-level consistency) without brief definitions or references to their precise formulations in the evaluation suite.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract for greater specificity while preserving its concise nature.

Referee: [Abstract] The headline claim of 'substantial gains in planner success and plan-level agreement, with improved robustness under difficulty scaling and cross-domain variation' supplies no baselines, metrics, statistical tests, data splits, or quantitative effect sizes, rendering it impossible to assess whether the reported improvements are supported by the experiments.

Authors: The abstract is a high-level summary; full details appear in Sections 4–5, including baselines (vanilla LLMs, SFT-only, DPO-only), metrics (planner success rate, plan-level agreement via normalized edit distance and semantic equivalence), statistical tests (paired t-tests with p<0.01), data splits (per-domain 70/15/15), and effect sizes (absolute gains of 12–28% in success rate, larger under scaling). We will revise the abstract to include one or two representative quantitative highlights and effect-size ranges. revision: yes

Referee: [Abstract] The planner-in-the-loop revision step is described only as using 'validator and planner diagnostics to revise non-executable specifications through localized edits' with no account of the edit-generation procedure (prompt-driven or otherwise), safeguards against cascading errors, or how domain-specific repair rules are avoided; this mechanism is load-bearing for all robustness and generalization claims.

Authors: We agree the abstract description is terse. Section 3.3 specifies prompt-driven localized edits generated by the LLM conditioned on validator syntax errors and planner unsolvability messages, with safeguards of at most three repair iterations plus a fallback to the original specification, and general (non-domain-specific) prompts that rely solely on planner feedback rather than hand-crafted rules. We will add a brief clause to the abstract noting the prompt-driven, iteration-limited repair procedure. revision: yes
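
The simulated response above mentions plan-level agreement via normalized edit distance. One plausible reading, offered as an assumption since the paper's exact formulation is not reproduced on this page, is a Levenshtein distance over action sequences divided by the longer plan's length.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def plan_agreement(plan: Sequence[str], reference: Sequence[str]) -> float:
    """Return 1.0 for identical plans and 0.0 for maximally different ones."""
    longest = max(len(plan), len(reference))
    if longest == 0:
        return 1.0  # two empty plans agree trivially
    return 1.0 - edit_distance(plan, reference) / longest
```

A sequence-level score like this penalizes valid plans that reach the goal by a different route, which is presumably why the response pairs it with a semantic-equivalence check.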

Circularity Check

0 steps flagged

No circularity: framework relies on external planner feedback and independent verification.

The paper describes an empirical pipeline (NL-PDDL-Bench, planner-in-the-loop revision, LoRA + DPO with offline planner-derived pairs, and multi-metric evaluation) whose central claims rest on externally verifiable planner success rates and plan agreement against planner references. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce any result to its own inputs by construction. The planner diagnostics are treated as an independent oracle, satisfying the non-circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.
