LLM-guided evolutionary search yields the first domain-independent C++ planning heuristics that exceed the strongest hand-engineered baselines on coverage and speed trade-offs across unseen domains.
hub Canonical reference
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) have demonstrated remarkable zero-shot generalization abilities: state-of-the-art chatbots can provide plausible answers to many common questions that arise in daily life. However, so far, LLMs cannot reliably solve long-horizon planning problems. By contrast, classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans. In an effort to get the best of both worlds, this paper introduces LLM+P, the first framework that incorporates the strengths of classical planners into LLMs. LLM+P takes in a natural language description of a planning problem, then returns a correct (or optimal) plan for solving that problem in natural language. LLM+P does so by first converting the language description into a file written in the planning domain definition language (PDDL), then leveraging classical planners to quickly find a solution, and then translating the found solution back into natural language. Along with LLM+P, we define a diverse set of different benchmark problems taken from common planning scenarios. Via a comprehensive set of experiments on these benchmark problems, we find that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems.\footnote{The code and results are publicly available at https://github.com/Cranial-XIX/llm-pddl.git.
hub tools
citation-role summary
citation-polarity summary
roles
background 11representative citing papers
Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Self-improvement of a decoder-only transformer yields plans averaging 30% shorter than a source symbolic planner, over 80% optimal where known, with sub-exponential latency scaling.
LLM-Flax automates neuro-symbolic robotic task planning with three LLM stages for rule generation, failure recovery, and zero-shot scoring, outperforming manual baselines on MazeNamo grids.
ANCHOR raises mobile manipulation success from 53.3% to 71.7% in unseen homes by binding plans to observable geometry, ensuring operable navigation endpoints, and using layered local recovery instead of global replans.
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA and fact-checking datasets.
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
BISON learns bilevel policies over symbolic world models to generalize long-horizon robotic planning beyond VLA and end-to-end baselines while remaining efficient even at 10,000-object scale.
CSR with ASR enables infinite-horizon real-time LLM policies via stable KV-cache properties and background eviction, delivering 26x lower latency and SOTA recall on embodied benchmarks.
Behavior Forest decouples multi-constraint travel planning into parallel behavior trees with LLM nodes and global coordination, yielding 6.67% and 11.82% gains over prior methods on two benchmarks.
COMPASS formalizes prompt engineering as a POMDP-based cognitive decision process for self-adaptive generation of task plan explanations via LLMs.
SYMBOLIZER grounds symbolic states from images via VLMs using only lifted predicates and solves long-horizon tasks with goal-count and width-based heuristic search, outperforming direct VLM planning and matching VLM-heuristic baselines on ProDG and ViPlan benchmarks.
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
KGLAMP uses a dynamically updated knowledge graph to guide LLMs in creating and replanning PDDL specifications for heterogeneous multi-robot teams, reporting at least 25.3% better performance than LLM-only or classical PDDL baselines on the MAT-THOR benchmark.
UniDomain extracts atomic PDDL domains from 12,393 robot videos to create a unified domain of 3137 operators and 2875 predicates, then retrieves and fuses relevant parts to enable zero-shot planning on unseen real-world tasks.
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
citing papers explorer
-
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning
LLM-guided evolutionary search yields the first domain-independent C++ planning heuristics that exceed the strongest hand-engineered baselines on coverage and speed trade-offs across unseen domains.
-
Property-Guided LLM Program Synthesis for Planning
Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
-
Self-Improvement for Fast, High-Quality Plan Generation
Self-improvement of a decoder-only transformer yields plans averaging 30% shorter than a source symbolic planner, over 80% optimal where known, with sub-exponential latency scaling.
-
LLM-Flax : Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models
LLM-Flax automates neuro-symbolic robotic task planning with three LLM stages for rule generation, failure recovery, and zero-shot scoring, outperforming manual baselines on MazeNamo grids.
-
ANCHOR: A Physically Grounded Closed-Loop Framework for Robust Home-Service Mobile Manipulation
ANCHOR raises mobile manipulation success from 53.3% to 71.7% in unseen homes by binding plans to observable geometry, ensuring operable navigation endpoints, and using layered local recovery instead of global replans.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS
Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA and fact-checking datasets.
-
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.
-
Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
-
Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
BISON learns bilevel policies over symbolic world models to generalize long-horizon robotic planning beyond VLA and end-to-end baselines while remaining efficient even at 10,000-object scale.
-
CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
CSR with ASR enables infinite-horizon real-time LLM policies via stable KV-cache properties and background eviction, delivering 26x lower latency and SOTA recall on embodied benchmarks.
-
Decoupled Travel Planning with Behavior Forest
Behavior Forest decouples multi-constraint travel planning into parallel behavior trees with LLM nodes and global coordination, yielding 6.67% and 11.82% gains over prior methods on two benchmarks.
-
Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
COMPASS formalizes prompt engineering as a POMDP-based cognitive decision process for self-adaptive generation of task plan explanations via LLMs.
-
SYMBOLIZER: Symbolic Model-free Task Planning with VLMs
SYMBOLIZER grounds symbolic states from images via VLMs using only lifted predicates and solves long-horizon tasks with goal-count and width-based heuristic search, outperforming direct VLM planning and matching VLM-heuristic baselines on ProDG and ViPlan benchmarks.
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning
KGLAMP uses a dynamically updated knowledge graph to guide LLMs in creating and replanning PDDL specifications for heterogeneous multi-robot teams, reporting at least 25.3% better performance than LLM-only or classical PDDL baselines on the MAT-THOR benchmark.
-
UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning
UniDomain extracts atomic PDDL domains from 12,393 robot videos to create a unified domain of 3137 operators and 2875 predicates, then retrieves and fuses relevant parts to enable zero-shot planning on unseen real-world tasks.
-
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
A Survey on Vision-Language-Action Models for Embodied AI
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
-
Cognitive Architectures for Language Agents
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
-
Reasoning with Language Model is Planning with World Model
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
-
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
-
Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.
-
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
-
End-to-end PDDL Planning with Hardcoded and Dynamic Agents
An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.
-
Adaptive Obstacle-Aware Task Assignment and Planning for Heterogeneous Robot Teaming
OATH combines adaptive Halton sampling, obstacle-aware clustering with auctions, and LLM-based instruction interpretation to improve task assignment and planning for heterogeneous robot teams in obstacle-rich environments.
-
LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning
LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.
-
"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
STPR uses LLMs to generate Python constraint functions from natural language instructions, then applies them via traditional search algorithms to point clouds in simulated Gazebo robot environments with reported full compliance.
-
Iterative Formalization and Planning in Partially Observable Environments
PDDLego iteratively formalizes and refines PDDL representations of partially observable environments to improve planning success without finetuning or in-context examples.
-
Efficient Test-time Inference for Generative Planning Models with OCL Search
Modified OCL search integrates generative rollouts and learned heuristics for efficient inference in planning models across combinatorial domains.
-
AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System
AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
-
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
- LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
- Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)