Decoupled Travel Planning with Behavior Forest
Pith reviewed 2026-05-09 22:48 UTC · model grok-4.3
The pith
Structuring travel planning as a forest of behavior trees lets LLMs reason over local subtasks while coordinating globally.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Behavior sequences form the foundation of travel planning, yet each step is shaped by both local constraints inside its subtask and global constraints that span many subtasks. Prior methods collapse all constraints into one decision space, forcing the model to reason over the full entanglement at every step. The Behavior Forest instead builds a collection of parallel behavior trees, assigns each tree to one subtask, embeds large language models inside the tree nodes to produce subplans from local constraints alone, and adds an explicit global coordination mechanism that passes feedback between trees. The trees supply control structure while the LLMs supply generation, decoupling the problem.
What carries the argument
The Behavior Forest: a collection of parallel behavior trees, each responsible for one subtask, with large language models at the nodes performing localized reasoning and a global coordination mechanism that orchestrates interactions across trees.
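To make the division of labor concrete, here is a minimal Python sketch of the control loop the abstract describes. Everything in it is an illustrative reconstruction rather than the authors' code: `BehaviorTree`, `tick`, `plan_with_forest`, and the `llm` and `check_global` callables are all assumed names.

```python
# Hedged reconstruction of the Behavior Forest loop described above.
# All names are illustrative assumptions, not the authors' actual API.
from dataclasses import dataclass


@dataclass
class BehaviorTree:
    """One tree per subtask (e.g. lodging, transport, dining)."""
    subtask: str
    local_constraints: list[str]

    def tick(self, feedback: list[str], llm) -> dict:
        # The LLM at this node reasons only over local constraints
        # plus coordination feedback addressed to this subtask.
        prompt = {"subtask": self.subtask,
                  "constraints": self.local_constraints + feedback}
        return llm(prompt)  # returns a candidate subplan as a dict


def plan_with_forest(trees, llm, check_global, max_rounds=5):
    """Tick each tree on local context, then let a global pass detect
    cross-tree conflicts and route feedback to the offending trees."""
    feedback = {t.subtask: [] for t in trees}
    subplans = {}
    for _ in range(max_rounds):
        subplans = {t.subtask: t.tick(feedback[t.subtask], llm) for t in trees}
        conflicts = check_global(subplans)  # e.g. budget or date overlaps
        if not conflicts:
            return subplans  # coherent global itinerary
        for subtask, message in conflicts:
            feedback[subtask].append(message)  # only these trees revise
    return subplans  # best effort after max_rounds
```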
If this is right
- Each subtask can be planned with constraints that act only inside its own tree, avoiding the need to track distant global rules at every step.
- Coordination feedback lets one tree revise an earlier decision when later trees reveal conflicts, without restarting the entire plan (a budget-check sketch follows this list).
- The explicit tree structure supplies a readable trace of how local decisions combine into a global itinerary.
- LLM generation is guided at each node by the tree's local context rather than by a single undifferentiated prompt.
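The second bullet is the crux, so here is a hedged example of what a global check might look like: a budget cap that spans several subtasks and so cannot be enforced by any single tree. The `cost` field and the pick-the-costliest-subtask heuristic are assumptions for illustration, written to plug into `plan_with_forest` above.

```python
# Illustrative global check: a budget constraint spans several subtasks,
# so it lives in the coordinator, not in any single tree. The `cost`
# field and the revision heuristic are assumptions, not from the paper.
def check_global(subplans: dict[str, dict], budget: float = 1500.0):
    conflicts = []
    total = sum(p.get("cost", 0.0) for p in subplans.values())
    if total > budget:
        # Feedback targets the costliest subtask; only that tree
        # revises, instead of restarting the entire plan.
        worst = max(subplans, key=lambda s: subplans[s].get("cost", 0.0))
        conflicts.append(
            (worst, f"total cost {total} exceeds budget {budget}; reduce cost"))
    return conflicts
```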
Where Pith is reading between the lines
- The same forest structure could be applied to other multi-stage planning domains such as project scheduling or supply-chain routing where local and global constraints also interact.
- If coordination feedback is made richer, the method might support incremental replanning when new constraints appear mid-execution.
- Embedding smaller specialized models inside individual trees rather than a single large model might further reduce per-step cost while preserving the decoupling benefit.
Load-bearing premise
That the global coordination mechanism can reliably resolve interactions among subtasks without adding overhead that offsets the benefits of local decoupling, and that LLMs embedded in tree nodes can generate coherent subplans conditioned only on task-specific constraints plus coordination feedback.
What would settle it
A controlled ablation that disables the coordination layer entirely while keeping the same forest structure and LLM nodes, then measures whether plan success rate on TravelPlanner or ChinaTravel falls back to the level of entangled baselines.
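Under the same illustrative names as the sketches above (reusing `plan_with_forest`), this ablation is a one-line swap: replace the coordinator with a check that never emits conflicts. `build_forest` and `evaluate` are hypothetical harness hooks, not benchmark APIs.

```python
# Hedged sketch of the proposed ablation: identical forest and LLM nodes,
# with the coordination layer disabled. build_forest/evaluate are
# hypothetical harness hooks, not benchmark APIs.
def no_coordination(subplans):
    return []  # never reports conflicts, so no feedback ever flows


def ablate_coordination(tasks, build_forest, llm, check_global, evaluate):
    scores = {"with_coordination": [], "without_coordination": []}
    for task in tasks:
        full = plan_with_forest(build_forest(task), llm, check_global)
        scores["with_coordination"].append(evaluate(task, full))
        ablated = plan_with_forest(build_forest(task), llm, no_coordination)
        scores["without_coordination"].append(evaluate(task, ablated))
    # Mean success rate per condition; a drop to baseline levels would
    # attribute the reported gains to the coordination layer.
    return {k: sum(v) / len(v) for k, v in scores.items()}
```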
Original abstract
Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between locally acting constraints within a subtask and global constraints that span multiple subtasks. Consequently, the model is forced to jointly reason over local and global constraints at each decision step, increasing the reasoning burden and reducing planning efficiency. To address this problem, we propose the Behavior Forest method. Specifically, our approach structures the decision-making process into a forest of parallel behavior trees, where each behavior tree is responsible for a subtask. A global coordination mechanism is introduced to orchestrate the interactions among these trees, enabling modular and coherent travel planning. Within this framework, large language models are embedded as decision engines within behavior tree nodes, performing localized reasoning conditioned on task-specific constraints to generate candidate subplans and adapt decisions based on coordination feedback. The behavior trees, in turn, provide an explicit control structure that guides LLM generation. This design decouples complex tasks and constraints into manageable subspaces, enabling task-specific reasoning and reducing the cognitive load of LLM. Experimental results show that our method outperforms state-of-the-art methods by 6.67% on the TravelPlanner and by 11.82% on the ChinaTravel benchmarks, demonstrating its effectiveness in increasing LLM performance for complex multi-constraint travel planning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Behavior Forest approach for travel planning tasks. It decouples planning into a forest of parallel behavior trees, each managing a subtask with local constraints, while a global coordination mechanism handles interactions across trees. LLMs serve as decision engines in the tree nodes, generating subplans based on task-specific constraints and coordination feedback. The paper claims this reduces the reasoning burden on LLMs and reports empirical improvements of 6.67% on the TravelPlanner benchmark and 11.82% on the ChinaTravel benchmark over state-of-the-art methods.
Significance. If the results hold and the gains are due to the proposed decoupling rather than extraneous factors, the Behavior Forest method could provide a valuable framework for enhancing LLM performance in multi-constraint planning problems by offering modular control structures. The explicit use of behavior trees for guiding LLM generation is a promising direction for addressing complex decision processes.
major comments (2)
- [Method] The global coordination mechanism is introduced to orchestrate interactions among the parallel behavior trees, but the manuscript provides no algorithm, pseudocode, or detailed rules for how conflicts are resolved or how feedback is integrated to maintain coherence. This is a load-bearing omission for the central claim that the approach enables modular and coherent planning without re-entangling constraints.
- [Experiments] The experimental results claim specific percentage improvements (6.67% on TravelPlanner, 11.82% on ChinaTravel) but lack details on the baselines compared against, the precise metrics used to compute these gains, statistical significance, or controls for variables such as total LLM inference budget and prompt variations. Without these, the attribution of gains to the decoupling mechanism cannot be verified.
minor comments (1)
- [Abstract] The abstract refers to 'state-of-the-art methods' without naming them, which reduces clarity for readers unfamiliar with the specific benchmarks.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and completeness.
Point-by-point responses
-
Referee: [Method] The global coordination mechanism is introduced to orchestrate interactions among the parallel behavior trees, but the manuscript provides no algorithm, pseudocode, or detailed rules for how conflicts are resolved or how feedback is integrated to maintain coherence. This is a load-bearing omission for the central claim that the approach enables modular and coherent planning without re-entangling constraints.
Authors: We acknowledge that the current description of the global coordination mechanism, while present in Section 3.2, would benefit from greater formalization. The mechanism uses priority-based merging of subplans and iterative feedback loops to resolve conflicts (e.g., date overlaps or budget violations) without re-entangling all constraints. To address this directly, we will add a dedicated algorithm box with pseudocode and explicit rules for conflict resolution and feedback integration in the revised manuscript. revision: yes
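The rebuttal names priority-based merging without giving the rule, so the following is only one plausible reading: merge subplans in a fixed priority order and convert each clash into feedback for the lower-priority tree. The priority list, the `conflicts_with` test, and the plan fields are all assumptions pending the promised algorithm box.

```python
# One plausible reading of "priority-based merging of subplans", sketched
# under stated assumptions; the actual rule is deferred to the revision.
def merge_by_priority(subplans, priority, conflicts_with):
    merged, feedback = {}, []
    for subtask in priority:           # e.g. ["transport", "lodging", "dining"]
        plan = subplans[subtask]
        clash = next((s for s in merged if conflicts_with(plan, merged[s])), None)
        if clash is None:
            merged[subtask] = plan     # accepted into the itinerary
        else:
            # The lower-priority subplan yields; its tree receives feedback
            # (e.g. a date overlap or budget violation) and re-plans.
            feedback.append((subtask, f"conflicts with accepted {clash} subplan"))
    return merged, feedback
```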
-
Referee: [Experiments] The experimental results claim specific percentage improvements (6.67% on TravelPlanner, 11.82% on ChinaTravel) but lack details on the baselines compared against, the precise metrics used to compute these gains, statistical significance, or controls for variables such as total LLM inference budget and prompt variations. Without these, the attribution of gains to the decoupling mechanism cannot be verified.
Authors: We agree that additional experimental details are necessary for full verifiability. The reported gains are computed using the standard success-rate and constraint-satisfaction metrics from the TravelPlanner and ChinaTravel benchmarks, relative to the strongest prior baselines. In the revision we will include an expanded results table with all baseline scores, paired statistical significance tests across multiple runs, and new ablation experiments that control for total LLM inference budget and prompt variations to better isolate the effect of the Behavior Forest decoupling. revision: yes
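For readers unfamiliar with the benchmarks, the metrics named here typically decompose into two aggregation styles, sketched below. This is a generic illustration, not the benchmarks' official scoring code.

```python
# Generic sketch of the two aggregation styles behind "success rate" and
# "constraint satisfaction": micro averages over individual constraints,
# macro counts a plan as passing only if every constraint passes.
def pass_rates(results: list[list[bool]]) -> dict[str, float]:
    total_constraints = sum(len(r) for r in results)
    micro = sum(sum(r) for r in results) / total_constraints
    macro = sum(all(r) for r in results) / len(results)
    return {"micro_pass_rate": micro, "macro_success_rate": macro}


# Example: plan 1 meets all 3 constraints, plan 2 meets 1 of 2.
print(pass_rates([[True, True, True], [True, False]]))
# -> {'micro_pass_rate': 0.8, 'macro_success_rate': 0.5}
```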
Circularity Check
No significant circularity; empirical claims rest on benchmark comparisons without self-referential reductions
Full rationale
The paper proposes a Behavior Forest architecture for decoupling travel planning into parallel behavior trees with a global coordinator, embedding LLMs for localized reasoning. No mathematical derivations, equations, or first-principles predictions are present in the provided text. The central claims consist of a descriptive method outline followed by reported empirical gains (6.67% on TravelPlanner, 11.82% on ChinaTravel) against baselines. These are external benchmark results rather than quantities fitted or defined in terms of themselves. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core components, and the coordination mechanism is presented as an explicit design choice without reducing to a tautology or prior self-result. The derivation chain is therefore self-contained as an engineering proposal validated externally.