pith. machine review for the scientific record.

arxiv: 2402.02716 · v1 · submitted 2024-02-05 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

Understanding the planning of LLM agents: A survey

Defu Lian, Enhong Chen, Hao Wang, Ruiming Tang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Xu Huang, Yasheng Wang

Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords LLM agents · planning · survey · task decomposition · plan selection · external module · reflection · memory

The pith

LLM agent planning falls into five categories: task decomposition, plan selection, external modules, reflection, and memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models increasingly act as planners inside autonomous agents, but the ways they generate and refine plans sit scattered across individual papers. This survey collects those approaches and sorts them into a single taxonomy with five parts. It examines the techniques used in each part and notes the challenges that remain. A reader who grasps the structure can see how current methods relate and where further work is needed.

Core claim

The paper establishes that existing research on LLM-based agent planning can be organized into five directions—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—supplies detailed analyses of each direction, and identifies open challenges for the field.

What carries the argument

The taxonomy that divides LLM-agent planning methods into Task Decomposition, Plan Selection, External Module, Reflection, and Memory.
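The taxonomy can be sketched as a small data structure. The method-to-category mapping below is an illustrative guess drawn from well-known works in the paper's reference list, not a table the authors provide:

```python
from enum import Enum

class PlanningCategory(Enum):
    """The survey's five directions for LLM-agent planning."""
    TASK_DECOMPOSITION = "task decomposition"
    PLAN_SELECTION = "plan selection"
    EXTERNAL_MODULE = "external module"
    REFLECTION = "reflection"
    MEMORY = "memory"

# Illustrative assignments (our guess, not the paper's own table):
EXAMPLE_METHODS = {
    "Chain-of-Thought": PlanningCategory.TASK_DECOMPOSITION,
    "Tree of Thoughts": PlanningCategory.PLAN_SELECTION,
    "LLM+P": PlanningCategory.EXTERNAL_MODULE,
    "Reflexion": PlanningCategory.REFLECTION,
    "MemGPT": PlanningCategory.MEMORY,
}

def methods_in(category: PlanningCategory) -> list[str]:
    """List the example methods filed under one category."""
    return [m for m, c in EXAMPLE_METHODS.items() if c is category]
```

A shared enumeration like this is what gives the field the "shared vocabulary" the review mentions: two papers tagging their methods against the same five values become directly comparable.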

If this is right

  • Methods inside each category become easier to compare directly.
  • New research can target specific gaps identified within one category.
  • Hybrid systems that draw techniques from several categories may improve overall performance.
  • The field gains a shared vocabulary for describing planning steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent builders could test whether adding reflection or memory to existing decomposition methods raises success rates on long tasks.
  • Benchmarks might evaluate agents on each of the five dimensions separately to measure balanced improvement.
  • Pure text-based planning may remain limited until external modules or memory are routinely combined with it.
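The second bullet, per-dimension evaluation, can be made concrete with a minimal sketch; the scorecard type, its fields, and the balance metric are hypothetical, not an existing benchmark:

```python
from dataclasses import dataclass

@dataclass
class AgentScorecard:
    """Hypothetical per-dimension scores (0.0-1.0), one per survey category."""
    task_decomposition: float
    plan_selection: float
    external_module: float
    reflection: float
    memory: float

    def balance(self) -> float:
        """Gap between best and worst dimension; 0.0 means perfectly balanced."""
        scores = [self.task_decomposition, self.plan_selection,
                  self.external_module, self.reflection, self.memory]
        return max(scores) - min(scores)

# An agent strong at decomposition but weak at memory shows a large gap:
card = AgentScorecard(0.8, 0.7, 0.5, 0.6, 0.4)
```

Reporting `balance()` alongside an aggregate success rate would reveal agents that look capable overall while neglecting one of the five directions.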

Load-bearing premise

The five categories capture the full space of LLM-agent planning methods without significant gaps or overlaps.

What would settle it

A new planning method for LLM agents that cannot be placed in any of the five categories would show the taxonomy is incomplete.

read the original abstract

As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper surveys recent literature on planning capabilities in LLM-based autonomous agents. It claims to offer the first systematic overview by proposing a taxonomy that organizes existing works into five categories—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—followed by per-category analyses and a discussion of open challenges.

Significance. If the taxonomy is shown to be both comprehensive and non-overlapping, the survey would provide a useful organizing framework for a fast-moving subfield, helping researchers identify patterns across methods and prioritize future work on LLM agent planning. The absence of original empirical claims or derivations means its contribution rests entirely on the quality and coverage of the categorization and synthesis.

major comments (1)
  1. [Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.
minor comments (2)
  1. [Abstract] Abstract: the assertion that the survey is the 'first systematic view' should be supported by a brief comparison to prior LLM-agent surveys in the introduction or related-work section.
  2. [Analyses] The per-category analyses would benefit from a summary table listing key methods, their core mechanisms, and reported performance highlights to improve readability and comparability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the taxonomy concern below and will revise the manuscript accordingly to strengthen the presentation of the categorization framework.

read point-by-point responses
  1. Referee: [Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.

    Authors: We agree that the manuscript would benefit from explicit assignment criteria to minimize ambiguity around category boundaries. In the revised version, we will add a dedicated subsection in the Taxonomy section that defines an assignment protocol: a method is placed in the category corresponding to its primary planning mechanism (e.g., Reflection for iterative self-critique loops even if memory buffers are used secondarily; Memory for explicit storage/retrieval architectures). This protocol will be illustrated with decision rules and edge-case examples. We will also insert a new table mapping 15 representative works (selected for diversity across the five categories) to their assigned categories, with brief justification for each assignment. These additions directly address the risk of overlap and demonstrate coverage without altering the underlying taxonomy. revision: yes
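The primary-mechanism rule proposed above can be sketched as a decision procedure; the mechanism labels and the ordering convention are assumptions for illustration, not the authors' actual protocol:

```python
# Sketch of the rebuttal's assignment protocol: a method lists its
# mechanisms in order of importance, and the category of the *primary*
# mechanism wins, even when secondary mechanisms belong elsewhere.
CATEGORIES = {"task_decomposition", "plan_selection",
              "external_module", "reflection", "memory"}

def assign_category(mechanisms: list[str]) -> str:
    """Assign a method to the category of its first-listed mechanism."""
    primary = mechanisms[0]
    if primary not in CATEGORIES:
        raise ValueError(f"unknown mechanism: {primary}")
    return primary

# Reflexion-style self-critique uses a memory buffer secondarily,
# but its primary mechanism is reflection:
assert assign_category(["reflection", "memory"]) == "reflection"
# MemGPT-style storage/retrieval is primarily a memory architecture:
assert assign_category(["memory"]) == "memory"
```

A method whose primary mechanism raises `ValueError` here is exactly the counterexample the "What would settle it" section asks for: something the five categories cannot place.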

Circularity Check

0 steps flagged

No significant circularity: descriptive survey taxonomy

full rationale

The paper is a literature survey proposing a five-category taxonomy (Task Decomposition, Plan Selection, External Module, Reflection, Memory) for LLM-agent planning research. It contains no equations, derivations, fitted parameters, predictions, or self-referential definitions. The taxonomy is presented as an organizational framework for existing works rather than a derived result; no load-bearing steps reduce to self-citation chains or by-construction equivalences. The central claim of providing a 'first systematic view' is supported by citation of prior literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it relies entirely on the cited prior literature.

pith-pipeline@v0.9.0 · 5396 in / 916 out tokens · 42655 ms · 2026-05-13T18:08:37.632590+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  3. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  4. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  5. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  6. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  7. User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

  8. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  9. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  10. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  11. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  12. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  13. SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification

    cs.SE 2026-04 unverdicted novelty 6.0

    SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.

  14. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  15. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  16. Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

  17. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  18. Lightweight LLM Agent Memory with Small Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    LightMem uses SLMs to modularize agent memory into STM, MTM, and LTM with two-stage vector-plus-semantic retrieval online and incremental consolidation offline, reporting 2.5 F1 gains and low latency over A-MEM on LoCoMo.

  19. A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks

    cs.DC 2026-04 unverdicted novelty 4.0

    An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.

  20. Competition and Cooperation of LLM Agents in Games

    cs.MA 2026-04 unverdicted novelty 4.0

    LLM agents cooperate in two standard games due to fairness reasoning instead of converging to Nash equilibria under multi-round prompts.

  21. Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    cs.AI 2025-01 unverdicted novelty 4.0

    The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...

  22. Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

    cs.AI 2026-04 unverdicted novelty 3.0

    Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.

  23. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

  24. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 22 Pith papers · 20 internal anchors

  1. [1]

    PDDL - the Planning Domain Definition Language

    [Aeronautiques et al., 1998] Constructions Aeronautiques, Adele Howe, et al. PDDL - the Planning Domain Definition Language. Technical report, 1998.

  2. [2]

    Learning from mistakes makes llm better reasoner

    [An et al., 2023] Shengnan An, Zexiong Ma, et al. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689,

  3. [3]

    Graph of thoughts: Solving elaborate problems with large language models

    [Besta et al., 2023] Maciej Besta, Nils Blach, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687,

  4. [4]

    Recent advances in retrieval-augmented text generation

    [Cai et al., 2022] Deng Cai, Yan Wang, Lemao Liu, and Shuming Shi. Recent advances in retrieval-augmented text generation. In SIGIR, pages 3417–3419,

  5. [5]

    Evaluating Large Language Models Trained on Code

    [Chen et al., 2021b] Mark Chen, Jerry Tworek, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  6. [6]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    [Chen et al., 2022] Wenhu Chen, Xueguang Ma, et al. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,

  7. [7]

    Dynamic planning with a llm

    [Dagan et al., 2023] Gautier Dagan, Frank Keller, and Alex Lascarides. Dynamic planning with a llm. arXiv preprint arXiv:2308.06391,

  8. [8]

    Mind2web: Towards a generalist agent for the web

    [Deng et al., 2023] Xiang Deng, Yu Gu, et al. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070,

  9. [9]

    Pal: Program-aided language models

    [Gao et al., 2023] Luyu Gao, Aman Madaan, et al. Pal: Program-aided language models. In ICML, pages 10764–10799,

  10. [10]

    Lpg: A planner based on local search for planning graphs with action costs

    [Gerevini and Serina, 2002] Alfonso Gerevini and Ivan Serina. Lpg: A planner based on local search for planning graphs with action costs. In Aips, volume 2, pages 281–290,

  11. [11]

    Automated Planning: theory and practice

    [Ghallab et al., 2004] Malik Ghallab, Dana Nau, et al. Automated Planning: theory and practice. Elsevier,

  12. [12]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    [Gou et al., 2023] Zhibin Gou, Zhihong Shao, et al. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738,

  13. [13]

    Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

    [Guan et al., 2023] Lin Guan, Karthik Valmeekam, et al. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. arXiv preprint arXiv:2305.14909,

  14. [14]

    Reasoning with language model is planning with world model

    [Hao et al., 2023] Shibo Hao, Yi Gu, et al. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992,

  15. [15]

    An introduction to the planning domain definition language

    [Haslum et al., 2019] Patrik Haslum, Nir Lipovetzky, et al. An introduction to the planning domain definition language,

  16. [16]

    Deep reinforcement learning with a natural language action space

    [He et al., 2015] Ji He, Jianshu Chen, et al. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636,

  17. [17]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    [Huang et al., 2023a] Lei Huang, Yu Weijiang, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232,

  18. [18]

    Recommender AI agent: Integrating large language models for interactive recommendations

    [Huang et al., 2023b] Xu Huang, Jianxun Lian, et al. Recommender ai agent: Integrating large language models for interactive recommendations. arXiv preprint arXiv:2308.16505,

  19. [19]

    Billion-scale similarity search with GPUs

    [Johnson et al., 2019] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547,

  20. [20]

    Language models can solve computer tasks

    [Kim and others, 2023] Geunwoo Kim et al. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491,

  21. [21]

    Large language models are zero-shot reasoners

    [Kojima et al., 2022] Takeshi Kojima, Shixiang Shane Gu, et al. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213,

  22. [22]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    [Lewis et al., 2020] Patrick Lewis, Ethan Perez, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 33:9459–9474,

  23. [23]

    Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks

    [Lin et al., 2023] Bill Yuchen Lin, Yicheng Fu, et al. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. arXiv preprint arXiv:2305.17390,

  24. [24]

    Width and inference based planners: Siw, bfs (f), and probe

    [Lipovetzky et al., 2014] Nir Lipovetzky, Miquel Ramirez, et al. Width and inference based planners: Siw, bfs (f), and probe. IPC, page 43,

  25. [25]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    [Liu et al., 2023a] Bo Liu, Yuqian Jiang, et al. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477,

  26. [26]

    Think-in-memory: Recalling and post-thinking enable llms with long-term memory

    [Liu et al., 2023b] Lei Liu, Xiaoyan Yang, et al. Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv preprint arXiv:2311.08719,

  27. [27]

    AgentBench: Evaluating LLMs as Agents

    [Liu et al., 2023c] Xiao Liu, Hao Yu, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688,

  28. [28]

    Self-Refine: Iterative Refinement with Self-Feedback

    [Madaan et al., 2023] Aman Madaan, Niket Tandon, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651,

  29. [29]

    Generation-augmented retrieval for open-domain question answering

    [Mao et al., 2020] Yuning Mao, Pengcheng He, et al. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553,

  30. [30]

    MemGPT: Towards LLMs as Operating Systems

    [Packer et al., 2023] Charles Packer, Vivian Fang, et al. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560,

  31. [31]

    Unifying large language models and knowledge graphs: A roadmap

    [Pan et al., 2024] Shirui Pan, Linhao Luo, et al. Unifying large language models and knowledge graphs: A roadmap. TKDE,

  32. [32]

    Generative agents: Interactive simulacra of human behavior

    [Park et al., 2023] Joon Sung Park, Joseph O’Brien, et al. Generative agents: Interactive simulacra of human behavior. In UIST, pages 1–22,

  33. [33]

    Tool learning with foundation models

    [Qin et al., 2023] Yujia Qin, Shengding Hu, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354,

  34. [34]

    Cognitive task analysis

    [Schraagen et al., 2000] Jan Maarten Schraagen, Susan F Chipman, et al. Cognitive task analysis. Psychology Press,

  35. [35]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    [Shen et al., 2023] Yongliang Shen, Kaitao Song, et al. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580,

  36. [36]

    Reflexion: Language agents with verbal reinforcement learning

    [Shinn et al., 2023] Noah Shinn, Federico Cassano, et al. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS,

  37. [37]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    [Shridhar et al., 2020] Mohit Shridhar, Xingdi Yuan, et al. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768,

  38. [38]

    Progprompt: Generating situated robot task plans using large language models

    [Singh et al., 2023] Ishika Singh, Valts Blukis, et al. Progprompt: Generating situated robot task plans using large language models. In ICRA 2023, pages 11523–11530. IEEE,

  39. [39]

    A survey of reasoning with foundation models

    [Sun et al., 2023] Jiankai Sun, Chuanyang Zheng, et al. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562,

  40. [40]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    [Thorne et al., 2018] James Thorne, Andreas Vlachos, et al. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355,

  41. [41]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    [Touvron et al., 2023] Hugo Touvron, Louis Martin, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  42. [42]

    Scienceworld: Is your agent smarter than a 5th grader?

    [Wang et al., 2022a] Ruoyao Wang, Peter Jansen, et al. Scienceworld: Is your agent smarter than a 5th grader? arXiv preprint arXiv:2203.07540,

  43. [43]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    [Wang et al., 2022b] Xuezhi Wang, Jason Wei, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

  44. [44]

    A survey on large language model based autonomous agents

    [Wang et al., 2023a] Lei Wang, Chen Ma, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432,

  45. [45]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    [Wang et al., 2023b] Lei Wang, Wanyu Xu, et al. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091,

  46. [46]

    Recmind: Large language model powered agent for recommendation

    [Wang et al., 2023c] Yancheng Wang, Ziyan Jiang, et al. Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296,

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    [Wei et al., 2022] Jason Wei, Xuezhi Wang, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837,

  48. [48]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    [Wu et al., 2023] Chenfei Wu, Shengming Yin, et al. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671,

  49. [49]

    C-pack: Packaged resources to advance general chinese embedding

    [Xiao and others, 2023] Shitao Xiao et al. C-pack: Packaged resources to advance general chinese embedding,

  50. [50]

    Llm a*: Human in the loop large language models enabled a* search for robotics

    [Xiao and Wang, 2023] Hengjia Xiao and Peng Wang. Llm a*: Human in the loop large language models enabled a* search for robotics. arXiv preprint arXiv:2312.01797,

  51. [51]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    [Yang et al., 2018] Zhilin Yang, Peng Qi, et al. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,

  52. [52]

    Foundation models for decision making: Problems, methods, and opportunities

    [Yang et al., 2023a] Sherry Yang, Ofir Nachum, et al. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129,

  53. [53]

    Coupling large language models with logic programming for robust and general reasoning from text

    [Yang et al., 2023b] Zhun Yang, Adam Ishay, and Joohyung Lee. Coupling large language models with logic programming for robust and general reasoning from text. arXiv preprint arXiv:2307.07696,

  54. [55]

    Keep calm and explore: Language models for action generation in text-based games

    [Yao et al., 2020b] Shunyu Yao, Rohan Rao, et al. Keep calm and explore: Language models for action generation in text-based games. arXiv preprint arXiv:2010.02903,

  55. [56]

    ReAct: Synergizing Reasoning and Acting in Language Models

    [Yao et al., 2022] Shunyu Yao, Jeffrey Zhao, et al. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629,

  56. [57]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    [Yao et al., 2023] Shunyu Yao, Dian Yu, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601,

  57. [58]

    Agenttuning: Enabling generalized agent abilities for llms

    [Zeng et al., 2023] Aohan Zeng, Mingdao Liu, et al. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823,

  58. [59]

    Large language model is semi-parametric reinforcement learning agent

    [Zhang et al., 2023a] Danyang Zhang, Lu Chen, et al. Large language model is semi-parametric reinforcement learning agent. arXiv preprint arXiv:2306.07929,

  59. [60]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    [Zhang et al., 2023b] Yue Zhang, Yafu Li, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219,

  60. [61]

    A Survey of Large Language Models

    [Zhao et al., 2023a] Wayne Xin Zhao, Kun Zhou, et al. A survey of large language models. arXiv preprint arXiv:2303.18223,

  61. [62]

    Large language models as commonsense knowledge for large-scale task planning

    [Zhao et al., 2023b] Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. arXiv preprint arXiv:2305.14078,

  62. [63]

    Memorybank: Enhancing large language models with long-term memory

    [Zhong et al., 2023] Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250,

  63. [64]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    [Zhou et al., 2023] Shuyan Zhou, Frank F Xu, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023