Understanding the planning of LLM agents: A survey

Defu Lian; Enhong Chen; Hao Wang; Ruiming Tang; Weiwen Liu; Xiaolong Chen; Xingmei Wang; Xu Huang; Yasheng Wang

arXiv: 2402.02716 · v1 · submitted 2024-02-05 · cs.AI · cs.CL · cs.LG

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, Enhong Chen

Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG

keywords LLM agents · planning · survey · task decomposition · plan selection · external module · reflection · memory

The pith

LLM agent planning falls into five categories: task decomposition, plan selection, external modules, reflection, and memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models increasingly act as planners inside autonomous agents, but the ways they generate and refine plans sit scattered across individual papers. This survey collects those approaches and sorts them into a single taxonomy with five parts. It examines the techniques used in each part and notes the challenges that remain. A reader who grasps the structure can see how current methods relate and where further work is needed.

Core claim

The paper establishes that existing research on LLM-based agent planning can be organized into five directions—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—supplies detailed analyses of each direction, and identifies open challenges for the field.

What carries the argument

The taxonomy that divides LLM-agent planning methods into Task Decomposition, Plan Selection, External Module, Reflection, and Memory.

If this is right

Methods inside each category become easier to compare directly.
New research can target specific gaps identified within one category.
Hybrid systems that draw techniques from several categories may improve overall performance.
The field gains a shared vocabulary for describing planning steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent builders could test whether adding reflection or memory to existing decomposition methods raises success rates on long tasks.
Benchmarks might evaluate agents on each of the five dimensions separately to measure balanced improvement.
Pure text-based planning may remain limited until external modules or memory are routinely combined with it.

Load-bearing premise

The five categories capture the full space of LLM-agent planning methods without significant gaps or overlaps.

What would settle it

A new planning method for LLM agents that cannot be placed in any of the five categories would show the taxonomy is incomplete.
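The load-bearing premise and its test can be read as a partition check: every method should land in exactly one category. A minimal sketch in Python, assuming each surveyed method has been hand-labelled with the set of categories it plausibly fits. The function name `audit_taxonomy` and the method names and labels are hypothetical placeholders, not assignments taken from the paper.

```python
# The five category names come from the survey's taxonomy.
CATEGORIES = frozenset({
    "Task Decomposition", "Plan Selection", "External Module",
    "Reflection", "Memory",
})


def audit_taxonomy(labels):
    """Return (gaps, overlaps) for a mapping of method name -> set of categories.

    gaps: methods that fit no category (evidence the taxonomy is incomplete).
    overlaps: methods that fit more than one category (evidence of overlap).
    """
    gaps, overlaps = [], []
    for method, cats in labels.items():
        unknown = set(cats) - CATEGORIES
        if unknown:
            raise ValueError(f"{method}: unknown categories {sorted(unknown)}")
        if not cats:
            gaps.append(method)
        elif len(cats) > 1:
            overlaps.append(method)
    return sorted(gaps), sorted(overlaps)


# Hypothetical labels, for illustration only.
example_labels = {
    "method_a": {"Task Decomposition"},
    "method_b": {"Reflection", "Memory"},
    "method_c": set(),
}
gaps, overlaps = audit_taxonomy(example_labels)
print("gaps:", gaps)
print("overlaps:", overlaps)
```

A single entry in `gaps` is the counterexample described above. Entries in `overlaps` do not falsify the taxonomy, but they show where a tie-breaking rule would be needed.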

Original abstract

As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A useful survey that organizes LLM-agent planning work into a five-part taxonomy but adds no new methods or results.

This survey gives a structured overview of how LLMs are used for planning inside agents. It breaks the literature into five categories (Task Decomposition, Plan Selection, External Module, Reflection, and Memory) and reviews recent papers under each heading while listing open challenges at the end. The main value is that it pulls scattered work into one place at a time when the area is expanding quickly. The groupings are reasonable on the surface and the challenge section flags practical issues like long-horizon reliability without overclaiming. For anyone new to the topic or trying to map what has been tried, the paper saves time by pointing to the key directions and citations.

It does not invent any techniques or run experiments, so its contribution is entirely organizational. The taxonomy is presented as comprehensive, yet the paper gives no detailed argument for why these five buckets are exhaustive or free of overlap. In practice many systems combine reflection with memory or use external modules inside decomposition steps, so the divisions can feel somewhat artificial. Coverage depends on the authors' search process, and without a quantitative trend analysis or explicit inclusion criteria it is hard to judge how complete the picture really is.

This paper is aimed at researchers and students who need a quick map of LLM-agent planning rather than implementers looking for ready-to-use algorithms. It can serve as a reference point for the subfield.

I would send it for peer review. The structure is clear enough that referees can check for missing papers or suggest refinements to the categories, and a polished version would likely get cited as an entry point even if it does not push the technical state of the art.

Referee Report

1 major / 2 minor

Summary. The paper surveys recent literature on planning capabilities in LLM-based autonomous agents. It claims to offer the first systematic overview by proposing a taxonomy that organizes existing works into five categories—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—followed by per-category analyses and a discussion of open challenges.

Significance. If the taxonomy is shown to be both comprehensive and non-overlapping, the survey would provide a useful organizing framework for a fast-moving subfield, helping researchers identify patterns across methods and prioritize future work on LLM agent planning. The absence of original empirical claims or derivations means its contribution rests entirely on the quality and coverage of the categorization and synthesis.

major comments (1)

[Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.

minor comments (2)

[Abstract] Abstract: the assertion that the survey is the 'first systematic view' should be supported by a brief comparison to prior LLM-agent surveys in the introduction or related-work section.
[Analyses] The per-category analyses would benefit from a summary table listing key methods, their core mechanisms, and reported performance highlights to improve readability and comparability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the taxonomy concern below and will revise the manuscript accordingly to strengthen the presentation of the categorization framework.

Point-by-point responses

Referee: [Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.

Authors: We agree that the manuscript would benefit from explicit assignment criteria to minimize ambiguity around category boundaries. In the revised version, we will add a dedicated subsection in the Taxonomy section that defines an assignment protocol: a method is placed in the category corresponding to its primary planning mechanism (e.g., Reflection for iterative self-critique loops even if memory buffers are used secondarily; Memory for explicit storage/retrieval architectures). This protocol will be illustrated with decision rules and edge-case examples. We will also insert a new table mapping 15 representative works (selected for diversity across the five categories) to their assigned categories, with brief justification for each assignment. These additions directly address the risk of overlap and demonstrate coverage without altering the underlying taxonomy.

revision: yes
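The primary-mechanism rule in the simulated rebuttal could be sketched as follows. This is an illustrative Python sketch only: the `Method` record, its field names, and the example entry are assumptions introduced here, and the rebuttal does not specify how a method's primary mechanism would be determined in practice.

```python
from dataclasses import dataclass, field

# The five category names come from the survey's taxonomy.
CATEGORIES = (
    "Task Decomposition", "Plan Selection", "External Module",
    "Reflection", "Memory",
)


@dataclass
class Method:
    """Hypothetical record describing one surveyed planning method."""
    name: str
    primary: str                                  # mechanism that drives planning
    secondary: set = field(default_factory=set)   # mechanisms used in support


def assign_category(method):
    """Place a method in the category of its primary planning mechanism."""
    if method.primary not in CATEGORIES:
        raise ValueError(f"{method.name}: no category for {method.primary!r}")
    return method.primary


def is_edge_case(method):
    """Flag methods whose secondary mechanisms belong to other categories."""
    return any(c in CATEGORIES and c != method.primary for c in method.secondary)


# Hypothetical entry mirroring the rebuttal's example: a self-critique loop
# that also keeps a memory buffer is assigned to Reflection, not Memory.
critic = Method("self_critique_agent", primary="Reflection", secondary={"Memory"})
print(assign_category(critic), is_edge_case(critic))
```

Under this rule the overlap the referee raises does not disappear. It moves into the judgment of which mechanism counts as primary, which is why flagged edge cases would still need a written justification.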

Circularity Check

0 steps flagged

No significant circularity: descriptive survey taxonomy

full rationale

The paper is a literature survey proposing a five-category taxonomy (Task Decomposition, Plan Selection, External Module, Reflection, Memory) for LLM-agent planning research. It contains no equations, derivations, fitted parameters, predictions, or self-referential definitions. The taxonomy is presented as an organizational framework for existing works rather than a derived result; no load-bearing steps reduce to self-citation chains or by-construction equivalences. The central claim of providing a 'first systematic view' is supported by citation of prior literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it relies entirely on the cited prior literature.

pith-pipeline@v0.9.0 · 5396 in / 916 out tokens · 42655 ms · 2026-05-13T18:08:37.632590+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents
cs.SE 2026-06 unverdicted novelty 8.0

RigorBench is the first benchmark for process discipline in autonomous AI coding agents, reporting 41% higher process quality scores and 17% higher outcome correctness when agents follow structured engineering practices.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
cs.CL 2026-05 unverdicted novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Toward Agentic SysAdmin: Rethinking System Administration with AI Agents
cs.NI 2026-06 unverdicted novelty 7.0

NetLLMeval is an emulation-based framework for benchmarking LLM solvers on network admin tasks, with a 24000-run study showing solver architecture lifts a 14B model from 0.43 to 0.88 accuracy and allows local models t...
RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents
cs.SE 2026-06 unverdicted novelty 7.0

RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
cs.CL 2026-06 unverdicted novelty 7.0

CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
cs.CV 2026-05 unverdicted novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
Uncertainty Propagation in LLM-Based Systems
cs.SE 2026-04 unverdicted novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
cs.AI 2026-04 unverdicted novelty 7.0

OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Evaluating Plan Compliance in Autonomous Programming Agents
cs.SE 2026-04 unverdicted novelty 7.0

Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
cs.IR 2026-04 unverdicted novelty 7.0

SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents
q-bio.QM 2025-10 unverdicted novelty 7.0

GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,...
The Challenge and Reward of Fair Play in Narrative: A Computational Approach
cs.CL 2025-07 unverdicted novelty 7.0

Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play ...
FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
cs.CV 2025-06 unverdicted novelty 7.0

FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
cs.LG 2026-06 unverdicted novelty 6.0

G2PO transforms linear trajectories into graphs, aggregates identical states for lower-variance value estimates, and uses edge-centric TD standardization, reporting up to 22.2% gains over GRPO on WebShop, ALFWorld, an...
Uncertainty Decomposition for Clarification Seeking in LLM Agents
cs.AI 2026-06 unverdicted novelty 6.0

A prompt-based uncertainty decomposition separates action confidence from request uncertainty to enable clarification seeking in LLM agents, yielding F1 gains of 73% and 36% over baselines on two new underspecified be...
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
cs.CL 2026-06 unverdicted novelty 6.0

OPD-Evolver uses on-policy self-distillation in fast interaction and slow attribution loops to build agents with holistic memory competence, outperforming prior systems by up to 11.5% and allowing a 9B model to compet...
Formalizing and Mitigating Structural Distortion in LLM Attention for Graph Reasoning
cs.LG 2026-06 unverdicted novelty 6.0

Rotary embeddings create bandwidth-dependent attention decay during graph linearization; GaLA corrects this at inference time to boost performance on text-attributed graphs.
Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents
cs.CL 2026-06 unverdicted novelty 6.0

Autopilot enforces verifiable termination via a gated FSM scheduler and hard floor, proving that termination implies goal achievement under gate soundness, floor enforcement, and plan coverage, while cutting fabricati...
SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems
cs.MA 2026-06 unverdicted novelty 6.0

SAIGuard uses communication-state simulation on the MAS interaction graph to detect and sanitize risky messages via reconstruction deviations, reducing attack success while preserving utility.
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
cs.CR 2026-06 unverdicted novelty 6.0

Formalizes stored prompt injection in agentic systems, develops a taxonomy and benchmark to show how adversarial prompts can persist across sessions via persistent state artifacts.
How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning
cs.MA 2026-05 unverdicted novelty 6.0

Formalizes design space for human-LLM collaborative planning along mode, scope, and level axes; evaluates AMBIPOM prototype via user study and benchmark revealing hybrid workflows and trade-offs.
BLAgent: Agentic RAG for File-Level Bug Localization
cs.SE 2026-05 unverdicted novelty 6.0

BLAgent achieves over 78% top-1 file-level bug localization accuracy on SWE-bench-Lite with open-source models and over 86% with closed-source models while being over 18x cheaper than the strongest baseline.
BLAgent: Agentic RAG for File-Level Bug Localization
cs.SE 2026-05 unverdicted novelty 6.0

BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite for file-level bug localization using agentic RAG, at 18x lower cost than baselines, and boosts end-to-end APR success by over 20%.
PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship
cs.HC 2026-05 unverdicted novelty 6.0

PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
cs.AI 2026-05 unverdicted novelty 6.0

A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
cs.AI 2026-05 unverdicted novelty 6.0

FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
cs.AI 2026-05 unverdicted novelty 6.0

FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
QuantClaw: Precision Where It Matters for OpenClaw
cs.AI 2026-04 unverdicted novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
cs.SE 2026-04 unverdicted novelty 6.0

SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
cs.MA 2026-03 unverdicted novelty 6.0

A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
cs.AI 2026-03 unverdicted novelty 6.0

HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
cs.CR 2026-02 unverdicted novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks
cs.HC 2025-10 conditional novelty 6.0

A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versu...
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL 2025-09 unverdicted novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
cs.AI 2025-06 unverdicted novelty 6.0

Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
cs.AI 2025-04 unverdicted novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
Retrieval-Augmented Generation for Natural Language Processing: A Survey
cs.CL 2024-07 accept novelty 6.0

The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.
Buildrix: An Open Platform for Sharing and Benchmarking Agentic AI Skills in Building Engineering
eess.SY 2026-06 unverdicted novelty 5.0

Buildrix is presented as an open platform for developing, sharing, executing, and evaluating agentic AI skills for building engineering workflows.
A Technical Taxonomy of LLM Agent Communication Protocols
cs.MA 2026-06 unverdicted novelty 5.0

Creates a five-dimension taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) from nine protocols and identifies architectural patterns plus convergence trends.
TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning
cs.CL 2026-06 unverdicted novelty 5.0

TabClaw is an interactive LLM agent for spreadsheets that exposes editable plans, uses parallel specialist agents, streams ReAct loops, and distills skills from user feedback, reporting improved benchmark task completion.
SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
cs.AI 2026-05 unverdicted novelty 5.0

SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
cs.AI 2026-05 unverdicted novelty 5.0

MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.
Reframing LLM Agent Security as an Agent-Human Interaction Problem
cs.CR 2026-05 unverdicted novelty 5.0

LLM agent security is reframed as an agent-human interaction issue, supported by a survey showing industry preference for human-centric mechanisms over academic favorites and proposing a new research agenda.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification
cs.MA 2026-05 unverdicted novelty 5.0

IFPV integrates multi-perspective hierarchical agents for generative planning with an adversarial cognitive simulation engine for verification, reporting 19.4% higher mission success, 41.7% lower cost versus LLM basel...
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and Se...
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
cs.CL 2026-05 unverdicted novelty 5.0

Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
cs.AI 2026-05 unverdicted novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 5.0

Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
Lightweight LLM Agent Memory with Small Language Models
cs.AI 2026-04 unverdicted novelty 5.0

LightMem uses SLMs to modularize agent memory into STM, MTM, and LTM with two-stage vector-plus-semantic retrieval online and incremental consolidation offline, reporting 2.5 F1 gains and low latency over A-MEM on LoCoMo.
The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
cs.AI 2026-06 unverdicted novelty 4.0

The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.
RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents
cs.AI 2026-06 unverdicted novelty 4.0

RIZZ is a continual adaptation framework for black-box LLM agents that uses dynamically spawned memory branches, context-aware routing, verifier-gated updates, and prompt compilation to control interference across non...
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
cs.DC 2026-04 unverdicted novelty 4.0

An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 68 Pith papers · 27 internal anchors

[1]

Pddl— the planning domain definition language

[Aeronautiques et al., 1998] Constructions Aeronautiques, Adele Howe, et al. Pddl— the planning domain definition language. Technical Report, Tech. Rep.,

work page 1998
[2]

Learning from mistakes makes llm better reasoner

[An et al., 2023] Shengnan An, Zexiong Ma, et al. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689,

work page arXiv 2023
[3]

Graph of thoughts: Solving elaborate problems with large language models

[Besta et al., 2023] Maciej Besta, Nils Blach, et al. Graph of thoughts: Solving elaborate problems with large language models. AAAI 38 (2024); arXiv preprint arXiv:2308.09687, https://arxiv.org/abs/2308.09687

work page arXiv 2023
[4]

Recent advances in retrieval-augmented text generation

[Cai et al., 2022] Deng Cai, Yan Wang, Lemao Liu, and Shuming Shi. Recent advances in retrieval-augmented text generation. In SIGIR, pages 3417–3419,

work page 2022
[5]

Evaluating Large Language Models Trained on Code

[Chen et al., 2021b] Mark Chen, Jerry Tworek, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

[Chen et al., 2022] Wenhu Chen, Xueguang Ma, et al. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Dynamic planning with a llm

[Dagan et al., 2023] Gautier Dagan, Frank Keller, and Alex Lascarides. Dynamic planning with a llm. arXiv preprint arXiv:2308.06391,

work page arXiv 2023
[8]

Mind2Web: Towards a Generalist Agent for the Web

[Deng et al., 2023] Xiang Deng, Yu Gu, et al. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070,

work page internal anchor Pith review arXiv 2023
[9]

Pal: Program-aided language models

[Gao et al., 2023] Luyu Gao, Aman Madaan, et al. Pal: Program-aided language models. In ICML, pages 10764–10799,

work page 2023
[10]

Lpg: A planner based on local search for planning graphs with action costs

[Gerevini and Serina, 2002] Alfonso Gerevini and Ivan Se- rina. Lpg: A planner based on local search for planning graphs with action costs. In Aips, volume 2, pages 281– 290,

work page 2002
[11]

Auto- mated Planning: theory and practice

[Ghallab et al., 2004] Malik Ghallab, Dana Nau, et al. Auto- mated Planning: theory and practice. Elsevier,

work page 2004
[12]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

[Gou et al., 2023] Zhibin Gou, Zhihong Shao, et al. Critic: Large language models can self-correct with tool- interactive critiquing. arXiv preprint arXiv:2305.11738 ,

work page internal anchor Pith review arXiv 2023
[13]

Advances in Neural Information Processing Systems (NeurIPS) , year =

[Guan et al., 2023] Lin Guan, Karthik Valmeekam, et al. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. arXiv preprint arXiv:2305.14909,

work page arXiv 2023
[14]

Reasoning with Language Model is Planning with World Model

[Hao et al., 2023] Shibo Hao, Yi Gu, et al. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992,

work page internal anchor Pith review arXiv 2023
[15]

An introduction to the planning domain definition lan- guage, volume

[Haslum et al., 2019] Patrik Haslum, Nir Lipovetzky, et al. An introduction to the planning domain definition lan- guage, volume

work page 2019
[16]

Deep Reinforcement Learning with a Natural Language Action Space

[He et al., 2015] Ji He, Jianshu Chen, et al. Deep reinforce- ment learning with a natural language action space. arXiv preprint arXiv:1511.04636,

work page Pith review arXiv 2015
[17]

[Huang et al., 2023a] Lei Huang, Weijiang Yu, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.

[Huang et al., 2023b] Xu Huang, Jianxun Lian, et al. Recommender AI agent: Integrating large language models for interactive recommendations. arXiv preprint arXiv:2308.16505, 2023.

[Johnson et al., 2019] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.

[Kim and others, 2023] Geunwoo Kim et al. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.

[Kojima et al., 2022] Takeshi Kojima, Shixiang Shane Gu, et al. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.

[Lewis et al., 2020] Patrick Lewis, Ethan Perez, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33:9459–9474, 2020.

[Lin et al., 2023] Bill Yuchen Lin, Yicheng Fu, et al. SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks. arXiv preprint arXiv:2305.17390, 2023.

[Lipovetzky et al., 2014] Nir Lipovetzky, Miquel Ramirez, et al. Width and inference based planners: SIW, BFS(f), and PROBE. IPC, page 43, 2014.

[Liu et al., 2023a] Bo Liu, Yuqian Jiang, et al. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.

[Liu et al., 2023b] Lei Liu, Xiaoyan Yang, et al. Think-in-memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719, 2023.

[Liu et al., 2023c] Xiao Liu, Hao Yu, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.

[Madaan et al., 2023] Aman Madaan, Niket Tandon, et al. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.

[Mao et al., 2020] Yuning Mao, Pengcheng He, Xiaodong Liu, et al. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553, 2020.

[Packer et al., 2023] Charles Packer, Vivian Fang, et al. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[Pan et al., 2024] Shirui Pan, Linhao Luo, et al. Unifying large language models and knowledge graphs: A roadmap. TKDE, 2024.

[Park et al., 2023] Joon Sung Park, Joseph O’Brien, et al. Generative agents: Interactive simulacra of human behavior. In UIST, pages 1–22, 2023.

[Qin et al., 2023] Yujia Qin, Shengding Hu, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023.

[Schraagen et al., 2000] Jan Maarten Schraagen, Susan F Chipman, et al. Cognitive task analysis. Psychology Press, 2000.

[Shen et al., 2023] Yongliang Shen, Kaitao Song, et al. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.

[Shinn et al., 2023] Noah Shinn, Federico Cassano, et al. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023.

[Shridhar et al., 2020] Mohit Shridhar, Xingdi Yuan, et al. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.

[Singh et al., 2023] Ishika Singh, Valts Blukis, et al. ProgPrompt: Generating situated robot task plans using large language models. In ICRA 2023, pages 11523–11530. IEEE, 2023.

[Sun et al., 2023] Jiankai Sun, Chuanyang Zheng, et al. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562, 2023.

[Thorne et al., 2018] James Thorne, Andreas Vlachos, et al. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.

[Touvron et al., 2023] Hugo Touvron, Louis Martin, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[Wang et al., 2022a] Ruoyao Wang, Peter Jansen, et al. ScienceWorld: Is your agent smarter than a 5th grader? arXiv preprint arXiv:2203.07540, 2022.

[Wang et al., 2022b] Xuezhi Wang, Jason Wei, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[Wang et al., 2023a] Lei Wang, Chen Ma, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023.

[Wang et al., 2023b] Lei Wang, Wanyu Xu, et al. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023.

[Wang et al., 2023c] Yancheng Wang, Ziyan Jiang, et al. RecMind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296, 2023.

[Wei et al., 2022] Jason Wei, Xuezhi Wang, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.

[Wu et al., 2023] Chenfei Wu, Shengming Yin, et al. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.

[Xiao and others, 2023] Shitao Xiao et al. C-Pack: Packaged resources to advance general Chinese embedding, 2023.

[Xiao and Wang, 2023] Hengjia Xiao and Peng Wang. LLM A*: Human in the loop large language models enabled A* search for robotics. arXiv preprint arXiv:2312.01797, 2023.

[Yang et al., 2018] Zhilin Yang, Peng Qi, et al. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

[Yang et al., 2023a] Sherry Yang, Ofir Nachum, et al. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129, 2023.

[Yang et al., 2023b] Zhun Yang, Adam Ishay, and Joohyung Lee. Coupling large language models with logic programming for robust and general reasoning from text. arXiv preprint arXiv:2307.07696, 2023.

[Yao et al., 2020b] Shunyu Yao, Rohan Rao, et al. Keep calm and explore: Language models for action generation in text-based games. arXiv preprint arXiv:2010.02903, 2020.

[Yao et al., 2022] Shunyu Yao, Jeffrey Zhao, et al. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[Yao et al., 2023] Shunyu Yao, Dian Yu, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

[Zeng et al., 2023] Aohan Zeng, Mingdao Liu, et al. AgentTuning: Enabling generalized agent abilities for LLMs. arXiv preprint arXiv:2310.12823, 2023.

[Zhang et al., 2023a] Danyang Zhang, Lu Chen, et al. Large language model is semi-parametric reinforcement learning agent. arXiv preprint arXiv:2306.07929, 2023.

[Zhang et al., 2023b] Yue Zhang, Yafu Li, et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.

[Zhao et al., 2023a] Wayne Xin Zhao, Kun Zhou, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

[Zhao et al., 2023b] Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. arXiv preprint arXiv:2305.14078, 2023. Also in Advances in Neural Information Processing Systems (NeurIPS).

[Zhong et al., 2023] Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250, 2023.

[Zhou et al., 2023] Shuyan Zhou, Frank F Xu, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
