pith. machine review for the scientific record.

arXiv: 2305.17144 · v2 · submitted 2023-05-25 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

Recognition: 3 theorem links

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:32 UTC · model grok-4.3

classification: 💻 cs.AI · cs.CL · cs.CV · cs.LG
keywords: Minecraft · Large Language Models · Open-world environments · Agent planning · Text-based memory · Generally capable agents · Reinforcement learning alternatives · Long-horizon tasks

The pith

Large language models with text memory and knowledge let agents complete Minecraft's full Overworld item tree for the first time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GITM, a framework that pairs large language models with text descriptions of game state and memory to generate action plans for agents in Minecraft. It shows these agents can handle long-horizon tasks in sparse-reward open worlds without any reinforcement learning training. The resulting system raises success on the ObtainDiamond task by 47.5 percentage points over prior methods and becomes the first to collect every item along the Overworld technology tree. A reader would care because the work suggests that an LLM's built-in logic can replace expensive policy training for building competent agents in complex, uncertain settings. The approach runs on a single CPU node and points toward simpler ways to create general agents that adapt to new goals.

Core claim

The paper claims that large language models equipped with structured actions, text-based knowledge, and memory can produce generally capable agents that navigate open-world Minecraft environments. By prompting the LLM to output action plans from the current text state and past memory, the agent achieves a success rate 47.5 percentage points higher on ObtainDiamond than previous reinforcement-learning controllers and becomes the first to obtain every item in the Overworld technology tree. The method requires no GPU training and runs on a single 32-core CPU node.

What carries the argument

The GITM framework, which uses an LLM to convert text-based game state and memory into sequences of structured actions that the agent then executes in the environment.
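To make the mechanism concrete, a minimal sketch of this loop might look as follows, assuming a purely text-based interface; the `env`, `llm`, and `memory` objects and their methods are hypothetical stand-ins, not the authors' code.

```python
# A minimal sketch of the loop described above. The objects env, llm, and
# memory (describe_state, execute, goal_reached, plan, retrieve, store)
# are hypothetical stand-ins, not the authors' implementation.

def run_agent(goal, env, llm, memory, max_steps=50):
    """Text state + memory -> LLM plan -> execute -> feedback, repeated."""
    feedback = None
    for _ in range(max_steps):
        prompt = "\n".join(filter(None, [
            f"Goal: {goal}",
            f"Current state:\n{env.describe_state()}",      # text, not pixels
            f"Past experience:\n{memory.retrieve(goal)}",   # prior successful plans
            f"Last action failed with: {feedback}" if feedback else None,
            "Respond with a JSON list of structured actions.",
        ]))
        actions = llm.plan(prompt)             # e.g. [{"name": "mine", "args": {...}}]
        done, feedback = env.execute(actions)  # stops at the first failed action
        if done and env.goal_reached(goal):
            memory.store(goal, actions)        # keep the plan for future goals
            return True
    return False
```

The feedback-and-replan step mirrors the paper's appendix note that execution halts at the first failed action and the planner resumes from there with an error message in the prompt.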

If this is right

  • The agent completes the entire Overworld technology tree, a result not previously achieved.
  • Success rate on the ObtainDiamond task rises 47.5 percentage points over prior reinforcement-learning methods.
  • The system runs without GPU training on a single CPU node with 32 cores.
  • The same text-based planning approach shows robustness across a broader set of open-world tasks beyond single objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-state and memory mechanism could be tested in other simulated environments that expose state through language rather than pixels.
  • Combining the LLM planner with occasional visual feedback might improve robustness when text descriptions become ambiguous.
  • If the approach scales, it reduces the need to train separate policies for each new game or task, shifting effort toward prompt design and memory management.
  • One could measure whether the same LLM agent maintains performance when the Minecraft world is altered with new blocks or rules not seen in training data.

Load-bearing premise

Large language models already contain enough logic and common sense to produce reliable long-horizon action plans when supplied only with text descriptions of state and memory.

What would settle it

A controlled test in which the LLM agent is given text state and memory for a new long-horizon Minecraft goal and either fails to complete it or matches the success rate of a trained RL baseline rather than exceeding it by a large margin.
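For concreteness, a settling experiment of this kind reduces to comparing two binomial success rates. Below is a minimal sketch of how such a head-to-head could be scored, assuming independent episodes with binary outcomes; the counts are placeholders, not reported numbers.

```python
import math

def gap_ci(s_a: int, s_b: int, n: int, z: float = 1.96):
    """Difference in success rates with a normal-approximation 95% CI
    (two-proportion z interval); assumes n independent episodes per method."""
    p_a, p_b = s_a / n, s_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    gap = p_a - p_b
    return gap, (gap - z * se, gap + z * se)

# Hypothetical counts: 100 episodes per method. A "large margin" claim
# survives only if the whole interval sits well above zero.
print(gap_ci(s_a=68, s_b=20, n=100))
```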

read the original abstract

The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, the current research landscape predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate on the "ObtainDiamond" task, demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM does not need any GPU for training, but a single CPU node with 32 CPU cores is enough. This research shows the potential of LLMs in developing capable agents for handling long-horizon, complex tasks and adapting to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GITM, a framework integrating large language models with text-based knowledge and memory to create generally capable agents for open-world Minecraft. It claims a +47.5% success-rate improvement on the ObtainDiamond task over prior RL methods, reports being the first agent to complete the full Overworld technology tree, and emphasizes that the approach requires no GPU training.

Significance. If the empirical claims hold under rigorous controls, the work is significant for demonstrating that LLM-driven planning from text state and memory can address long-horizon sparse-reward tasks in open worlds without task-specific training, offering a scalable alternative to RL controllers and enabling broader generalization across Minecraft objectives.

major comments (3)
  1. [Results, ObtainDiamond paragraph] The +47.5% success-rate gain is stated without error bars, standard deviations, or the number of independent episodes/seeds, which is load-bearing for the central claim that the LLM-based agent markedly surpasses previous methods.
  2. [Methods, action planning and prompting] Exact prompt templates, the precise format of the text-based state/memory input, and the decision procedure for invoking the LLM are not supplied, preventing attribution of performance to LLM logic versus the structured action interface.
  3. [Experiments, tech-tree and ablation] No ablation removes the LLM planner while retaining the same structured actions and memory scaffolding, leaving open whether the reported tech-tree completion and ObtainDiamond gains require the LLM or could be achieved by simpler scripted policies.
minor comments (2)
  1. [Abstract and Results] The manuscript references a GitHub project page but does not include a table explicitly comparing GITM against every cited baseline on the full set of tasks, which would strengthen the generalization claim.
  2. [Methods] Notation for the structured actions (e.g., parameter lists and preconditions) could be illustrated with one concrete example in the main text rather than only in supplementary material.
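To make the second minor comment concrete, here is one illustrative shape such a structured action could take, using the iron-ore goal that appears in the paper's appendix fragments. The record type and field names are our own sketch, not the authors' notation; the appendix itself names actions such as `explore(object, strategy)`.

```python
# One illustrative schema for a structured action with a parameter list and
# preconditions; this exact record type is our assumption, not the paper's.
from dataclasses import dataclass, field

@dataclass
class StructuredAction:
    name: str                 # e.g. "explore", "mine", "craft"
    params: dict              # arguments the LLM fills in
    preconditions: list = field(default_factory=list)  # required inventory items

# The appendix's iron-ore goal example might compile to:
mine_iron = StructuredAction(
    name="mine",
    params={"object": "iron_ore", "count": 1},
    preconditions=["stone_pickaxe"],  # iron ore needs at least a stone pickaxe
)
```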

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will update the manuscript to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Results, ObtainDiamond paragraph] The +47.5% success-rate gain is stated without error bars, standard deviations, or the number of independent episodes/seeds, which is load-bearing for the central claim that the LLM-based agent markedly surpasses previous methods.

    Authors: We agree that statistical details are necessary to support the central performance claim. In the revised manuscript, we will report the number of independent episodes (100 trials per method), include standard deviations for all success rates, and add error bars to the ObtainDiamond results figure. The +47.5% figure is the difference between mean success rates of GITM and the strongest prior RL baseline. revision: yes

  2. Referee: [Methods, action planning and prompting] Exact prompt templates, the precise format of the text-based state/memory input, and the decision procedure for invoking the LLM are not supplied, preventing attribution of performance to LLM logic versus the structured action interface.

    Authors: We will add the exact prompt templates to an appendix. We will also describe the precise text format for state and memory inputs and clarify the invocation procedure (the LLM is called at each planning step to output the next action sequence given the current state, memory, and task; a paraphrased sketch of this format appears after these responses). This will make the contribution of the LLM explicit. revision: yes

  3. Referee: [Experiments, tech-tree and ablation] No ablation removes the LLM planner while retaining the same structured actions and memory scaffolding, leaving open whether the reported tech-tree completion and ObtainDiamond gains require the LLM or could be achieved by simpler scripted policies.

    Authors: This is a fair criticism. We will add a discussion paragraph explaining why a purely scripted policy using the same actions would be insufficient for long-horizon generalization across the full tech tree. If space and time permit, we will also report a simple rule-based baseline comparison; otherwise we will note this as a limitation and leave a full ablation for future work. revision: partial
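For reference, the prompt fragments recovered from the paper's appendix suggest a SYSTEM/USER message split in which the agent reports an inventory dict and an environment string (biome, y-level, ground or underground) and the model replies as JSON with "explanation", "thoughts", and "action_list" fields. The sketch below paraphrases that format; the wording and the `build_messages` helper are our own assumptions, not the authors' verbatim template.

```python
# A paraphrased sketch of the SYSTEM/USER split the appendix fragments
# describe; wording and helper are assumptions, not the verbatim template.

SYSTEM = (
    "You can only use the provided structured actions, e.g. "
    "explore(object, strategy). Choose the easiest way to achieve the goal "
    "conditioned on my current state. Respond as JSON with the fields "
    '"explanation", "thoughts", and "action_list".'
)

def build_messages(goal: str, inventory: dict, environment: str) -> list:
    """Assemble one planning-step query in chat-message form."""
    user = (
        f"My current state:\n"
        f"- inventory: {inventory}\n"        # object name -> quantity
        f"- environment: {environment}\n"    # biome, y-level, ground/underground
        f"The goal is to {goal}. Generate the plan according to the requirements."
    )
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": user}]

# Hypothetical single planning step:
msgs = build_messages("obtain 1 iron_ore",
                      {"wooden_pickaxe": 1, "cobblestone": 8},
                      "plains biome, y-level 64, on the ground")
```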

Circularity Check

0 steps flagged

No circularity: empirical LLM framework with external performance claims

full rationale

The paper presents an empirical agent architecture that combines pre-trained LLMs with text-based memory and structured actions for Minecraft tasks. No mathematical derivations, equations, or fitted parameters are introduced whose outputs are then renamed as predictions. Success-rate improvements and technology-tree completion are reported from experimental runs rather than any self-referential construction. Any self-citations present are not load-bearing for a derivation chain and do not reduce the central claims to unverified author priors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes LLMs already encode the necessary planning logic and that a fixed set of structured actions is sufficient to express any Minecraft behavior.

axioms (1)
  • domain assumption: Large language models possess sufficient logic and common sense for Minecraft planning when given text state and memory.
    Invoked in the description of how the agent generates action plans.

pith-pipeline@v0.9.0 · 5663 in / 1151 out tokens · 23892 ms · 2026-05-15T18:32:52.638408+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LogicAsFunctionalEquation laws_of_logic_imply_dalembert_hypotheses · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.

  • HierarchyEmergence hierarchy_emergence_forces_phi · unclear


    The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate on the 'ObtainDiamond' task... our agent is the first to procure all items in the Minecraft Overworld technology tree.

  • DimensionForcing dimension_forced · unclear


    GITM does not need any GPU for training, but a single CPU node with 32 CPU cores is enough.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  2. Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

    cs.AI 2026-05 unverdicted novelty 7.0

    AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.

  3. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  4. PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

  5. 2.5-D Decomposition for LLM-Based Spatial Construction

    cs.AI 2026-05 unverdicted novelty 6.0

    2.5-D decomposition lets LLMs achieve 94.6% structural accuracy on a building benchmark by handling only horizontal planning while a symbolic system manages vertical placements from occupancy.

  6. Long-Term Memory for VLA-based Agents in Open-World Task Execution

    cs.RO 2026-04 unverdicted novelty 6.0

    ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

  7. SkillDroid: Compile Once, Reuse Forever

    cs.HC 2026-04 conditional novelty 6.0

    SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 r...

  8. RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...

  9. MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    MIMIC-Py provides a modular Python framework that turns personality-driven LLM agents into an extensible system for automated game testing via configurable traits, decoupled components, and multiple interaction methods.

  10. HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

    cs.AI 2026-03 unverdicted novelty 6.0

    HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.

  11. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  12. Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

    cs.AI 2026-05 unverdicted novelty 5.0

    AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.

  13. Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game

    cs.MA 2026-04 unverdicted novelty 5.0

    Gated escalation and partitioned states enable more efficient multi-agent collaboration in Minecraft by making communication selective rather than automatic.

  14. Experience Transfer for Multimodal LLM Agents in Minecraft Game

    cs.AI 2026-04 unverdicted novelty 5.0

    Echo framework enables experience transfer for multimodal LLM agents in Minecraft by decomposing knowledge into structure, attribute, process, function, and interaction dimensions and applying in-context analogy learn...

  15. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  16. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    cs.CL 2023-05 conditional novelty 5.0

    Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

  17. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  18. Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 4.0

    SuperIgor uses iterative co-training of a language model planner and a goal-conditional RL agent to self-generate and refine plans, resulting in stricter instruction adherence and better generalization to unseen instructions.

  19. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  20. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 19 Pith papers

  1. [1] A. Amiranashvili, N. Dorka, W. Burgard, V. Koltun, and T. Brox. Scaling imitation learning in Minecraft. arXiv preprint arXiv:2007.02701, 2020.

  2. [2] B. Baker, I. Akkaya, P. Zhokov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.

  3. [3] S. Cai, Z. Wang, X. Ma, A. Liu, and Y. Liang. Open-world multi-task control through goal-aware representation learning and adaptive horizon prediction. arXiv preprint arXiv:2301.10034, 2023.

  4. [4] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.

  5. [5] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D.-A. Huang, Y. Zhu, and A. Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. arXiv preprint arXiv:2206.08853, 2022.

  6. [6–7] W. H. Guss, S. Milani, N. Topin, B. Houghton, S. Mohanty, A. Melnik, A. Harter, B. Buschmaas, B. Jaster, C. Berganski, et al. Towards robust and domain agnostic reinforcement learning competitions: MineRL. In NeurIPS 2020 Competition and Demonstration Track, pages 233–252. PMLR, 2021.

  8. [8] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

  9. [9] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022.

  10. [10] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. Inner Monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.

  11. [11] A. Kanervisto, S. Milani, K. Ramanauskas, N. Topin, Z. Lin, J. Li, J. Shi, D. Ye, Q. Fu, W. Yang, et al. MineRL Diamond 2021 competition: Overview, results, and lessons learned. NeurIPS 2021 Competitions and Demonstrations Track, pages 13–28, 2022.

  12. [12] I. Kanitscheider, J. Huizinga, D. Farhi, W. H. Guss, B. Houghton, R. Sampedro, P. Zhokhov, B. Baker, A. Ecoffet, J. Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021.

  13. [13] M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li. API-Bank: A benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244, 2023.

  14. [14] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as Policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.

  15. [15] Z. Lin, J. Li, J. Shi, D. Ye, Q. Fu, and W. Yang. JueWu-MC: Playing Minecraft with sample-efficient hierarchical reinforcement learning. arXiv preprint arXiv:2112.04907, 2021.

  16. [16] H. Mao, C. Wang, X. Hao, Y. Mao, Y. Lu, C. Wu, J. Hao, D. Li, and P. Tang. SEIHAI: A sample-efficient hierarchical AI for the MineRL competition. In Distributed Artificial Intelligence: Third International Conference, DAI 2021, Shanghai, China, December 17–18, 2021, Proceedings 3, pages 38–51. Springer, 2022.

  17. [17] Y. Matsuo, Y. LeCun, M. Sahani, D. Precup, D. Silver, M. Sugiyama, E. Uchibe, and J. Morimoto. Deep learning, reinforcement learning, and world models. Neural Networks, 2022.

  18. [18] S. Milani, N. Topin, B. Houghton, W. H. Guss, S. P. Mohanty, K. Nakata, O. Vinyals, and N. S. Kuno. Retrospective analysis of the 2019 MineRL competition on sample efficient reinforcement learning. In NeurIPS 2019 Competition and Demonstration Track, pages 203–214. PMLR, 2020.

  19. [19] S. Milani, A. Kanervisto, K. Ramanauskas, S. Schulhoff, B. Houghton, S. Mohanty, B. Galbraith, K. Chen, Y. Song, T. Zhou, et al. Towards solving fuzzy tasks with human feedback: A retrospective of the MineRL BASALT 2022 competition. arXiv preprint arXiv:2303.13512, 2023.

  20. [20] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

  21. [21] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.

  22. [22] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. ProgPrompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022.

  23. [23] A. Skrynnik, A. Staroverov, E. Aitygulov, K. Aksenov, V. Davydov, and A. I. Panov. Hierarchical deep Q-network from imperfect demonstrations in Minecraft. Cognitive Systems Research, 65:74–78, 2021.

  24. [24] O. E. L. Team, A. Stooke, A. Mahajan, C. Barros, C. Deck, J. Bauer, J. Sygnowski, M. Trebacz, M. Jaderberg, M. Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.

  25. [25] C. Tessler, S. Givony, T. Zahavy, D. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

  26. [26] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang. Describe, Explain, Plan and Select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023.

  27. [27] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.

  28. [28] Wikipedia contributors. Minecraft — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Minecraft&oldid=1155148900, 2023.

  29. [29] H. Yuan, C. Zhang, H. Wang, F. Xie, P. Cai, H. Dong, and Z. Lu. Plan4MC: Skill reinforcement learning and planning for open-world Minecraft tasks. arXiv preprint arXiv:2303.16563, 2023.
