
arxiv: 2605.26637 · v1 · pith:IW56MS3Cnew · submitted 2026-05-26 · 💻 cs.RO

Enabling Extensible Embodied Capabilities with Tools

Pith reviewed 2026-06-29 17:23 UTC · model grok-4.3

classification 💻 cs.RO
keywords: embodied AI, tool use, capability externalization, robotics benchmarks, modular policies, perception and cognition, task planning

The pith

Decoupling embodied skills into external tools improves task performance by an average of 31 percent on EB-ALFRED and 36 percent on EB-Navigation, while exposing limits in execution capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that unified models cannot reliably handle the mix of perception, reasoning, planning and control required for embodied tasks because those skills differ in structure and demand. Instead, it separates the skills into a library of independent tools that a model can discover and call at runtime through a new protocol. Experiments on navigation and household benchmarks show consistent gains from this separation, especially in cognitive steps, but smaller benefits when the tool must drive physical actions. The work also documents that current models still fail at deciding when and how to use the tools, turning tool competence itself into a measurable bottleneck.

Core claim

Capability externalization, achieved by registering heterogeneous skills as independently optimized tools under the Embodied Tool Protocol and invoking them dynamically, produces average performance gains of 31 percent on EB-ALFRED and 36 percent on EB-Navigation; the gains are large for perception and cognition tools yet remain limited for execution tools, and models across families continue to struggle with tool-necessity recognition, selection, execution, and chain composition.

What carries the argument

Embodied Tool Protocol (ETP), a standardized interface for tool registration, discovery, invocation, and execution that allows heterogeneous capabilities to be maintained and called outside a single policy network.
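
The four protocol stages named above (registration, discovery, invocation, execution) can be illustrated with a toy registry. This is a minimal sketch under stated assumptions, not the paper's implementation: `ToolSpec`, `ToolRegistry`, `fake_detector`, and the `object_detector` name are hypothetical, and the actual ETP message format is not described in the text available here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool record: a name, a capability tag, and a callable."""
    name: str
    capability: str  # e.g. "perception", "cognition", "reasoning", "execution"
    run: Callable[[Dict[str, Any]], Dict[str, Any]]


class ToolRegistry:
    """Toy registry covering registration, discovery, and invocation."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # Registration: reject duplicate names so lookups stay unambiguous.
        if spec.name in self._tools:
            raise ValueError(f"tool already registered: {spec.name}")
        self._tools[spec.name] = spec

    def discover(self, capability: str) -> List[str]:
        # Discovery: list the names of tools that declare a capability.
        return sorted(n for n, s in self._tools.items() if s.capability == capability)

    def invoke(self, name: str, query: Dict[str, Any]) -> Dict[str, Any]:
        # Invocation and execution: run the tool's callable on the query.
        return self._tools[name].run(query)


def fake_detector(query: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real perception model (assumption for illustration).
    return {"found": query["target"] in {"mug", "apple"}}


registry = ToolRegistry()
registry.register(ToolSpec("object_detector", "perception", fake_detector))
result = registry.invoke("object_detector", {"target": "mug"})
```

Because the caller only sees a name, a capability tag, and a query/response pair, the callable behind `run` could be swapped without touching the caller, which is the property the protocol description emphasizes.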

If this is right

  • Tool use produces larger gains for cognition and perception than for execution-type capabilities.
  • A persistent gap remains in models' ability to recognize tool need, choose the right tool, execute it correctly, and compose tool chains.
  • The approach is validated across both simulation and real-world robot platforms.
  • Over 100 validated tools spanning perception, cognition, reasoning, and execution form a reusable base for future work.
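
Tool-chain composition, named in the list above, comes with ordering constraints. The paper's benchmark text gives the example that GraspPlanner can only plan a grasping path after ObjectDetector has localized the target. The sketch below shows one way such constraints could be checked with a topological sort. The dependency table and helper names are assumptions for illustration, and `GraspExecutor` is a hypothetical third stage that does not appear in the source.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each tool maps to the set of tools whose output or resulting state it needs.
# ObjectDetector -> GraspPlanner follows the paper's example; GraspExecutor
# is a hypothetical extra stage added for illustration.
DEPENDENCIES = {
    "ObjectDetector": set(),
    "GraspPlanner": {"ObjectDetector"},
    "GraspExecutor": {"GraspPlanner"},
}


def valid_order(dependencies):
    """Return one execution order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def chain_respects_dependencies(chain, dependencies):
    """Check that each tool in a proposed chain runs after its prerequisites."""
    done = set()
    for tool in chain:
        if not dependencies.get(tool, set()) <= done:
            return False
        done.add(tool)
    return True


order = valid_order(DEPENDENCIES)
```

A checker of this kind only verifies ordering. It says nothing about whether a chain is the right one for the task, which is the harder competence the benchmark targets.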

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The boundary between cognitive and execution gains suggests that future tool design should prioritize tighter integration with low-level controllers rather than treating execution as just another callable module.
  • If tool-invocation competence improves, the same externalization method could extend to longer-horizon tasks where unified models currently plateau.
  • The benchmark results imply that progress on tool-use reasoning may now be a higher-leverage research target than further scaling of monolithic embodied policies.

Load-bearing premise

Heterogeneous capabilities can be reliably split into separate tools that a model invokes at inference time without losing coherence or adding major overhead.

What would settle it

Running the same EB-ALFRED and EB-Navigation tasks with the full tool set and finding no net improvement over a unified baseline policy, or finding measurable coherence loss during dynamic tool calls, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.26637 by Guiyao Tie, Lichao Sun, Li Wan, Pan Zhou, Qianjiang Li, Xueyang Zhou, Yibo Hu, Yidan Liu, Yongchao Chen, Zijia Wang.

Figure 1: Overview of the EmbodiedTool.
Tool-augmented decision process. At each step $t$, the agent selects a tool $g_t \in \bar{Z} := Z \cup \{\bot\}$ (where $\bot$ denotes invoking no tool), queries it with a generated query $q_\theta(\tau_t, l, g_t)$, and conditions its action on the returned observation $y_t$. Formally, the three-stage transition is: $g_t \sim \mu_\theta(\cdot \mid \tau_t, l)$, $y_t := T_{g_t}(q_\theta(\tau_t, l, g_t))$, $a_t \sim \pi_\theta(\cdot \mid \tau_t, l, y_t)$ (Eq. 3). Bi-level optimization. Lear…
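
The three-stage transition quoted in the Figure 1 text (select a tool, query it, act on its output) can be sketched as a single step function. This is a minimal sketch under stated assumptions: the learned distributions are replaced by deterministic stand-ins, and every function and tool name below is hypothetical.

```python
NO_TOOL = None  # plays the role of the "invoke no tool" symbol


def step(history, instruction, tools, select_tool, make_query, policy):
    """One tool-augmented decision step: returns (tool, observation, action)."""
    g = select_tool(history, instruction)        # stage 1: choose a tool or NO_TOOL
    if g is NO_TOOL:
        y = None                                 # no tool output to condition on
    else:
        y = tools[g](make_query(history, instruction, g))  # stage 2: query the tool
    a = policy(history, instruction, y)          # stage 3: act given the tool output
    return g, y, a


# Deterministic stand-ins for the learned components (assumptions).
def toy_select(history, instruction):
    return "locate" if "find" in instruction else NO_TOOL


def toy_query(history, instruction, tool_name):
    return instruction.split()[-1]               # last word is treated as the target


def toy_policy(history, instruction, tool_output):
    if tool_output and "table" in tool_output:
        return "navigate_to_table"
    return "explore"


toy_tools = {"locate": lambda target: f"{target} is on the table"}

with_tool = step([], "find mug", toy_tools, toy_select, toy_query, toy_policy)
without_tool = step([], "wait", toy_tools, toy_select, toy_query, toy_policy)
```

Keeping `NO_TOOL` in the selection space makes "do nothing" an explicit choice, which is what lets tool-necessity recognition be evaluated separately from tool selection.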
Figure 2: Overview of the EmbodiedToolBench collection process.
Tool as a capability unit. ETP treats each embodied capability as a callable unit with a declared interface. Formally, a tool $z_m$ is characterized by its input–output spaces $(X_m, Y_m)$, a realized capability subset $C(z_m) \subseteq C$, and an executable mapping $f_m(\cdot; \phi_m) : X_m \to Y_m$. This interface contract separates capability from implementation: $f_m$ can be instantia…
Figure 3: Examples of the real-world robot tasks.
Figure 4: Distribution of EmbodiedToolBench across capability dimensions.
Figure 5: Error analysis on EmbodiedBench.
Section 5, Related Work: Embodied agents require heterogeneous capabilities, including perception, reasoning, planning, control, memory, and adaptation, to operate in open-world environments [23, 37, 42, 50]. Prior work enhances these capabilities through hierarchical decision-making [1, 17, 39], scene and spatial representations [5, 7, 18, 35, 44], and integrated policy, planning, a…
Figure 6: Embodied tool embedding visualization.
Distractor Difficulty Control. To increase the difficulty of negative instances, we deliberately introduce tool-inducing negative samples: these samples contain task descriptions with keywords strongly associated with tool functionality (e.g., “detect”, “grasp”, “navigate to”), yet based on the current observation and interaction history, no tool invocation is actuall…
Figure 7 summarizes the composition of our collected tool pool. In total, our collection comprises 112 tools distributed across four macro capability groups, spanning the full pipeline of an embodied agent: perception and grounding (36 tools), cognition and state modeling (25), reasoning and planning (27), and execution and control (24). Crucially, no single stage dominates the collection. The four groups are broad…
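
The group counts quoted in the Figure 7 text can be checked with a few lines of arithmetic. The snippet below uses only the four counts stated above; the percentage shares are derived here and are not reported in the source.

```python
# Tool counts per macro capability group, as quoted in the Figure 7 text.
counts = {
    "perception and grounding": 36,
    "cognition and state modeling": 25,
    "reasoning and planning": 27,
    "execution and control": 24,
}

total = sum(counts.values())
# Share of the pool held by each group, in percent, rounded to one decimal.
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}
largest_group = max(counts, key=counts.get)
```

The counts add up to the stated total of 112, and the largest group (perception and grounding) holds roughly a third of the pool, which is consistent with the statement that no single stage dominates.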
Figure 8 further illustrates the dataset distribution across difficulty levels and embodied environments. The difficulty profiles vary meaningfully across tasks, reflecting the distinct cognitive demands of each. Tool-Need Recognition is heavily skewed toward easy instances (86%), consistent with its binary decision nature. In contrast, Tool Selection achieves a well-balanced distribution across easy, medium, and h…
Figure 9: Overview of the high-performance robotic hardware platform.
We conduct real-world experiments on a tabletop robotic manipulation platform equipped with a 6-DoF bus-servo robotic arm, a parallel robotic gripper, and a Gemini Plus RGB-D camera for visual and depth perception. The camera provides synchronized RGB and depth observations within a working range of 0.25–2.5 m, enabling object localization and s…
Figure 10: Illustrative example from the tools.
Figure 11: Illustrative example from the tools.
Figure 12: Illustrative example from the tools.
Figure 13: Illustrative example from the tools.
Figure 14: Illustrative examples from the tool-awareness evaluation.
Figure 15: Illustrative examples from the tool-selection evaluation.
Figure 16: Illustrative examples from the tool-usage evaluation.
Figure 17: Illustrative examples from the tool-chain composition evaluation.
Figure 19: Illustrative examples from the tool-call evaluation.
Figure 20: Illustrative examples from the tool-call evaluation.
I Prompts. This section provides the prompt templates used by the three experimental environment modules in EmbodiedToolBench: EB-Habitat, EB-ALFRED, and EB-Navigation. The purpose is to document the exact agent instructions used in our experiments, including action-space assumptions, planning constraints, tool-use rules, history and feedback conditionin…
Original abstract

Most existing embodied intelligence methods formulate perception, reasoning, planning, and control within a unified parameterized policy. Yet these capabilities are inherently hierarchical and heterogeneous, making them difficult to reliably learn and modularize within a single model. We propose a capability externalization approach that decouples heterogeneous capabilities into independently optimized tools, dynamically invoked at inference time. To this end, we introduce Embodied Tool Protocol (ETP), a standardized protocol for embodied tool registration, discovery, invocation, and execution, and curate 100+ validated tools spanning perception, cognition, reasoning, and execution as the tool base. Building on this, we construct EmbodiedToolBench to evaluate both whether tool augmentation improves embodied performance and how well current models use tools across tool-necessity recognition, tool selection, tool execution, and tool-chain composition. Experiments across simulation and real-world platforms confirm that capability externalization consistently improves embodied performance (avg. gain 31% on EB-ALFRED and 36% on EB-Navigation), yet reveal a clear boundary: gains are substantial for cognition and perception but are limited for execution-type capabilities. Moreover, our analysis reveals that knowing when, which, and how to invoke tools remains a persistent challenge across all models, thereby highlighting embodied tool competence as a critical direction for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by decoupling heterogeneous embodied capabilities into independently optimized tools using the Embodied Tool Protocol (ETP) and curating 100+ tools, embodied performance can be consistently improved, with average gains of 31% on EB-ALFRED and 36% on EB-Navigation. Gains are substantial for cognition and perception but limited for execution-type capabilities. The paper introduces EmbodiedToolBench to evaluate tool use in terms of necessity recognition, selection, execution, and tool-chain composition, and notes that knowing when, which, and how to invoke tools remains a challenge.

Significance. If the results are substantiated, this work could be significant for advancing modular embodied AI by showing the benefits of capability externalization. The introduction of a standardized protocol (ETP) and a new benchmark (EmbodiedToolBench) are valuable for the field, enabling future research on tool-augmented systems. The empirical distinction between capability types where externalization helps is a useful finding.

major comments (2)
  1. [Abstract] The reported average gains of 31% on EB-ALFRED and 36% on EB-Navigation are presented without error bars, dataset details, baseline comparisons, or tool validation procedure, which is load-bearing for the central claim of consistent improvement from tool externalization.
  2. [Abstract] The assumption that heterogeneous capabilities can be decoupled into independently optimized tools dynamically invoked at inference without significant integration overhead or loss of coherence lacks quantitative validation of tool independence or invocation overhead; this is central to the proposed approach.
minor comments (2)
  1. The abstract refers to 'simulation and real-world platforms' without specifying which ones; this should be detailed in the experiments section for clarity.
  2. Ensure consistent use of terms like 'Embodied Tool Protocol (ETP)' and 'EmbodiedToolBench' throughout the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of capability externalization via ETP and EmbodiedToolBench. We address each major comment below with targeted revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The reported average gains of 31% on EB-ALFRED and 36% on EB-Navigation are presented without error bars, dataset details, baseline comparisons, or tool validation procedure, which is load-bearing for the central claim of consistent improvement from tool externalization.

    Authors: The full manuscript reports error bars in Tables 2 and 3, dataset splits and sizes in Section 4.1, baseline comparisons (including ablations) in Section 5.2, and tool validation criteria in Appendix B. The abstract is intentionally concise, but we agree it should better contextualize the central claim. We will revise the abstract to include a parenthetical note on statistical reporting and direct readers to the relevant sections.
    Revision: yes

  2. Referee: [Abstract] The assumption that heterogeneous capabilities can be decoupled into independently optimized tools dynamically invoked at inference without significant integration overhead or loss of coherence lacks quantitative validation of tool independence or invocation overhead; this is central to the proposed approach.

    Authors: The end-to-end gains on EB-ALFRED, EB-Navigation, and real-world platforms (Section 6) already demonstrate practical feasibility of dynamic invocation. Tool independence is evidenced by the modular ETP design allowing arbitrary tool substitution without base-model retraining. We acknowledge the value of explicit overhead metrics and will add a short quantitative analysis of invocation latency and coherence (drawn from our existing logs) in a new paragraph of Section 3.3.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks and tool curation, not self-definition

full rationale

The paper introduces a protocol and tool base, then reports measured performance deltas on EB-ALFRED and EB-Navigation. No equations, fitted parameters, or derivations appear; the central claim (gains from externalization) is presented as an experimental outcome rather than a quantity defined in terms of itself or recovered from self-citations. Tool independence and invocation overhead are asserted as design goals but are not used to construct the reported numbers, leaving the evaluation self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the protocol and benchmark names are stated. Tool independence and dynamic invocation are implicit assumptions without supporting evidence in the provided text.

invented entities (2)
  • Embodied Tool Protocol (ETP): no independent evidence
    purpose: Standardized protocol for embodied tool registration, discovery, invocation, and execution
    Introduced as new in abstract; no independent evidence or prior citation provided.
  • EmbodiedToolBench: no independent evidence
    purpose: Benchmark to evaluate tool augmentation and model tool-use competence
    Curated for this work; no external validation mentioned.



Reference graph

Works this paper leans on

117 extracted references · 11 canonical work pages · 2 internal anchors

  [1] Ahn et al. Do As I Can, Not As I Say: Grounding language in robotic affordances. In Conference on Robot Learning (CoRL), pages 287–318, 2022.
  [2] Ao et al. LLM-as-BT-Planner: Leveraging llms for behavior tree generation in robot task planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1233–1239, 2024.
  [3] Bhat et al. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv, 2023. doi: 10.48550/arXiv.2302.12288
  [4] Black et al. π0: A vision-language-action flow model for general robot control. arXiv, 2024.
  [5] Booker et al. EmbodiedRAG: Dynamic 3d scene graph retrieval for efficient and scalable robot task planning. arXiv, 2024.
  [6] Brohan et al. Octo: An open-source generalist robot policy. arXiv, 2024.
  [7] Chen et al. Open-vocabulary queryable scene representations for real world planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 11509–11522, 2023.
  [8] Chen et al. AutoTAMP: Autoregressive task and motion planning with llms as translators and checkers. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 6695–6702, 2024.
  [9] Cheng et al. Putting the object back into video object segmentation. arXiv, 2023. doi: 10.48550/arXiv.2310.12982
  [10] Cheng et al. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16901–16911, 2024.
  [11] Chi et al. Diffusion Policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023.
  [12] Doersch et al. TAPIR: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10061–10072, 2023.
  [13] Driess et al. PaLM-E: An embodied multimodal language model. arXiv, 2023.
  [14] Fang et al. AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023. doi: 10.48550/arXiv.2212.08333
  [15] Guan et al. RoboNeuron: A modular framework linking foundation models and ros for embodied ai. arXiv, 2025.
  [16] Hornung et al. OctoMap: An efficient probabilistic 3d mapping framework based on octrees. Autonomous Robots, 34:189–206, 2013. doi: 10.1007/s10514-012-9321-0
  [17] Huang et al. Inner Monologue: Embodied reasoning through planning with language models. arXiv, 2022.
  [18] Huang et al. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615, 2023.
  [19] Huang et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use. arXiv, 2023.
  [20] Hughes et al. Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. arXiv, 2022. doi: 10.48550/arXiv.2201.13360
  [21] Ji et al. Action Genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10236–10247, 2020.
  [22] Kim et al. LINGO-Space: Language-conditioned incremental grounding for space. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10314–10322, 2024. doi: 10.1609/aaai.v38i9.28898
  [23] Liang et al. Large Model Empowered Embodied AI: A survey on decision-making and embodied learning. arXiv, 2025.
  [24] Liu et al. LLM+P: Empowering large language models with optimal planning proficiency. arXiv, 2023.
  [25] Liu et al. Lang2LTL: Translating natural language commands to temporal specification with large language models. arXiv, 2023.
  [26] Ma et al. A survey on vision-language-action models for embodied ai. arXiv, 2024.
  [27] Ma et al. Orchestrating embodied systems through the embodied context protocol: Motivation, progress, and directions. Research, 2025.
  [28] Macenski et al. The Marathon 2: A navigation system. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020. URL https://github.com/ros-planning/navigation2
  [29] Mon-Williams et al. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, 7:592–601, 2025.
  [30] Mower et al. ROS-LLM: A ros framework for embodied ai with task feedback and structured reasoning. arXiv, 2024.
  [31] Nair et al. R3M: A universal visual representation for robot manipulation. In Conference on Robot Learning (CoRL), 2022. doi: 10.48550/arXiv.2203.12601
  [32] Nguyen et al. GigaPose: Fast and robust novel object pose estimation via one correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9903–9913, 2024.
  [33] Oh et al. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9226–9235, 2019.
  [34] Qu et al. Tool learning with large language models: A survey. Frontiers of Computer Science, 2025.
  [35] Rana et al. SayPlan: Grounding large language models using 3d scene graphs for scalable task planning. In Conference on Robot Learning (CoRL), pages 23–72, 2023.
  [36] Royce et al. Enabling Novel Mission Operations and Interactions with ROSA: The robot operating system agent. In IEEE Aerospace Conference, pages 1–16, 2024.
  [37] Salimpour et al. Towards embodied agentic ai: Review and classification of llm- and vlm-driven robot autonomy and interaction. arXiv, 2025.
  [38] Schick et al. Toolformer: Language models can teach themselves to use tools. arXiv, 2023.
  [39] Singh et al. ProgPrompt: Generating situated robot task plans using large language models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2023.
  [40] Song et al. RoboSpatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. arXiv, 2024.
  [41] Sundermeyer et al. Contact-GraspNet: Efficient 6-dof grasp generation in cluttered scenes. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 13438–13444, 2021. doi: 10.1109/ICRA48506.2021.9561877
  [42] Wang et al. Large Language Models for Robotics: Opportunities, challenges, and perspectives. arXiv, 2024.
  [43] Wang et al. DUSt3R: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.
  [44] Werby et al. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. arXiv, 2024.
  [45] Wu et al. SceneGraphFusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7515–7525, 2021.
  [46] Wu et al. Grounding Generative Planners in Verifiable Logic: A hybrid architecture for trustworthy embodied ai. arXiv, 2026.
  [47] Yamazaki et al. Open-Fusion: Real-time open-vocabulary 3d mapping and queryable scene representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. doi: 10.1109/ICRA57147.2024.10610193
  [48] Yao et al. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
  [49] Yin et al. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11784–11793, 2021.
  [50] Zeng et al. Large Language Models for Robotics: A survey. arXiv, 2023.
  [51] Zhao et al. Fast segment anything. arXiv, 2023. doi: 10.48550/arXiv.2306.12156
  [52] Liyu Hou, Linyuan Gao, Yuan Wu, and Yi Chang. A survey on evaluation of embodied ai. TechRxiv Preprint, 2026. doi: 10.22541/au.177023340.02874343/v1
  [53] MoveIt Contributors. Moveit motion planning framework. https://moveit.ai/, 2024. ROS-based motion planning framework for robotic manipulation.
  [54] OpenVLA Team. OpenVLA: An open vision-language-action model. arXiv, 2024.
  [55] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, and Tong Zhang. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Proceedings of the 42nd International Conference on Machine Learn...
  56. [56]

    Directly solvable states, where the model can produce a valid action based solely on the current observationo t and interaction historyτ t, without resorting to any external tool

  57. [57]

    detect”, “grasp

    Tool-redundant states, where the candidate tool set L is non-empty, yet no tool invocation is required at the current stage of the task. Class Balance.To prevent systematic prediction bias, we maintain a positive-to-negative sample ratio of 1:1 and apply stratified sampling across task types (navigation, planning, and manipulation), ensuring that each sce...

  58. [58]

    Data dependencies: the input parameters of tool zb are derived from the output of tool za, forming a direct dataflow dependency

  59. [59]

    w/ tool” over “w/o tool

    State dependencies: the preconditions of zb require the environmental state produced by the execution of za (e.g., GraspPlanner can only plan a grasping path after ObjectDetector has successfully localized the target). All constraints are formally verified to ensure their logical necessity and completeness. Candidate Tool Set Construction (Distractor Stra...

  60. [61]

    If an object is not visible, use Navigation to locate the object or its likely receptacle before attempting other operations

  61. [62]

    Do not perform actions that violate the validity rules

    Match every action name with its corresponding action id. Do not perform actions that violate the validity rules

  62. [63]

    If previous actions did not lead to success, revise the plan

    Do not repeatedly execute the same action or action sequence. If previous actions did not lead to success, revise the plan

  63. [64]

    Explore alternative instances when needed

    Multiple instances may appear with numeric suffixes, e.g., cabinet 2 or cabinet 3. Explore alternative instances when needed

  64. [65]

    Use interaction history and environment feedback to refine the current plan. If the last action failed, reflect on the failure reason and adjust the plan

  65. [66]

    When visual evidence is ambiguous, when the target is small or occluded, or when spatial relations are needed, you may use tools such as habitat_toolchain, scene_graph, or yolo_world. Tool outputs are auxiliary evidence only

  66. [67]

    Do not output bounding boxes, coordinates, scene-graph nodes, object ids, or raw tool payloads as the final executable plan. After tool use, translate the tool evidence into legal Habitat action ids. Output Format You are supposed to output exactly one JSON object and no surrounding markdown. The output JSON format should be: { "visual_state_description":...

  67. [68]

    Avoid generating an empty plan. Each plan should include no more than 20 actions

  68. [69]

    Always locate a visible object using the Find action before interacting with it

  69. [70]

    Match every action name with its corresponding action id. For receptacle placement, prefer Put down rather than Drop

  70. [71]

    Do not repeatedly execute the same action or sequence of actions. If previous actions do not lead to success, modify the plan

  71. [72]

    Multiple instances may appear with suffixes, e.g., Cabinet_2 or Cabinet_3. Explore alternative instances if the desired object is not found

  72. [73]

    Use history and feedback to identify missing preconditions, such as opening a receptacle, turning on an appliance, or picking up a tool before slicing

  73. [74]

    When the task involves small objects, object attributes, container contents, multiple object instances, or uncertain placement, you may use tools such as alfred_action_advisor, yolo_world, or visual object-tagging tools. Tool outputs are auxiliary evidence only

  74. [75]

    Do not echo tool coordinates, masks, boxes, center points, foreground pixels, or raw detector outputs as the final plan. Translate tool results into legal EB-ALFRED action ids. Output Format You are supposed to output exactly one JSON object and no surrounding markdown. The output JSON format should be: { "visual_state_description": string, "reasoning_and...

  75. [76]

    Locate the target object type. Clearly describe the spatial location of the target object in the observation, such as front-left, front-right, nearby, or far away

  76. [77]

    Use forward and lateral motion as the main strategy. A reachable point can usually be approached through a combination of moving forward, left, and right

  77. [78]

    Consider obstacles before moving. If the forward path is blocked, choose the safest local adjustment

  78. [79]

    Use rotation sparingly. Rotate only when the target is not visible or when orientation must be recovered. Once the target appears, avoid unnecessary rotations

  79. [80]

    Do not stop too early. Continue moving closer until the robot cannot make additional safe progress toward the target

  80. [81]

    Do not rely solely on blind exploration. If the target is invisible, the robot is stuck, or the route is ambiguous, use tools such as navigation_action_advisor, scene_graph, or query_3d_scene_graph as GPS-like guidance

Showing first 80 references.
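
The data and state dependencies described in the entries above (the second tool needs either the output or the resulting environment state of the first) can be read as ordering constraints on a tool chain. The following is a minimal sketch of such a check, not the paper's verification procedure: the function name `violated_dependencies` and the edge representation are assumptions made for illustration, and only the tool names ObjectDetector and GraspPlanner come from the text.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation: an edge (z_a, z_b) means z_b depends on z_a,
# either through data (z_a's output feeds z_b) or through environment state.
Dependency = Tuple[str, str]


def violated_dependencies(
    chain: List[str], deps: Iterable[Dependency]
) -> List[Dependency]:
    """Return every (z_a, z_b) edge that the given tool chain fails to respect."""
    # Record the first position at which each tool is invoked.
    first_pos: Dict[str, int] = {}
    for index, tool in enumerate(chain):
        first_pos.setdefault(tool, index)

    violations: List[Dependency] = []
    for z_a, z_b in deps:
        if z_b not in first_pos:
            continue  # z_b is never invoked, so this edge imposes nothing
        # z_a must have been invoked before the first call to z_b.
        if z_a not in first_pos or first_pos[z_a] > first_pos[z_b]:
            violations.append((z_a, z_b))
    return violations


# Example edge taken from the text: GraspPlanner requires a prior ObjectDetector call.
deps = [("ObjectDetector", "GraspPlanner")]
print(violated_dependencies(["ObjectDetector", "GraspPlanner"], deps))
print(violated_dependencies(["GraspPlanner", "ObjectDetector"], deps))
```

Under these assumptions, an empty result means the chain is consistent with the listed edges; it says nothing about whether the individual tool calls would succeed.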